Lecture 2: Dynamic Programming ¶
Antoine Chapel (Sciences Po & PSE) ¶
Alfred Galichon's math+econ+code prerequisite class on numerical optimization and econometrics, in Python ¶
Class content by Antoine Chapel. Past and present support from Alfred Galichon's ERC grant CoG-866274 is acknowledged, as well as inputs from contributors listed here. If you reuse material from this class, please cite as:
Antoine Chapel, 'math+econ+code' prerequisite class on numerical optimization and econometrics, January 2023
References¶
- A first course in Optimization theory (Sundaram, R.), 1996
- Further Mathematics for Economic Analysis 2nd Ed (Sydsaeter et al), 2008
- Dynamic Programming and Optimal Control 4th Ed (Bertsekas), 2018 (Advanced)
Introduction and motivation¶
A dynamic programming problem is an optimization problem in which decisions are taken sequentially over several time periods (Sundaram).
Typically, a dynamic optimization problem arises when actions today have consequences tomorrow. You have probably solved dynamic programming problems already, at least in problems with a finite horizon.
The specificity of DP problems is that they are set up in discrete time. Time is indexed by $t = 1, 2, 3, \ldots, T$ (with $T = \infty$ in the infinite-horizon case). Two main subtypes of DP problems exist:
- Finite-Horizon Dynamic Programming (FHDP), in which time stops at $T < \infty$
- Infinite-Horizon/Stationary Dynamic Programming (SDP), in which the agent plans as if she were immortal.
Many interesting problems (such as matching) have a dynamic counterpart. Modelling, solving, or estimating such problems is in itself a technical and intellectual challenge. You should have a good understanding of dynamic programming to tackle any graduate macroeconomics class.
Outline¶
- FHDP: general setup of the problem
- How to solve an FHDP
- SDP: general setup
Here is the general form of an FHDP problem:
\begin{align} \max_{\{u_t\}_{t=0}^T \in U} & \, \sum_{t=0}^T f(x_t, u_t) \\ s.t: & \ x_{t+1} = g(x_t, u_t) \\ & \ x_0 \hspace{5pt} \text{given} \end{align}
$x_t$ is the state variable. A key rule of DP is that the state should contain all the information available and necessary for the agent to make the optimal decision. In the simple problems we will tackle here, $x_t$ will be a real number. $x_0$ is given: it is the initial state.
$u_t$ is the decision, called the control of the agent. It belongs to a set $U$. For example, in the consumption problem we are about to describe, $u$ represents consumption, and it cannot be higher than the total amount of resources available (excluding credit, of course).
$g$ is the transition function: taking the state $x_t$ and the decision $u_t$ at time $t$ as inputs, it outputs the new state $x_{t+1}$.
$f$ is the objective function. In economics, it is typically some measure of utility, profit, or welfare in general. It depends on time, the state, and the decision taken by the agent.
Example: dynamic consumption¶
This is the "Hello World" of Dynamic Programming for economists. $x_t$ is the agent's wealth at time $t$. The control $u_t$ is the proportion of wealth that the agent consumes: $c_t = u_t x_t$. The remaining $(1 - u_t) x_t$ is invested and earns interest, yielding $x_{t+1} = \rho (1 - u_t) x_t$.
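To make the transition concrete, here is a minimal simulation of the wealth dynamics above. The numbers ($\rho$, a constant consumption share, and the initial wealth) are illustrative assumptions, not values from the lecture:

```python
# Sketch of the consumption transition x_{t+1} = rho * (1 - u_t) * x_t.
# rho, u, and x below are hypothetical values chosen for illustration.
rho = 1.05   # gross interest rate (hypothetical)
u = 0.3      # constant consumption share (hypothetical)
x = 100.0    # initial wealth x_0 (hypothetical)
for t in range(5):
    c = u * x                # consumption: c_t = u_t * x_t
    x = rho * (1 - u) * x    # transition: x_{t+1} = rho * (1 - u_t) * x_t
```

With these numbers, wealth shrinks every period since $\rho(1-u) = 0.735 < 1$: the agent consumes faster than his investments grow.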
How to solve an FHDP problem¶
Given the assumption that $x_t$ contains all necessary information, it should be possible, at any $t$, to define $V(x_t)$, the value of being in state $x_t$ at time $t$. Suppose that tomorrow, at $t=s+1$, we know what the optimal control $u_t^*$ should be for $t = s+1, s+2,...,T$. Then, we could write: \begin{align} V(x_{s+1}) = \sum_{t=s+1}^T f(x_t, u_t^*) \end{align}
Now, let's go back one day: today, at $t=s$. If the agent knows $V(x_{s+1})$ and the transition function $x_{s+1} = g(x_{s}, u_{s})$, the only thing the agent has to do is to find the control $u_{s}$ that maximizes $$f(x_{s}, u_{s}) + V(g(x_{s}, u_{s}))$$
Finding this optimal control today is a static problem: it only depends on present variables. The agent tries to find the control that maximizes his welfare today and brings him to the most favourable state tomorrow (the one where the value function $V$ is highest).
Then, today, if the agent knows $V(x_{s+1})$ and $\{u_t^*\}_{s+1}^T$, he can determine $u_s^*$. He can then compute $V(x_s)$, which allows him to go one more day back, determine the optimal control there, and so on. This is the fundamental equation of Dynamic Programming, which allows us to write sequentially what would otherwise be a difficult problem:
\begin{align} V(x_t) = \max_{u_t \in U} \Big[ f(x_t, u_t) + V( x_{t+1} ) \Big] \end{align}
To solve this type of problem, you start from the last point in time: at $t=T$ there is no tomorrow, so the agent just optimizes for today: $V(x_T) = \max_{u_T \in U} f(x_T, u_T)$. Then, you go one day back and solve for $u_{T-1}^*$ by optimizing as shown. This process of moving backwards in time to solve a sequential problem is called backward induction.
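The backward pass can be sketched numerically on discrete grids. The payoff $f$, transition $g$, horizon, and grids below are illustrative placeholders, not part of the lecture's setup:

```python
import numpy as np

# Illustrative three-period problem (all choices below are hypothetical):
# payoff f(x, u) = x - u^2, deterministic transition g(x, u) = x + u.
T = 3
states = np.linspace(0.0, 2.0, 21)     # grid step 0.1
controls = np.linspace(0.0, 1.0, 11)   # grid step 0.1

def f(x, u):
    return x - u**2

def g(x, u):
    return x + u

V = np.zeros((T + 1, len(states)))     # V[T] = 0: nothing after the last period
policy = np.zeros((T, len(states)))
for t in range(T - 1, -1, -1):         # backward in time: t = T-1, ..., 0
    for i, x in enumerate(states):
        # today's payoff plus the value of the state reached tomorrow;
        # np.interp evaluates V[t+1] between (and clamps outside) grid points
        vals = [f(x, u) + np.interp(g(x, u), states, V[t + 1]) for u in controls]
        V[t, i] = max(vals)
        policy[t, i] = controls[int(np.argmax(vals))]
```

After the loop, `policy[t, i]` stores the optimal control at each date and grid state, exactly the object backward induction produces.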
Exercise: solve the following FHDP problem (taken from Sydsaeter et al): \begin{align} \max_{u} \ & \sum_{t=0}^2 (1 + x_t - u_t^2), \\ s.t.: \ & x_{t+1} = x_t + u_t, \\ & x_0 = 0, \\ & u_t \in \mathbb{R} \end{align}
$t=2: \hspace{10pt} V(x_2) = \max_{u_2} 1 + x_2 - u_2^2$
Clearly, $u_2^* = 0$. So, $V(x_2) = 1 + x_2$
$t=1: \hspace{10pt} V(x_1) = \max_{u_1} 1 + x_1 - u_1^2 + V(x_2)$
$ \hspace{60pt} = \max_{u_1} 1 + x_1 - u_1^2 + 1 + x_2$
$ \hspace{60pt} = \max_{u_1} 1 + x_1 - u_1^2 + 1 + x_1 + u_1$
FOC: $-2u_1 + 1 = 0 \Leftrightarrow u_1^* = \frac{1}{2}$ (Concavity is trivial)
$\Rightarrow V(x_1) = 1 + x_1 - \frac{1}{4} + 1 + x_1 + \frac{1}{2} = 2 x_1 + \frac{9}{4}$
$t=0: \hspace{10pt} V(x_0) = \max_{u_0} 1 + x_0 - u_0^2 + 2x_1 + \frac{9}{4}$
$ \hspace{60pt} = \max_{u_0} 1 + x_0 - u_0^2 + 2(x_0 + u_0) + \frac{9}{4}$
FOC: $-2u_0 + 2 = 0 \Leftrightarrow u_0^* = 1$ (Concavity is trivial again)
So, given the initial state $x_0=0$, the optimal sequence of actions and states is the following:
- $t=0$: $x_0=0$, $u_0^* = 1$
- $t=1$: $x_1=1$, $u_1^* = \frac{1}{2}$
- $t=2$: $x_2=\frac{3}{2}$, $u_2^* = 0$
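As a sanity check, the plan found by backward induction can be compared numerically with nearby alternatives by evaluating the objective directly (a quick sketch, not part of the original exercise):

```python
# Evaluate sum_{t=0}^{2} (1 + x_t - u_t^2) with x_{t+1} = x_t + u_t, x_0 = 0.
def payoff(controls):
    x, total = 0.0, 0.0
    for u in controls:
        total += 1 + x - u**2   # per-period payoff
        x += u                  # transition to the next state
    return total

best = payoff([1.0, 0.5, 0.0])  # the plan found above; total payoff 4.25
```

Perturbed plans such as `payoff([1.0, 0.5, 0.1])` or `payoff([0.9, 0.5, 0.0])` come out strictly lower, consistent with the first-order conditions.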
Stationary Dynamic Programming¶
Conceiving of dynamic programming at an infinite horizon requires a bit of a conceptual jump from finite-horizon dynamic programming.
First, why is it called "stationary"? If an agent lives forever, time itself carries no strategic information: the problem looks the same at every date. The agent therefore follows a fixed consumption rule that leads him towards an optimal consumption level and, in a way, keeps him there.
General setup¶
\begin{align} \max_{\{u_t\}_{t=0}^{\infty} \in U} & \, \sum_{t=0}^{\infty} \beta^t f(x_t, u_t) \\ s.t: & \ x_{t+1} = g(x_t, u_t) \\ & \ x_0 \hspace{5pt} \text{given} \end{align}
This is very close to the finite setting. The modification is that the agent now optimizes an infinite sum. To prevent this infinite sum from diverging to $+\infty$, future flows of utility are discounted with $\beta \in (0,1)$. This way, if $\beta=0.8$, utility today is weighted $\beta^0 = 1$, utility tomorrow $\beta^1 = 0.8$, the day after tomorrow $\beta^2 = 0.64$, etc. The higher $\beta$, the more patient the agent is.
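The weights from the example ($\beta = 0.8$) can be checked directly, and the geometric series also shows why discounting keeps the objective finite:

```python
# Geometric discount weights for beta = 0.8 (the value used in the text).
beta = 0.8
weights = [beta**t for t in range(4)]   # 1, 0.8, 0.64, 0.512 (up to rounding)

# The infinite sum of weights converges to 1 / (1 - beta) = 5, so any
# bounded flow of per-period utility has a finite discounted sum.
total_weight = 1 / (1 - beta)
```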
Since the behaviour is stationary, the goal of SDP is to find a fixed policy function $h$, a function that takes the state as input, and returns a value for the control. The goal is that, once this policy function is known, the whole system can be simulated with the policy function and the transition function: $u_t = h(x_t)$ and $x_{t+1} = g(x_t, u_t)$, starting from any $x_0$ given.
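Once $h$ and $g$ are known, simulating the whole system is mechanical. Here is a minimal sketch; both the policy and the transition below are hypothetical placeholders, chosen only to make the loop concrete:

```python
def h(x):
    # hypothetical policy: consume half of the current state
    return 0.5 * x

def g(x, u):
    # hypothetical transition: reinvest the rest at a 5% gross return
    return 1.05 * (x - u)

x = 10.0          # some given initial state x_0
path = [x]
for t in range(20):
    u = h(x)      # u_t = h(x_t)
    x = g(x, u)   # x_{t+1} = g(x_t, u_t)
    path.append(x)
```

The same two-line loop body works for any policy and transition, which is why finding $h$ is the whole problem.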
Rewriting the SDP problem as a sequential problem:¶
Let us suppose that we are at time $t = 0$. The problem that the agent has to solve is the following:
\begin{align} V(x_0) = \max & \, \sum_{t=0}^{\infty} \beta^t f(x_t, u_t) \\ = \max & \, f(x_0, u_0) + \sum_{t=1}^\infty \beta^t f(x_t, u_t) \\ = \max & \, f(x_0, u_0) + \beta \sum_{t=1}^\infty \beta^{t-1} f(x_t, u_t) \\ = \max & \, f(x_0, u_0) + \beta V(x_1) \end{align}
Since the problem is stationary, the time indexes do not matter: we can write $x$ for any $x_t$ and $x'$ for any $x_{t+1}$:
\begin{align} V(x) = \max_{u \in U} f(x, u) + \beta V(x') \end{align}
Value Function Iteration¶
Solving this sort of problem requires a numerical procedure called Value Function Iteration (VFI). It is instrumental in macro models such as DSGE models. Here, we will see the non-stochastic version. Suppose you want to solve the following problem:
\begin{align} \max_{\{c_t\}_{t=0}^{\infty}} & \, \sum_{t=0}^{\infty} \beta^t u(c_t) \\ s.t: & \ k_{t+1} + c_{t} = f(k_t) \\ & \ k_0 \hspace{5pt} \text{given} \end{align}
At every period, the agent "looks" at his stock of capital and decides how much to consume based on it. So, the "state" is capital $k$. The problem can be written recursively as follows:
\begin{align} V(k_0) = \max_{0 < k_1 < f(k_0)} \{ u\Big(f(k_0) - k_1\Big) + \beta V(k_1)\} \end{align}
Since the problem is infinite-horizon, the solution (if it exists) is stationary: what is valid for some $k_0$ and $k_1$ (taken arbitrarily as a starting point) must be valid for any $k$ and $k'$. Also, to have a specific functional form for $f$ and $u$ in mind, we rewrite the problem:
\begin{align} V(k) = \max_{0 < k' < Ak^\alpha} \{ \log(Ak^\alpha - k') + \beta V(k')\} \end{align}
import numpy as np
import matplotlib.pyplot as plt

# Grid and parameters
kmax = 30
precision = 200
kgrid = np.linspace(1, kmax, precision)   # grid of capital values
gk = np.linspace(1, kmax, precision)      # policy function: optimal k' for each k
Vk0 = np.ones(precision)                  # initial guess for the value function
A = 10
α = 0.5
β = 0.9
norm = 1e5
tol = 1e-5
maxiter = 1000
n_iter = 0

Vk = Vk0
while n_iter < maxiter and norm > tol:
    # value_array[i, iprim]: value of choosing k' = kgrid[iprim] in state k = kgrid[i]
    value_array = np.empty((precision, precision))
    for iprim, kprim in enumerate(kgrid):
        for i, k in enumerate(kgrid):
            c = A*(k**α) - kprim          # implied consumption
            if c > 0:
                value_array[i, iprim] = np.log(c) + β*Vk[iprim]
            else:
                value_array[i, iprim] = -np.inf   # infeasible choice
    # For each state k, pick the k' that maximizes the value
    Vkprim = np.empty(precision)
    for row in range(value_array.shape[0]):
        gk[row] = kgrid[int(np.argmax(value_array[row, :]))]
        Vkprim[row] = np.max(value_array[row, :])
    norm = np.max(np.abs(Vkprim - Vk))    # sup-norm distance between iterates
    Vk = Vkprim
    n_iter += 1
    print("iteration: ", n_iter, " norm: ", norm)
iteration:  1  norm:  3.884757641783117
iteration:  2  norm:  2.882083935801659
iteration:  3  norm:  2.5089908990608283
...
iteration:  120  norm:  1.0894927285676204e-05
iteration:  121  norm:  9.805434554976955e-06
kstar = kgrid[np.argmin(np.abs(gk - kgrid))]
plt.rcParams["figure.figsize"] = (10, 8)
plt.plot(kgrid, gk, label="Optimal k'")
plt.plot(kgrid, kgrid, label="45° line")
plt.vlines(kstar, ymin=0, ymax=30, label="Equilibrium value k*")
plt.title("Value Function Iteration")
plt.legend()
plt.show()
How to interpret this? The blue line describes, for each $k_t$ on the horizontal axis, the optimal $k_{t+1}$. Where the two lines coincide is the steady state: it gives the equilibrium value of capital $k^*$. The algorithm we used, VFI, is a type of fixed-point algorithm. We started from an arbitrary guess $Vk0$ (which is not the blue line!) and computed present values pretending $Vk0$ was $V(k')$. Then, for every $k$, we looked at where the present values are highest (the optimization step) and replaced our guess $Vk0$ by that optimized $V(k')$. Of course it is still wrong, but if we repeat this procedure enough times, we can get arbitrarily close to the true $V(k')$.
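For this particular specification (log utility, $f(k) = Ak^\alpha$, full depreciation), a closed-form solution is known (a standard result, often attributed to Brock and Mirman): the optimal policy is $k' = \alpha\beta A k^\alpha$, which gives the steady state $k^* = (\alpha\beta A)^{1/(1-\alpha)}$. A quick sketch of the check, using the parameters from the code above:

```python
# Analytic steady state for the log / Cobb-Douglas case: the fixed point of
# the closed-form policy k' = alpha*beta*A*k^alpha.
A, alpha, beta = 10, 0.5, 0.9
kstar_analytic = (alpha * beta * A) ** (1 / (1 - alpha))
# With alpha = 0.5 this is 4.5**2 = 20.25, which the VFI fixed point
# on the grid should approximate.
```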
Dynamic Programming also has a stochastic extension, in which the state transition involves probabilities. In such settings, the number of possible states can very quickly explode and become intractable: that's called the curse of dimensionality. Algorithms from the Artificial Intelligence literature are strongly connected to the ones we presented here (think of the position of pieces on the chess board as being the state, and the control as being the move that can be made by the chess player).
If you are interested in these topics, take a look at Dimitri Bertsekas's website, and especially his work on Dynamic Programming.