Sequential Action and Beliefs — POMDP at the Center of DSGE

TL;DR

Why it matters. Most DSGE models let agents see the true state of the world, yet almost every real decision — capital valuation, climate, monetary policy — is made under partial observation with feedback from the agent’s own action.
What it proposes. Classify DSGEs along two axes (state observability × state-transition control) and position Partially Observable MDP (POMDP) as the generalization that nests MDP and HMM as special cases.
Headline result. A closed-form two-step solution: (i) an optimal action that is linear in point beliefs, and (ii) a belief update that is a fixed point of an operator mapping observables into beliefs — a modified Kalman filter under linear-Gaussian assumptions.
Limit. The whole construction leans on linearity, Gaussian shocks, and convergence of the belief covariance; regime shifts and non-Gaussian tails break the shortcut, and partially-observable controls are out of scope.
What to keep. The “perception shock” experiment — a 1 % mis-valuation of capital persists for 100+ periods — is a clean template for thinking about belief-driven misallocation in any agent system, including LLM agents that hold internal state they cannot directly verify.

Background

Dynamic Stochastic General Equilibrium (DSGE) models share a recursive Markov structure, but almost all of them assume agents observe the true state (capital, technology, output gap) perfectly. Kim (2012) borrows a taxonomy from AI and control theory to make this assumption visible:

| | State transition exogenous | State transition endogenous |
| --- | --- | --- |
| State observable | Markov Chain (MC) | Markov Decision Process (MDP): Lucas 1978, Blanchard & Gali 2007 |
| State hidden | Hidden Markov Model (HMM): Kydland & Prescott 1982, Lorenzoni 2009 | Partially Observable MDP (POMDP): Sargent 1991, Svensson & Woodford 2003/04 |

The paper argues the POMDP cell is where post-2007 macro actually lives: “trouble in capital valuation” (Gertler & Karadi 2009, Curdia 2008), signal-extraction from endogenous indicators, and central bankers who cannot see the output gap they themselves help shape.
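The two axes can be read as a simple lookup. The sketch below is only an illustration of the taxonomy; the names `ModelClass` and `classify` are hypothetical and do not come from the paper.

```python
from enum import Enum


class ModelClass(Enum):
    MC = "Markov Chain"
    MDP = "Markov Decision Process"
    HMM = "Hidden Markov Model"
    POMDP = "Partially Observable MDP"


def classify(state_observable: bool, transition_endogenous: bool) -> ModelClass:
    """Map the two axes of the taxonomy to one of the four model classes."""
    if state_observable:
        return ModelClass.MDP if transition_endogenous else ModelClass.MC
    return ModelClass.POMDP if transition_endogenous else ModelClass.HMM


# A central bank that cannot see the output gap it helps shape:
# hidden state, endogenous transition.
print(classify(state_observable=False, transition_endogenous=True).value)
```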

Core Idea

Under POMDP, estimation and control no longer separate — the agent’s action changes the very state she is trying to infer. The paper follows the AI/POMDP tradition (Sondik 1971, Kaelbling et al. 1998): treat the agent’s beliefs as her inner state, and solve for an action rule on that inner state plus a belief-update rule consistent with it.

For a linear rational-expectations system in stacked state $\theta_t = [k_t\ z_t]'$ with observation channel $y_t = -B_{yy}^{-1}B_{yc}c_t - B_{yy}^{-1}B_{y\theta}\theta_t$, the equilibrium is characterized by

$$
\underbrace{y_t}_{\text{what we observe}} = \underbrace{B_{\tilde{y}c} H_{c\theta}\, \theta_{t|t}}_{\text{what we believe}} + \underbrace{B_{\tilde{y}\theta}\, \theta_t}_{\text{what truly is}}.
$$

An equilibrium is a fixed point of the operator that reverses this relation, mapping observables back to beliefs in a Bayesian direction. Sargent’s (1991) fixed point is between perceived and actual laws of motion; Kim’s fixed point is instantaneous, between observations and beliefs, which relaxes the requirement that predictable parts and residuals coincide.
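The step from the observation channel to the observe/believe/truly-is relation is a single substitution. A short sketch, assuming the tilde coefficients denote the reduced-form loadings of the channel (this reading is not spelled out in the summary) and using the linear action rule $c_t = H_{c\theta}\,\theta_{t|t}$ from the first solution step:

```latex
\begin{aligned}
y_t &= -B_{yy}^{-1}B_{yc}\,c_t - B_{yy}^{-1}B_{y\theta}\,\theta_t \\
    &= \underbrace{\left(-B_{yy}^{-1}B_{yc}\right)}_{B_{\tilde{y}c}\ \text{(assumed)}} H_{c\theta}\,\theta_{t|t}
     + \underbrace{\left(-B_{yy}^{-1}B_{y\theta}\right)}_{B_{\tilde{y}\theta}\ \text{(assumed)}} \theta_t .
\end{aligned}
```

Because $\theta_{t|t}$ is itself formed from $y_t$, the relation cannot simply be read from right to left, which is why the equilibrium is stated as a fixed point.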

Method

3.1 Solution step 1 — optimal action as a function of beliefs

Conjecture $c_t = H_{c\theta}\, \theta_{t|t}$ (certainty-equivalent on point beliefs). Substitute into the optimality and transition equations, take conditional expectations, and match undetermined coefficients. Whenever the corresponding MDP problem has a Blanchard-Kahn / Klein solution, the same $H_{c\theta}$ satisfies the POMDP quadratic — the action rule survives the loss of observability.
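The matching step can be illustrated with a deliberately small case. The sketch below is a toy with a scalar, exogenous hidden state; the decision equation and the parameters $\gamma$ and $\rho$ are assumptions made here for illustration, not the paper's model.

```latex
\begin{aligned}
&\text{Assumed decision equation: } c_t = \beta\, E[c_{t+1} \mid I_t] + \gamma\, E[\theta_t \mid I_t],
  \qquad \theta_{t+1} = \rho\,\theta_t + \omega_{t+1}. \\
&\text{Conjecture: } c_t = H\,\theta_{t|t}. \\
&\text{Iterated expectations: } E[c_{t+1} \mid I_t] = H\,E[\theta_{t+1|t+1} \mid I_t] = H\,\theta_{t+1|t} = H\rho\,\theta_{t|t}. \\
&\text{Matching coefficients on } \theta_{t|t}: \quad H = \beta\rho H + \gamma
  \;\Rightarrow\; H = \frac{\gamma}{1-\beta\rho}, \qquad |\beta\rho| < 1.
\end{aligned}
```

Replacing $\theta_{t|t}$ with $\theta_t$ gives the same $H$ under full observation, which is the sense in which the action rule survives. In this toy the condition is linear in $H$ because the state is exogenous; with an endogenous state, as in the paper, the matching condition becomes quadratic.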

3.2 Solution step 2 — optimal sequential beliefs

Take the action rule as given and apply least mean-squared-error. Conjecture a Kalman-style update $\theta_{t|t} - \theta_{t|t-1} = \tilde{M}_t (y_t - y_{t|t-1})$, derive $\tilde{M}_t$ from MSE minimization, and require closed-loop consistency $\tilde{M}_t = M_t$. The result is a modified Kalman recursion:

$$
\begin{aligned} \theta_{t+1|t} &= (B_{\tilde\theta c} H_{c\theta} + B_{\tilde\theta\theta})\, \theta_{t|t} \\ \theta_{t|t} &= \theta_{t|t-1} + M_t (y_t - y_{t|t-1}) \\ \hat\Sigma_{t+1} &= B_{\tilde\theta\theta} \Sigma_t B_{\tilde\theta\theta}' + H_{\theta\omega}\Sigma_\omega H_{\theta\omega}' \\ \Sigma_t &= (I - M_t G_t')\, \hat\Sigma_t. \end{aligned}
$$

The modification relative to standard Kalman: the gain $M_t$ is coupled to the optimal action coefficients $H_{c\theta}$ through $G_t' = \{I + B_{\tilde{y}c} H_{c\theta}\,\mathbf{m}_t\} B_{\tilde{y}\theta}$ — inference and control no longer decouple.
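The covariance half of the recursion can be sketched numerically. This is a minimal sketch under assumptions made here rather than the paper's algorithm: there is no separate measurement noise, $\mathbf{m}_t$ is read as $(I - M_t a)^{-1} M_t$ with $a = B_{\tilde{y}c}H_{c\theta}$ and $b = B_{\tilde{y}\theta}$ (which is what substituting the belief update into the observation relation gives), and every matrix value is hypothetical.

```python
import numpy as np


def coupled_gain(Sigma_hat, a, b):
    """Gain M that is consistent with its own effect on the observation.

    a loads y on beliefs, b loads y on the true state. Assumes no separate
    measurement noise and that b @ Sigma_hat @ b.T is invertible.
    """
    # Standard Kalman gain for observing b @ theta without noise.
    K0 = Sigma_hat @ b.T @ np.linalg.inv(b @ Sigma_hat @ b.T)
    # Closed-loop consistency M = K0 (I - a M)  =>  (I + K0 a) M = K0.
    return np.linalg.solve(np.eye(K0.shape[0]) + K0 @ a, K0)


def effective_loading(M, a, b):
    """G' = (I + a m) b, with m = (I - M a)^{-1} M (our reading of m_t)."""
    m = np.linalg.solve(np.eye(M.shape[0]) - M @ a, M)
    return (np.eye(a.shape[0]) + a @ m) @ b


def iterate_covariance(B_theta, Q, a, b, tol=1e-10, max_iter=10_000):
    """Iterate the two covariance lines until the prior covariance settles."""
    n = B_theta.shape[0]
    Sigma_hat = np.eye(n)  # arbitrary positive-definite starting point
    for it in range(1, max_iter + 1):
        M = coupled_gain(Sigma_hat, a, b)
        G_t = effective_loading(M, a, b)
        Sigma = (np.eye(n) - M @ G_t) @ Sigma_hat      # posterior covariance
        Sigma_next = B_theta @ Sigma @ B_theta.T + Q   # next prior covariance
        if np.max(np.abs(Sigma_next - Sigma_hat)) < tol:
            return Sigma_next, it
        Sigma_hat = Sigma_next
    raise RuntimeError("covariance did not converge")


# Hypothetical two-state example (k, z) with one observable.
B_theta = np.array([[0.95, 0.10], [0.00, 0.90]])  # true-state part of the transition
Q = np.diag([0.0, 1e-4])                          # H_w Sigma_w H_w': only z is shocked
a = np.array([[0.20, 0.10]])                      # loading of y on beliefs
b = np.array([[0.36, 1.00]])                      # loading of y on the true state

Sigma_hat, n_iter = iterate_covariance(B_theta, Q, a, b)
print("iterations to convergence:", n_iter)
```

The gain is obtained by solving the consistency condition directly instead of iterating on $M_t$ inside each period; that shortcut is a design choice of this sketch and relies on the no-measurement-noise assumption.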

3.3 MDP and HMM as special cases

If $r_y = r_\theta$ with $H_{y\theta}$ full-rank, the state is instantaneously invertible from $y_t$ and POMDP collapses to MDP.
If only the exogenous shocks are hidden, $k_t = k_{t|t}$, and the recursion reduces to the standard Kalman filter of HMM.

So the POMDP solution is the unifier, not a separate object.
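The MDP collapse can be checked numerically in a small case. The sketch below assumes no measurement noise, reads $H_{y\theta}$ as the closed-loop loading $a + b$ of observables on the state when beliefs equal the truth, and uses hypothetical matrices with as many observables as states.

```python
import numpy as np

Sigma_hat = np.array([[0.04, 0.01], [0.01, 0.09]])  # any positive-definite prior covariance
a = np.array([[0.1, 0.0], [0.2, 0.1]])              # loading of y on beliefs (hypothetical)
b = np.array([[1.0, 0.5], [0.0, 1.0]])              # loading of y on the true state (square, invertible)
I2 = np.eye(2)

K0 = Sigma_hat @ b.T @ np.linalg.inv(b @ Sigma_hat @ b.T)  # no-noise Kalman gain, equals inv(b) here
M = np.linalg.solve(I2 + K0 @ a, K0)                       # gain consistent with belief feedback
m = np.linalg.solve(I2 - M @ a, M)                         # our reading of m_t
G_t = (I2 + a @ m) @ b                                     # effective loading G'
Sigma = (I2 - M @ G_t) @ Sigma_hat                         # posterior covariance

print(np.allclose(Sigma, 0.0))               # state is fully revealed by y
print(np.allclose(M, np.linalg.inv(a + b)))  # gain simply inverts the closed-loop loading
```

Both checks should print `True`: with a square, full-rank loading the posterior covariance vanishes, so beliefs coincide with the state and the action rule operates on the true state as in an MDP.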

Experiments

Kim calibrates the textbook Neoclassical growth model at quarterly frequency ($\alpha=0.36$, $\beta=0.99$, $\delta=0.025$, $\rho=0.99$) and builds three variants: MDP, HMM (transitory + permanent technology, observable capital), and POMDP (capital hidden as well). Three findings worth carrying:

Same action coefficients, different state vector. $H_{c\theta}$ numerically coincides across the three variants — the controls are identical functions, but of the true state under MDP, of beliefs about technology under HMM, and of beliefs about everything under POMDP.
Convergence is dramatically slower under POMDP. Reaching $10^{-7}$ on the covariance takes 12 iterations under HMM but 221 under POMDP with the same initial condition. At convergence, the two filters agree (output surprise is attributed 61.6 % to permanent, 38.4 % to transitory technology), but the transient path is very different.
Perception shock as a proxy for the pre-convergence regime. A 1 % positive mis-valuation of capital (the agent thinks she is richer than she is) triggers a consumption spree → capital decumulation → persistent undervaluation of permanent technology. Consumption takes more than 100 periods to return to steady state without any further shocks.

The comparison is internal and non-empirical — no baselines, no data, no seeds — so read the numbers as properties of the solution algorithm, not empirical claims.
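The mechanism behind the perception shock can be reproduced in a toy setting. The sketch below is not the paper's calibration: all coefficients are hypothetical, the observable is assumed to load on the true state only (so the standard no-noise Kalman gain applies), and no shocks hit the economy after the initial 1 % mis-valuation of capital.

```python
import numpy as np

# All numbers are hypothetical, chosen only to make the mechanism visible.
B_theta = np.array([[0.95, 0.10], [0.00, 0.99]])  # true-state transition for (k, z)
B_c = np.array([[-0.50], [0.00]])                 # consumption drains capital
H = np.array([[0.10, 0.05]])                      # action rule c = H @ belief
b = np.array([[0.36, 1.00]])                      # output loads on k and z
Q = np.diag([0.0, 1e-4])                          # only technology is shocked


def steady_state_gain(B_theta, b, Q, tol=1e-12, max_iter=10_000):
    """Iterate the covariance recursion and return the (near) converged gain."""
    n = B_theta.shape[0]
    Sigma_hat = np.eye(n)
    for _ in range(max_iter):
        K = Sigma_hat @ b.T @ np.linalg.inv(b @ Sigma_hat @ b.T)
        Sigma = (np.eye(n) - K @ b) @ Sigma_hat
        Sigma_next = B_theta @ Sigma @ B_theta.T + Q
        if np.max(np.abs(Sigma_next - Sigma_hat)) < tol:
            break
        Sigma_hat = Sigma_next
    return K


def simulate(T=200, misvaluation=0.01):
    """Start at steady state with a one-off belief error about capital."""
    K = steady_state_gain(B_theta, b, Q)
    theta = np.zeros(2)                        # true deviations from steady state
    prior = np.array([misvaluation, 0.0])      # agent believes capital is 1% higher
    consumption, capital, belief_z = [], [], []
    for _ in range(T):
        y = b @ theta                          # observed output
        belief = prior + K @ (y - b @ prior)   # belief update
        c = H @ belief                         # action on beliefs
        consumption.append(c[0])
        capital.append(theta[0])
        belief_z.append(belief[1])
        theta = B_c @ c + B_theta @ theta      # true transition, no shocks
        prior = B_c @ c + B_theta @ belief     # predicted state
    return np.array(consumption), np.array(capital), np.array(belief_z)


consumption, capital, belief_z = simulate()
print(consumption[:3], capital[:3], belief_z[:3])
```

In this toy the converged filter places almost no weight on capital being mis-measured, so the output surprise is attributed to technology, consumption starts above steady state, and true capital falls. The speed of the return is a property of the made-up coefficients and is not meant to match the paper's 100+ periods.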

Limitations

Linearity and Gaussianity are load-bearing. Weitzman (2007) and Geweke (2001) show CRRA-plus-normal can hide equity-premium-like puzzles; the same caution applies to any POMDP impulse-response built on this assumption.
Convergence of $\hat\Sigma_t$ is assumed. When the structural environment changes (regime switch, learning, structural break), the steady-state filter is a bad approximation for exactly the periods that matter most.
Partially-observable controls (trembling hand) are explicitly out of scope.
Only the linear-quadratic-Gaussian POMDP is tackled analytically; non-linear POMDPs in the AI sense (belief-space planning over continuous non-Gaussian beliefs) remain numerical.

Sapiens Q Take

Two pointers we want to carry into our own agent-simulation work:

The fixed-point as “observables → beliefs” is a cleaner equilibrium concept than “perceived ↔ actual law of motion” whenever we can write down an observation channel. For LLM-agent societies where agents receive noisy transcripts rather than full world state, this is the right primitive to reason with.
The perception-shock experiment is a template, not a result. A one-shot mis-valuation of a hidden endogenous state produces a protracted, self-reinforcing misallocation without any further shocks. Any multi-agent simulation where agents maintain internal estimates of each other’s types or of a shared latent variable should expect similar drift — and should be evaluated on how fast (or whether) the drift is corrected, not only on steady-state behavior.
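One way to score "how fast (or whether) the drift is corrected" is the first period after which the belief error stays inside a tolerance. The function below is a hypothetical metric sketched for illustration; the name `periods_to_recover` and the tolerance convention are assumptions, not something the paper defines.

```python
def periods_to_recover(errors, tol):
    """Return the first index from which |error| stays below tol, or None.

    None means the path never settles below tol within the simulated horizon.
    """
    last_violation = -1
    for t, e in enumerate(errors):
        if abs(e) >= tol:
            last_violation = t
    if last_violation == len(errors) - 1:
        return None
    return last_violation + 1


# A belief error that decays by 10% per period from an initial 1% mis-valuation.
decaying = [0.01 * 0.9 ** t for t in range(100)]
print(periods_to_recover(decaying, tol=1e-3))

# A belief error that is never corrected.
stuck = [0.01] * 50
print(periods_to_recover(stuck, tol=1e-3))
```

Reporting this number alongside steady-state behavior separates simulations that correct a mis-valuation quickly from those that only look well-behaved at convergence.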

References

Original paper: Kim, S.-H. (2012). Sequential Action and Beliefs Under Partially Observable DSGE Environments. Computational Economics 40(3), 219–244.
Foundational POMDP: Sondik, E. (1971). The optimal control of partially observable Markov processes. PhD thesis, Stanford University.
Fixed-point-of-perceived-laws-of-motion ancestor: Sargent, T. (1991). Equilibrium with signal extraction from endogenous variables. Journal of Economic Dynamics and Control 15(2).
Related filtering work: Svensson & Woodford (2003, 2004); Baxter, Graham & Wright (2011).