Modular MDP Foundations¶
This note develops a modular foundation for expressing the Bellman operator found in Markov decision processes (MDPs). The basic building block is the stage — a stochastic decision-making kernel that processes an independent piece of information.
The usual representation of an MDP defines a Bellman operator as an intermediate step in solving a larger, stationary, sequential problem. Alternatively, one finds recursive definitions of a Bellman operator, but without a complete account of how the operator composes with other operators in a non-stationary setting.
The idea of modularity is that individual stages and their Bellman operators can be defined independently of any encompassing problem, and then collected into a structure — a common library of stages can define a variety of structures. In computation this enables reusability of code. It also makes the definition of multi-stage non-stationary problems transparent: one specifies the stages, then specifies the wiring between them that produces the larger problem.
We define stages (each characterised by three information sets and the transitions between them), develop the factored Bellman equation and its operator decomposition, show that the standard MDP is a special case, and build toward composition of stages into periods.
1. The Stage¶
Let \((\Omega, \Sigma, \mathbb P)\) be a probability space.
Definition 1.1 (Stage). A stage \(\mathbb{S}\) consists of:
- State spaces. Measurable spaces \((\mathsf X_{\prec}, \mathcal X_{\prec})\), \((\mathsf X, \mathcal X)\), and \((\mathsf X_{\succ}, \mathcal X_{\succ})\), called the arrival, decision, and continuation spaces.
- Filtration. A nested triple of sub-\(\sigma\)-algebras of \(\Sigma\), \(\mathcal F_{\prec} \subseteq \mathcal F \subseteq \mathcal F_{\succ}\), representing the information available at each perch.
- Arrival state. An \(\mathsf X_{\prec}\)-valued random variable \(x_{\prec}\) on \((\Omega, \Sigma, \mathbb P)\), measurable with respect to \(\mathcal F_{\prec}\).
- Continuation value. A measurable function \(\mathrm{v}_{\succ} \colon \mathsf X_{\succ} \to \mathsf V\), where \((\mathsf V, \mathcal V)\) is a measurable value space.
- Shock spaces. Measurable spaces \((\mathsf Z_{\prec}, \mathcal Z_{\prec})\) and \((\mathsf Z_{\succ}, \mathcal Z_{\succ})\) with probability measures \(P_\zeta\) and \(P_\eta\). The shock \(\zeta \sim P_\zeta\) is resolved between arrival and decision; the shock \(\eta \sim P_\eta\) is resolved between decision and continuation. Either space may be trivial (\(\mathsf Z = \{\varnothing\}\)), in which case the corresponding argument is suppressed.
- Transition maps. Measurable maps \(\mathrm{g}_{\prec\sim} \colon \mathsf X_{\prec} \times \mathsf Z_{\prec} \to \mathsf X\) and \(\mathrm{g}_{\sim\succ} \colon \mathsf X \times \mathsf A \times \mathsf Z_{\succ} \to \mathsf X_{\succ}\), moving the state from arrival to decision and from decision to continuation.
- Action space and feasibility. A measurable space \((\mathsf A, \mathcal A)\) and a nonempty-valued correspondence \(\Gamma \colon \mathsf X \rightrightarrows \mathsf A\).
- Reward and discount. A measurable function \(\mathrm{r} \colon \mathsf X \times \mathsf A \to \mathbb R\) and a discount factor \(\beta \in [0,1)\).
- Backward movers. Two mappings that transform a value function at one perch into a value function at the adjacent perch:
- The decision mover \(\mathbb{B}\) takes \(\mathrm{v}_{\succ} \colon \mathsf X_{\succ} \to \mathsf V\) and produces \(v \colon \mathsf X \to \mathsf V\) via \(v = \mathbb{B}\,\mathrm{v}_{\succ}\).
- The arrival mover \(\mathbb{I}\) takes \(v \colon \mathsf X \to \mathsf V\) and produces \(\mathrm{v}_{\prec} \colon \mathsf X_{\prec} \to \mathsf V\) via \(\mathrm{v}_{\prec} = \mathbb{I}\,v\).
The composite \(\mathbb{T} := \mathbb{I} \circ \mathbb{B}\) is the stage operator, mapping a continuation value to an arrival value.
Remark (movers as edges). A mover is defined by its action on a single value function — it is an edge connecting two perches, not an operator on a function space. This pointwise view matches the graph structure of the stage: perches are nodes, movers are directed edges. The lifting to an operator on the space \(\mathscr B(\mathsf X, \mathsf V)\) is a derived object used when studying fixed points and contraction (§4).
A policy is a measurable map \(\pi \colon \mathsf X \to \mathsf A\) satisfying \(\pi(x) \in \Gamma(x)\) for all \(x\), adapted to \(\mathcal F\). Given a policy, the state transitions within a stage are
\[
x = \mathrm{g}_{\prec\sim}(x_{\prec}, \zeta), \qquad x_{\succ} = \mathrm{g}_{\sim\succ}(x, \pi(x), \eta),
\]
so that \(x\) incorporates the arrival state and the pre-decision shock, and \(x_{\succ}\) incorporates the decision state, the chosen action, and the post-decision shock. We require \(\sigma(x_{\succ}) = \mathcal F_{\succ}\), i.e. all information at the continuation perch is captured by the continuation state.
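Definition 1.1 translates naturally into code. A minimal Python sketch, substituting finite equiprobable shock supports for the measures \(P_\zeta\) and \(P_\eta\); all field names and the toy savings instantiation are hypothetical, not a prescribed API:

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass(frozen=True)
class Stage:
    """One stage: two transition maps, feasibility, reward, discount, shocks.

    Shock supports are finite and equiprobable here; a trivial shock space
    is represented by the one-point tuple (None,).
    """
    g_pre: Callable[[Any, Any], Any]        # g_{<~}: (x_arvl, zeta) -> x
    g_post: Callable[[Any, Any, Any], Any]  # g_{~>}: (x, a, eta) -> x_cntn
    Gamma: Callable[[Any], Sequence]        # feasible actions at x
    r: Callable[[Any, Any], float]          # reward r(x, a)
    beta: float                             # discount factor in [0, 1)
    zeta_draws: Sequence = (None,)          # pre-decision shock support
    eta_draws: Sequence = (None,)           # post-decision shock support

# A toy savings stage: wealth x, savings choice a, gross return shock eta.
stage = Stage(
    g_pre=lambda x_arvl, zeta: x_arvl,      # trivial pre-decision shock
    g_post=lambda x, a, eta: eta * a,       # continuation wealth
    Gamma=lambda x: [0.0, 0.5 * x, x],      # coarse action grid
    r=lambda x, a: x - a,                   # consume what is not saved
    beta=0.95,
    eta_draws=(0.9, 1.1),
)
```

The stage carries only its internal primitives; how it connects to other stages is left to the connectors of §6.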
The data of a stage group naturally into three perches (nodes) and two movers (edges). Each perch collects the state, value, and problem primitives measurable at one information set. Each mover is a mapping between adjacent perches — forward in the transition direction, backward in the value direction.
Perches (nodes):
| Perch | State | Value | Problem primitives | Syntax tag |
|---|---|---|---|---|
| Arrival | \(x_{\prec} \in \mathsf X_{\prec}\) | \(\mathrm{v}_{\prec}\) | \(P_\zeta\) | [<] |
| Decision | \(x \in \mathsf X\) | \(v\) | \(\Gamma\), \(\mathrm{r}\), \(\beta\), \(P_\eta\) | (unmarked) |
| Continuation | \(x_{\succ} \in \mathsf X_{\succ}\) | \(\mathrm{v}_{\succ}\) (input) | — | [>] |
Movers (edges):
| Edge | Forward (transition) | Backward (value) |
|---|---|---|
| Arrival → Decision | \(x = \mathrm{g}_{\prec\sim}(x_{\prec}, \zeta)\) | \(\mathrm{v}_{\prec} = \mathbb{I}\,v\) |
| Decision → Continuation | \(x_{\succ} = \mathrm{g}_{\sim\succ}(x, a, \eta)\) | \(v = \mathbb{B}\,\mathrm{v}_{\succ}\) |
The preferred syntax tags are [<] (arrival) and [>] (continuation); the decision perch is unmarked. These are measurability declarations: x[<] asserts \(\mathcal F_{\prec}\)-measurability, a bare x asserts \(\mathcal F\)-measurability, and x[>] asserts \(\mathcal F_{\succ}\)-measurability. Arrow tags ([<-], [-], [->]) and named tags ([_arvl], [_dcsn], [_cntn]) remain accepted aliases.
Remark. The additive reward structure \(\mathrm{r}(x, a) + \beta\, \mathrm{v}_{\succ}\) can be generalised; the essential requirement is that the stage operator \(\mathbb{T}\) maps continuation value functions to arrival value functions.
1.1 The category of stages¶
A stage consumes a continuation space and produces an arrival space. Each stage contributes three perches and two movers. A period — an ordered collection of stages — gives a category by taking the union of all perches (nodes) and movers (directed edges).
We write \(\prec^{(j)}\), \(\cdot^{(j)}\), \(\succ^{(j)}\) for the arrival, decision, and continuation perches of stage \(j\).
Definition 1.2 (Period category). A period \(\mathcal P\) consisting of stages \(\mathbb S_0, \ldots, \mathbb S_{K-1}\) defines a category \(\mathbf{Per}(\mathcal P)\):
- Objects: perches — each carrying its state space, value function, filtration, and problem primitives (as in the table above).
- Morphisms: the forward transitions and backward movers of each stage, together with identifications \(\succ^{(j)} = \prec^{(j+1)}\) whenever consecutive stages share a common boundary.
The shared boundary perches are what make the period a connected structure. When \(\succ^{(j)} = \prec^{(j+1)}\), the movers compose through it.
Linear period (two stages, shared boundary perch):
┌────── 𝕊₀ ───────┐ ┌────── 𝕊₁ ───────┐
g≺○ g○≻ g≺○ g○≻
≺⁰ ───→ ·⁰ ───→ ≻⁰ ══shared══ ≺¹ ───→ ·¹ ───→ ≻¹
←┄┄┄ ←┄┄┄ ←┄┄┄ ←┄┄┄
𝕀 𝔹 𝕀 𝔹
Each node is a full perch (state, value, filtration, primitives). Solid arrows: forward transitions (\(\mathrm{g}_{\prec\sim}\), \(\mathrm{g}_{\sim\succ}\)), left-to-right. Dashed arrows: backward movers (\(\mathbb{I}\), \(\mathbb{B}\)), right-to-left. The double line marks \(\succ^{(0)} = \prec^{(1)}\).
Branching period (stage 0 followed by a branching stage with two successors):
𝕊₀ 𝕊₁
g≺○ g○≻ g≺○ g○≻
≺⁰ ───→ ·⁰ ───→ ≻⁰ ═══ ≺¹ ───→ ·¹ ──┬─ p(a) ─→ ≻¹ₐ ═══ ≺ᵃ ─── 𝕊ₐ ───→ ≻ᵃ
←┄┄┄ ←┄┄┄ ←┄┄┄ │
𝕀 𝔹 𝕀 └─ p(b) ─→ ≻¹ᵦ ═══ ≺ᵇ ─── 𝕊ᵦ ───→ ≻ᵇ
Here stage \(\mathbb S_1\) is a branching stage: the probabilities \(\mathrm{p}(a \mid x, \pi)\), \(\mathrm{p}(b \mid x, \pi)\) are part of \(\mathbb S_1\)'s internal \(\mathrm{g}_{\sim\succ}\) transition, which routes from the decision perch \(\cdot^{(1)}\) to multiple continuation perches \(\succ^{(1)}_a\), \(\succ^{(1)}_b\). The connections to successor stages are shared boundary perches (\(\succ^{(1)}_a = \prec^{(a)}\), \(\succ^{(1)}_b = \prec^{(b)}\)), just as in the linear case.
Remark (names are syntax, not structure). Objects of \(\mathbf{Per}\) carry bare measurable spaces — no variable names. Concrete names (m, a, c) live in the YAML syntax. The meaning map \(\Upsilon\) compiles named declarations into bare spaces by taking products: prestate: { m: "@in R++", k: "@in R+" } becomes \(\mathsf X_{\prec} = \mathbb R_{++} \times \mathbb R_+\). The category sees only the perches and the movers between them, not the names.
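The compilation step of the meaning map can be sketched directly. Everything below is illustrative: the tag strings mirror the "@in R++" / "@in R+" syntax quoted in the remark, but the `upsilon` function and the interval encoding are hypothetical stand-ins, not a documented implementation:

```python
# Hypothetical sketch of the meaning map Upsilon: named declarations -> bare spaces.
# An interval is encoded as (lower, upper, closed_at_lower).
INTERVALS = {
    "@in R++": (0.0, float("inf"), False),  # open at 0
    "@in R+":  (0.0, float("inf"), True),   # closed at 0
}

def upsilon(decl: dict) -> tuple:
    """Compile {name: space-tag} into an ordered product of bare intervals.

    Names are dropped in the output: the category sees only the product space.
    """
    return tuple(INTERVALS[tag] for tag in decl.values())

# prestate: { m: "@in R++", k: "@in R+" }  becomes  X_arvl = R++ x R+
X_arvl = upsilon({"m": "@in R++", "k": "@in R+"})
```

Declaration order fixes the factor order of the product, which is why the YAML mapping is treated as ordered.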
1.2 Shock configurations¶
The two shock spaces determine four canonical configurations.
| Configuration | \(\mathsf Z_{\prec}\) | \(\mathsf Z_{\succ}\) | Structure |
|---|---|---|---|
| Pre-decision only | non-trivial | \(\{\varnothing\}\) | \(x = \mathrm{g}_{\prec\sim}(x_{\prec}, \zeta)\), \(\; x_{\succ} = \mathrm{g}_{\sim\succ}(x, a)\) |
| Post-decision only | \(\{\varnothing\}\) | non-trivial | \(x = x_{\prec}\), \(\; x_{\succ} = \mathrm{g}_{\sim\succ}(x, a, \eta)\) |
| Both | non-trivial | non-trivial | General case |
| Deterministic | \(\{\varnothing\}\) | \(\{\varnothing\}\) | Pure optimisation |
In the pre-decision case, \(\zeta\) (e.g. an interest rate) is realised between arrival and decision; the agent observes it before choosing, and \(\mathrm{g}_{\sim\succ}\) is deterministic. In the post-decision case, \(\mathrm{g}_{\prec\sim}\) is the identity and \(\eta\) (e.g. asset returns) enters only at continuation. When both shocks are non-trivial, the composite transition is \(x_{\succ} = \mathrm{g}_{\sim\succ}(\mathrm{g}_{\prec\sim}(x_{\prec}, \zeta),\, a,\, \eta)\).
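The composite transition in the "both" configuration is a literal function composition. A sketch, with a hypothetical income shock \(\zeta\) and return shock \(\eta\):

```python
def composite(g_pre, g_post, x_arvl, a, zeta, eta):
    """x_cntn = g_{~>}(g_{<~}(x_arvl, zeta), a, eta): both shocks non-trivial."""
    return g_post(g_pre(x_arvl, zeta), a, eta)

# Toy instance: zeta is labour income realised before the choice,
# eta is a gross asset return realised after it.
g_pre = lambda x_arvl, zeta: x_arvl + zeta   # cash on hand at decision
g_post = lambda x, a, eta: eta * (x - a)     # invest what is not consumed

# (2.0 + 1.0 - 1.0) * 1.05
x_cntn = composite(g_pre, g_post, x_arvl=2.0, a=1.0, zeta=1.0, eta=1.05)
```

Setting `zeta_draws` or `eta_draws` to a one-point support recovers the pre-decision-only, post-decision-only, and deterministic rows of the table.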
2. The Backward Bellman Recursion¶
We can now put some structure on the movers to define a Bellman equation. Let \(\mathrm{v}_{\succ} \colon \mathsf X_{\succ} \to \mathbb R\) be a given continuation value function. The stage defines value functions on each perch by backward induction.
Decision value. At the decision perch the agent solves
\[
v(x) = \sup_{a \in \Gamma(x)} \Big\{ \mathrm{r}(x, a) + \beta \int_{\mathsf Z_{\succ}} \mathrm{v}_{\succ}\big(\mathrm{g}_{\sim\succ}(x, a, \eta)\big)\, P_\eta(d\eta) \Big\}.
\]
When \(\mathsf Z_{\succ}\) is trivial this reduces to \(v(x) = \sup_{a \in \Gamma(x)}\{\mathrm{r}(x, a) + \beta\, \mathrm{v}_{\succ}(\mathrm{g}_{\sim\succ}(x, a))\}\).
Arrival value. Integrating over the pre-decision shock,
\[
\mathrm{v}_{\prec}(x_{\prec}) = \int_{\mathsf Z_{\prec}} v\big(\mathrm{g}_{\prec\sim}(x_{\prec}, \zeta)\big)\, P_\zeta(d\zeta).
\]
When \(\mathsf Z_{\prec}\) is trivial, \(\mathrm{v}_{\prec} = v\).
These two equations constitute the canonical Bellman recursion on perches. They make explicit which randomness sits inside vs. outside the supremum, and what the conditioning \(\sigma\)-algebra is at each step. The continuation value \(\mathrm{v}_{\succ}\) enters as an arbitrary element of \(\mathscr B(\mathsf X_{\succ})\); the stage transforms it into an arrival value \(\mathrm{v}_{\prec} \in \mathscr B(\mathsf X_{\prec})\). This input–output structure is what gives the stage its modular character.
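When the shock supports are finite and equiprobable, the two perch equations can be evaluated in a few lines. A minimal sketch; function and argument names are illustrative:

```python
import numpy as np

def decision_value(v_cntn, x, Gamma, r, beta, g_post, eta_draws):
    """v(x) = sup_a { r(x, a) + beta * E_eta[ v_>(g_{~>}(x, a, eta)) ] }."""
    return max(
        r(x, a) + beta * np.mean([v_cntn(g_post(x, a, e)) for e in eta_draws])
        for a in Gamma(x)
    )

def arrival_value(v_dcsn, x_arvl, g_pre, zeta_draws):
    """v_<(x_<) = E_zeta[ v(g_{<~}(x_<, zeta)) ]."""
    return np.mean([v_dcsn(g_pre(x_arvl, z)) for z in zeta_draws])

# Toy check: identity continuation value, quadratic reward, trivial eta.
v = decision_value(
    v_cntn=lambda s: s, x=1.0,
    Gamma=lambda x: [0.0, 0.5, 1.0],
    r=lambda x, a: a - a**2, beta=0.9,
    g_post=lambda x, a, e: x - a, eta_draws=[None],
)
```

Note how the supremum sits outside the \(\eta\)-expectation but the \(\zeta\)-expectation sits outside the supremum, exactly as the two equations dictate.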
2.1 The Factored Bellman Operator¶
The Bellman recursion of §2 gives the canonical content of the backward movers \(\mathbb{B}\) and \(\mathbb{I}\) introduced in Definition 1.1.
Definition 2.2 (Bellman operators). Let \(\mathscr B(\mathsf Y)\) denote the space of bounded measurable functions on a measurable space \(\mathsf Y\).
The decision mover \(\mathbb{B} \colon \mathscr B(\mathsf X_{\succ}) \to \mathscr B(\mathsf X)\) is
\[
(\mathbb{B}\,\mathrm{v}_{\succ})(x) = \sup_{a \in \Gamma(x)} \Big\{ \mathrm{r}(x, a) + \beta \int_{\mathsf Z_{\succ}} \mathrm{v}_{\succ}\big(\mathrm{g}_{\sim\succ}(x, a, \eta)\big)\, P_\eta(d\eta) \Big\}.
\]
The arrival mover \(\mathbb{I} \colon \mathscr B(\mathsf X) \to \mathscr B(\mathsf X_{\prec})\) is
\[
(\mathbb{I}\,v)(x_{\prec}) = \int_{\mathsf Z_{\prec}} v\big(\mathrm{g}_{\prec\sim}(x_{\prec}, \zeta)\big)\, P_\zeta(d\zeta).
\]
In standard value function iteration the stage operator is \(\mathbb{T} = \mathbb{I} \circ \mathbb{B}\). Alternative solution methods (policy iteration, marginal value iteration, EGM) replace \(\mathbb{B}\) with a different operator that has the same signature \(\mathscr B(\mathsf X_{\succ}) \to \mathscr B(\mathsf X)\) but evaluates the recursion differently. The arrival mover \(\mathbb{I}\) (expectation) is typically unchanged.
The factored Bellman operator is the composition
\[
\mathbb{T} = \mathbb{I} \circ \mathbb{B} \colon \mathscr B(\mathsf X_{\succ}) \to \mathscr B(\mathsf X_{\prec}).
\]
Remark (finer factorisations). The two-part factorisation \(\mathbb{T} = \mathbb{I} \circ \mathbb{B}\) only claims that there is one operator per inter-perch edge that projects over information. Each operator may itself decompose further. For example, \(\mathbb{B}\) combines an expectation and a maximisation:
\[
\mathrm{Q}(x, a) = \mathrm{r}(x, a) + \beta \int_{\mathsf Z_{\succ}} \mathrm{v}_{\succ}\big(\mathrm{g}_{\sim\succ}(x, a, \eta)\big)\, P_\eta(d\eta), \qquad v(x) = \sup_{a \in \Gamma(x)} \mathrm{Q}(x, a).
\]
Here \(\mathrm{Q} \colon \mathsf X \times \mathsf A \to \mathbb R\) is the state-action value function, and \(\mathbb{B}\) factors as expectation (\(\mathrm{v}_{\succ} \mapsto \mathrm{Q}\)) followed by maximisation (\(\mathrm{Q} \mapsto v\)). Solution methods that work with \(\mathrm{Q}\) directly (e.g. Q-learning, or methods that iterate on marginal values) can be seen as replacing \(\mathbb{B}\) with an operator that targets \(\mathrm{Q}\) rather than \(v\). The stage framework is agnostic to this choice — it requires only that \(\mathbb{B}\) have the signature \(\mathscr B(\mathsf X_{\succ}) \to \mathscr B(\mathsf X)\).
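The expectation-then-maximisation factoring of \(\mathbb{B}\) is mechanical to express. A sketch under the same finite-shock convention as before; names are illustrative:

```python
import numpy as np

def make_Q(v_cntn, r, beta, g_post, eta_draws):
    """Expectation step v_> -> Q: Q(x, a) = r(x, a) + beta * E[v_>(g_{~>})]."""
    def Q(x, a):
        return r(x, a) + beta * np.mean(
            [v_cntn(g_post(x, a, e)) for e in eta_draws]
        )
    return Q

def maximise(Q, Gamma):
    """Maximisation step Q -> v: v(x) = sup_{a in Gamma(x)} Q(x, a)."""
    return lambda x: max(Q(x, a) for a in Gamma(x))

# B = maximise after make_Q; a Q-based method would stop after the first step.
Q = make_Q(v_cntn=lambda s: s, r=lambda x, a: -a, beta=0.5,
           g_post=lambda x, a, e: x + a, eta_draws=[None])
v = maximise(Q, Gamma=lambda x: [0.0, 1.0])
```

A Q-targeting method keeps the intermediate object from `make_Q` and never forms `v` explicitly; the signature \(\mathscr B(\mathsf X_{\succ}) \to \mathscr B(\mathsf X)\) of the composite is unchanged.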
3. Forward Operators¶
The backward movers of §2 act on value functions. Their conjugate forward operators act on measures, evolving a population distribution through the stage under a fixed policy \(\pi^*\).
3.1 Measure evolution¶
Let \(\mu_{\prec}\) denote the distribution of the arrival state \(x_{\prec}\), i.e. \(\mu_{\prec}(B) = \mathbb P(x_{\prec} \in B)\) for \(B \in \mathcal X_{\prec}\).
Arrival to decision. The forward operator \(\mathrm{F}_{\prec}\) pushes \(\mu_{\prec}\) forward through the pre-decision shock to produce the decision-state distribution \(\mu = \mathrm{F}_{\prec}\,\mu_{\prec}\):
\[
(\mathrm{F}_{\prec}\,\mu_{\prec})(B) = \int_{\mathsf X_{\prec}} \int_{\mathsf Z_{\prec}} \mathbf 1\big\{\mathrm{g}_{\prec\sim}(x_{\prec}, \zeta) \in B\big\}\, P_\zeta(d\zeta)\, \mu_{\prec}(dx_{\prec}), \qquad B \in \mathcal X.
\]
Decision to continuation. Under the policy \(\pi^*\), the forward operator \(\mathrm{F}_{\succ}\) pushes the decision-state distribution \(\mu\) forward to produce the continuation-state distribution \(\mu_{\succ} = \mathrm{F}_{\succ}\,\mu\):
\[
(\mathrm{F}_{\succ}\,\mu)(C) = \int_{\mathsf X} \int_{\mathsf Z_{\succ}} \mathbf 1\big\{\mathrm{g}_{\sim\succ}(x, \pi^*(x), \eta) \in C\big\}\, P_\eta(d\eta)\, \mu(dx), \qquad C \in \mathcal X_{\succ}.
\]
When \(\mathsf Z_{\succ}\) is trivial the kernel reduces to an indicator: \(\mathbf 1\{\mathrm{g}_{\sim\succ}(x, \pi^*(x)) \in C\}\).
3.2 Conjugate duality¶
For any \(v \in \mathscr B(\mathsf X)\) and probability measure \(\mu_{\prec}\) on \(\mathsf X_{\prec}\),
\[
\langle \mathrm{F}_{\prec}\,\mu_{\prec},\, v \rangle = \langle \mu_{\prec},\, \mathbb{I}\,v \rangle,
\]
where \(\langle \mu, f \rangle = \int f\, d\mu\). The same duality holds for the pair \((\mathrm{F}_{\succ}, \mathbb{B})\).
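On finite spaces the duality is a direct numerical check. A sketch with an arbitrary three-point arrival space and a two-point pre-decision shock; the transition and the numbers are hypothetical:

```python
import numpy as np

n = 3
g_pre = lambda x, z: (x + z) % n          # toy transition on {0, 1, 2}
zetas = [0, 1]                            # equiprobable pre-decision shock
mu = np.array([0.5, 0.3, 0.2])            # arrival distribution mu_<
v = np.array([1.0, -2.0, 0.5])            # arbitrary v in B(X)

# Forward: F_< pushes mu_< through the shock to the decision distribution.
F_mu = np.zeros(n)
for x in range(n):
    for z in zetas:
        F_mu[g_pre(x, z)] += mu[x] / len(zetas)

# Backward: the arrival mover I averages v over the shock.
I_v = np.array([np.mean([v[g_pre(x, z)] for z in zetas]) for x in range(n)])

lhs, rhs = F_mu @ v, mu @ I_v             # <F_< mu, v>  vs  <mu, I v>
assert abs(lhs - rhs) < 1e-12
```

Both sides compute the same expectation of \(v\) over the joint draw of arrival state and shock, which is the content of the duality.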
5. Relationship to the Standard MDP¶
A single stage with the pre-decision-only configuration yields the standard MDP.
Let \((\mathsf X_t, \mathcal X_t)\), \((\mathsf A_t, \mathcal A_t)\), \(\Gamma_t\), \(\mathrm{r}_t\), \(\beta_t\), and \(\mathrm{P}_t(\cdot \mid x, a)\) be the primitives of a non-stationary MDP with Bellman equation
\[
v_t(x) = \sup_{a \in \Gamma_t(x)} \Big\{ \mathrm{r}_t(x, a) + \beta_t \int_{\mathsf X_{t+1}} v_{t+1}(x')\, \mathrm{P}_t(dx' \mid x, a) \Big\}.
\]
Set \(\mathsf X = \mathsf X_t\). The transition kernel is recovered by composing \(\mathrm{g}_{\sim\succ}\) and \(\mathrm{g}_{\prec\sim}\) across consecutive stages and integrating out the shock:
\[
\mathrm{P}_t(B \mid x, a) = \int_{\mathsf Z_{\prec}} \mathbf 1\big\{\mathrm{g}_{\prec\sim}\big(\mathrm{g}_{\sim\succ}(x, a), \zeta\big) \in B\big\}\, P_\zeta(d\zeta).
\]
The stage is thus a refinement of the standard MDP: it makes explicit the within-period information sets that the kernel \(\mathrm{P}_t\) leaves implicit.
Remark 5.1 (Markov sufficiency). The decision state \(x\) must incorporate all payoff- and transition-relevant information available at decision time. If a variable (e.g. an interest rate) affects the distribution of future shocks, it must appear in \(x\). When this condition fails, the model requires belief-state augmentation and becomes a POMDP. Writing \(x\) without a perch tag (decision perch is unmarked) enforces this: it declares that \(x\) is \(\mathcal F\)-measurable.
Remark 5.2 (No additional structure). The Arrival–Decision–Continuation decomposition rests on two properties already standard in the DP literature: (i) policies are adapted to information, and (ii) randomness can enter before or after the action, provided the decision state is Markov-sufficient. The stage imposes no structure beyond what is implicit in any well-posed MDP.
6. Stage Composition¶
A stage defines only its internal transitions \(\mathrm{g}_{\prec\sim}\) and \(\mathrm{g}_{\sim\succ}\). The connection between stages is specified externally by connectors. A period is an ordered sequence of stages \(\{s_0, s_1, \ldots, s_{K-1}\}\) together with connectors \(\mathrm{c}_0, \ldots, \mathrm{c}_{K-2}\), sharing a common time index and variable namespace.
6.1 Connectors¶
Definition 6.1 (Connector). For sequential stages \(j\) and \(j+1\), a connector is a measurable map
\[
\mathrm{c}_j \colon \mathsf X_{\succ}^{(j)} \to \mathsf X_{\prec}^{(j+1)},
\]
linking the continuation space of stage \(j\) to the arrival space of stage \(j+1\).
Under backward induction the continuation value of stage \(j\) is assigned via
\[
\mathrm{v}_{\succ}^{(j)} = \mathrm{v}_{\prec}^{(j+1)} \circ \mathrm{c}_j.
\]
Under forward simulation the distribution evolves as
\[
\mu_{\prec}^{(j+1)} = \mu_{\succ}^{(j)} \circ \mathrm{c}_j^{-1},
\]
the pushforward of \(\mu_{\succ}^{(j)}\) under \(\mathrm{c}_j\).
This separation — internal structure defined by the stage, external wiring by the connector — is what permits stages to be reused in different model configurations.
6.2 Periods¶
The composite Bellman operator is
\[
\mathbb{T}_{\mathcal P} = \mathbb{T}_0 \circ \mathrm{c}_0^{*} \circ \mathbb{T}_1 \circ \cdots \circ \mathrm{c}_{K-2}^{*} \circ \mathbb{T}_{K-1}, \qquad (\mathrm{c}_j^{*} f)(x_{\succ}) = f\big(\mathrm{c}_j(x_{\succ})\big),
\]
read right-to-left. Each stage can be replaced independently, provided the connectors remain well-defined.
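The right-to-left composition through connectors is a fold over the stage list. A sketch, treating each stage operator as a plain function from value function to value function; names are illustrative:

```python
def compose_period(stage_ops, connectors):
    """Compose stage operators T_0, ..., T_{K-1} through connectors c_j.

    The last stage is applied first; before applying stage j, the current
    value is pulled back through c_j (v_>^{(j)} = v_<^{(j+1)} composed with c_j).
    """
    def T_P(v_cntn):
        v = stage_ops[-1](v_cntn)
        for T, c in zip(reversed(stage_ops[:-1]), reversed(connectors)):
            v = T(lambda x, v=v, c=c: v(c(x)))   # bind current v, c
        return v
    return T_P

# Toy check: each stage operator adds a reward of 1.0; connectors are
# identities, so the composite adds 2.0 to any continuation value.
add_one = lambda v: (lambda x: 1.0 + v(x))
T_P = compose_period([add_one, add_one], [lambda x: x])
```

Swapping one entry of `stage_ops` replaces a stage without touching the others, which is the modularity claim in code.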
6.3 Branching¶
A branching stage admits transitions to multiple successor stages. Let \(\mathsf N_+\) be the set of successor indices, with state-dependent probabilities \(\mathrm{p}(j \mid x, a)\) summing to one over \(j \in \mathsf N_+\). The decision value becomes
\[
v(x) = \sup_{a \in \Gamma(x)} \Big\{ \mathrm{r}(x, a) + \beta \sum_{j \in \mathsf N_+} \mathrm{p}(j \mid x, a)\, \mathrm{v}_{\prec}^{(j)}\big(\mathrm{c}_j(x, a)\big) \Big\},
\]
where \(\mathrm{v}_{\prec}^{(j)}\) is the arrival value of successor \(j\) and \(\mathrm{c}_j \colon \mathsf X \times \mathsf A \to \mathsf X_{\prec}^{(j)}\) is the branching connector. The forward dynamics split accordingly:
\[
\mu_{\prec}^{(j)}(B) = \int_{\mathsf X} \mathrm{p}\big(j \mid x, \pi^*(x)\big)\, \mathbf 1\big\{\mathrm{c}_j(x, \pi^*(x)) \in B\big\}\, \mu(dx),
\]
with mass conserved: \(\sum_{j \in \mathsf N_+} \mu_{\prec}^{(j)}(\mathsf X_{\prec}^{(j)}) = \mu(\mathsf X)\).
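Mass conservation is easy to verify on a finite decision space. A sketch with two decision states and two successors; the branch probabilities are hypothetical:

```python
import numpy as np

mu = np.array([0.6, 0.4])                 # decision-state distribution
p = {"a": np.array([0.3, 0.8]),           # p(a | x) for x = 0, 1
     "b": np.array([0.7, 0.2])}           # p(b | x) = 1 - p(a | x)

# Total mass routed to each successor's arrival distribution
# (continuation states collapsed, since only mass is being checked).
mass = {j: float(mu @ p[j]) for j in p}

# sum_j mu_<^{(j)}(X_<^{(j)}) = mu(X): no mass created or destroyed.
assert abs(sum(mass.values()) - mu.sum()) < 1e-12
```

Each unit of decision-perch mass is split across the branches by \(\mathrm{p}(j \mid x)\), so the branch masses always sum back to the original.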
4. Contraction and Fixed Point¶
The factored Bellman operator \(\mathbb{T}\) maps \(\mathscr B(\mathsf X_{\succ})\) into \(\mathscr B(\mathsf X_{\prec})\). In the stationary case, the arrival and continuation spaces are identified via a connector \(\mathrm{c} \colon \mathsf X_{\succ} \to \mathsf X_{\prec}\) (see §6), yielding a self-map
\[
\hat{\mathbb{T}} \colon \mathscr B(\mathsf X_{\succ}) \to \mathscr B(\mathsf X_{\succ}), \qquad \hat{\mathbb{T}}\,\mathrm{v}_{\succ} = (\mathbb{T}\,\mathrm{v}_{\succ}) \circ \mathrm{c}.
\]
Proposition 4.1. Suppose rewards are bounded, \(\beta < 1\), action spaces are compact, and \(\mathrm{r}\), \(\mathrm{g}_{\sim\succ}\), and \(\Gamma\) are continuous. Then \(\hat{\mathbb{T}}\) satisfies Blackwell's sufficient conditions for a contraction on \((\mathscr B(\mathsf X_{\succ}), \|\cdot\|_\infty)\) with modulus \(\beta\):
\[
\text{(monotonicity)} \;\; v \le w \;\Rightarrow\; \hat{\mathbb{T}} v \le \hat{\mathbb{T}} w, \qquad \text{(discounting)} \;\; \hat{\mathbb{T}}(v + c) \le \hat{\mathbb{T}} v + \beta c \;\text{ for all constants } c \ge 0.
\]
By the Banach fixed-point theorem there exists a unique \(\mathrm{v}_{\succ}^*\) satisfying \(\hat{\mathbb{T}}\, \mathrm{v}_{\succ}^* = \mathrm{v}_{\succ}^*\), and iteration converges geometrically:
\[
\big\| \hat{\mathbb{T}}^n v - \mathrm{v}_{\succ}^{*} \big\|_\infty \le \beta^n\, \big\| v - \mathrm{v}_{\succ}^{*} \big\|_\infty \quad \text{for all } v \in \mathscr B(\mathsf X_{\succ}).
\]
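The geometric rate is visible in the simplest possible stationary stage. Assume one decision state, constant reward \(1\), and an identity connector; then \(\hat{\mathbb{T}} v = 1 + \beta v\) and \(\mathrm{v}_{\succ}^{*} = 1/(1-\beta)\):

```python
beta = 0.9
T_hat = lambda v: 1.0 + beta * v        # one-state stage operator
v_star = 1.0 / (1.0 - beta)             # unique fixed point

v, errors = 0.0, []
for _ in range(50):
    v = T_hat(v)
    errors.append(abs(v - v_star))

# Each iteration shrinks the error by exactly the modulus beta.
assert all(e2 <= beta * e1 + 1e-9 for e1, e2 in zip(errors, errors[1:]))
assert errors[-1] < 1e-1
```

In this affine case the contraction bound holds with equality, since \(\hat{\mathbb{T}} v - \mathrm{v}_{\succ}^{*} = \beta\,(v - \mathrm{v}_{\succ}^{*})\) at every step.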
7. Solution Methods¶
The factored Bellman equation \(v = \mathbb{B}\, \mathrm{v}_{\succ}\) admits several solution strategies. The stage — its spaces, transitions, rewards, and constraints — is invariant across methods; what varies is the numerical treatment of \(\mathbb{B}\).
Value function iteration. Iterate \(\mathrm{v}_{\succ}^{(n+1)} = \hat{\mathbb{T}}\, \mathrm{v}_{\succ}^{(n)}\) until convergence.
First-order conditions. When \(\mathrm{r}\) and \(\mathrm{v}_{\succ}\) are differentiable and the problem is concave, the interior optimum satisfies the Euler equation
\[
\frac{\partial \mathrm{r}}{\partial a}(x, a) + \beta \int_{\mathsf Z_{\succ}} \mathrm{v}_{\succ}'\big(\mathrm{g}_{\sim\succ}(x, a, \eta)\big)\, \frac{\partial \mathrm{g}_{\sim\succ}}{\partial a}(x, a, \eta)\, P_\eta(d\eta) = 0.
\]
Endogenous grid methods. EGM inverts the Euler equation: given a grid on \(\mathsf X_{\succ}\), it recovers the decision state and action consistent with optimality, avoiding root-finding.
Policy iteration. Alternate between policy evaluation (solving the linear system implied by a fixed policy) and policy improvement (updating the policy given the new value function).
Each method computes different objects at the perches — value functions, marginal values, policy functions — but all solve the same mathematical problem. The perch structure specifies what information is available at each point; it is silent on which computational objects are stored there.