🧠 TL;DR

Planning and control can be framed as Bayesian inference: by treating exponentiated negative costs as (unnormalized) likelihoods and dynamics as priors, decision-making reduces to inferring a posterior over trajectories conditioned on optimality.


Nearly all planning, optimal control, and reinforcement learning algorithms can be derived as (approximate) solutions to a variational inference problem of the form:

\[\min_Q\;\mathbb{E}_Q[C(\tau)] + \mathrm{KL}(Q(\tau) \,\|\, P(\tau))\]

by varying:

  • the form of the prior $P(\tau)$ — e.g. passive dynamics, old policy, or expert behavior;
  • the expressivity of the variational family $Q(\tau)$ — e.g. deterministic plans, stochastic policies, hierarchical structures (see the worked example after this list);
  • the approximation scheme — e.g. sampling, linearization, or taking a low-temperature limit.
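
For instance, restricting $Q$ to a Dirac delta on a single trajectory (the "deterministic plans" case, over a discrete trajectory space) collapses the objective to

\[\min_{\tau}\; C(\tau) \;-\; \log P(\tau),\]

i.e. ordinary search for the lowest-cost path, with a log-prior bonus for staying close to the default dynamics.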

This objective comes from the free energy functional:

\[\mathcal{F}[Q] = \mathbb{E}_Q[C(\tau)] + \mathrm{KL}(Q(\tau) \,\|\, P(\tau))\]

and serves as a master objective that unifies an entire zoo of methods and algorithms.
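
Its minimizer is available in closed form, which makes the inference framing exact rather than metaphorical:

\[Q^*(\tau) \;=\; \frac{P(\tau)\, e^{-C(\tau)}}{Z}, \qquad Z = \mathbb{E}_{P}\!\left[e^{-C(\tau)}\right], \qquad \mathcal{F}[Q^*] = -\log Z.\]

That is, the optimal $Q^*$ is the posterior obtained by treating $e^{-C(\tau)}$ as the likelihood of an "optimality" observation under the prior $P(\tau)$, and the minimal free energy is the corresponding negative log evidence.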

This conceptual model can be especially helpful in research and theory development, since many disparate algorithms become special cases of the same principle!


| Category | Prior $P(\tau)$ | Variational Family $Q(\tau)$ | Limit / Scaling / Approximation |
| --- | --- | --- | --- |
| Discrete Planning | Random walk or heuristic biases | Dirac (single best path) or rollout policy | $\beta \to \infty$ (zero-temperature), beam cutoff, sampling |
| Continuous Control | Passive (uncontrolled) dynamics | Gaussian around a nominal trajectory | Linearized dynamics & quadratic cost → iLQR / DDP |
| Stochastic Optimal Control | Uncontrolled diffusion | Reweighted trajectory samples | Path-integral importance sampling (exponential reweighting) |
| Modern RL | Uniform or previous policy | Parameterized stochastic policy | Entropy regularization (SAC), KL trust region (PPO, REPS) |
| Imitation & IRL | Expert demonstrations | Soft policy matching expert moments | MaxEnt moment matching (MaxEnt IRL), adversarial cost (GAIL) |
| Hierarchical / Meta-Learning | Priors over skills or tasks $P(z)$ | Latent-augmented policies $Q(z, \tau)$ | Amortized inference, nested KL penalties (options, PEARL) |
| Risk-Sensitive & Robust Control | Adversarial or perturbed dynamics | Risk-averse or worst-case policy | Generalized divergences (Rényi, CVaR), distributional robustness |
| Multi-Agent & Game Solving | Factorized priors over agent behaviors | Joint or factored multi-agent distributions $Q(\tau_1, \dots)$ | Decentralized variational updates → Nash / mean-field equilibria |
| Generative Flow Networks | Unnormalized reward over end states | Policy over construction trajectories | Flow matching via variational objectives |
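
To make one row concrete, here is a minimal sketch of the "Stochastic Optimal Control" recipe: sample trajectories from the uncontrolled dynamics and reweight them exponentially by cost. The toy point-mass dynamics, quadratic cost, sample count, and temperature `lam` are all illustrative assumptions, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D point mass: state = (position, velocity), control = acceleration.
dt, horizon, n_samples, lam = 0.05, 40, 256, 1.0   # lam = temperature (assumed)
goal = 1.0

def rollout(controls):
    """Simulate the point mass under a control sequence and return the total cost."""
    pos, vel, cost = 0.0, 0.0, 0.0
    for t in range(horizon):
        vel += controls[t] * dt
        pos += vel * dt
        cost += (pos - goal) ** 2 * dt     # running state cost
    cost += 10.0 * (pos - goal) ** 2       # terminal cost
    return cost

# Trajectory samples from the prior: zero-mean control noise (uncontrolled dynamics).
noise = rng.normal(0.0, 1.0, size=(n_samples, horizon))
costs = np.array([rollout(noise[i]) for i in range(n_samples)])

# Exponential reweighting: each sampled path gets weight proportional to exp(-cost / lam).
weights = np.exp(-(costs - costs.min()) / lam)
weights /= weights.sum()

# The "controlled" plan is the cost-weighted average of the sampled control sequences.
u_star = weights @ noise                   # shape: (horizon,)
print("mean cost under the prior:", costs.mean())
print("cost of reweighted plan:  ", rollout(u_star))
```

Iterating this loop while re-centering the sampling distribution on the current plan gives the familiar MPPI-style algorithm; the point here is only that the core update is literally the reweighting $Q^*(\tau) \propto P(\tau)\, e^{-C(\tau)}$ from the master objective.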

To make this master objective practical, we must always consider the underlying approximations: linearizations, sampling methods, limited expressivity in $Q$, or surrogate optimization schemes.
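
For instance, the temperature scaling that appears throughout the table is just a rescaled version of the same objective, $\mathbb{E}_Q[C] + \tfrac{1}{\beta}\,\mathrm{KL}(Q \,\|\, P)$, whose solution is the Boltzmann reweighting $Q^*(\tau) \propto P(\tau)\, e^{-\beta C(\tau)}$. The sketch below, with made-up path costs and a uniform prior, shows that solution collapsing onto the single best path as $\beta \to \infty$, recovering ordinary deterministic planning.

```python
import numpy as np

# Toy discrete-planning example: five candidate paths with made-up costs
# and a uniform prior P(tau). Purely illustrative numbers.
costs = np.array([3.0, 1.0, 1.2, 4.0, 2.5])
prior = np.full(len(costs), 1.0 / len(costs))

def tempered_posterior(beta):
    """Minimizer of E_Q[C] + (1/beta) KL(Q || P): a Boltzmann reweighting of the prior."""
    logits = np.log(prior) - beta * costs
    logits -= logits.max()                 # for numerical stability
    q = np.exp(logits)
    return q / q.sum()

for beta in (0.1, 1.0, 10.0, 100.0):
    print(f"beta = {beta:6.1f}   Q = {np.round(tempered_posterior(beta), 3)}")
# As beta grows, Q concentrates on the minimum-cost path (index 1):
# the zero-temperature limit of the soft objective is classical argmin planning.
```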

Overall, I have found the Free-Energy Principle to be a very useful heuristic and unifying lens.

