🧠 TL;DR

Planning and control can be framed as Bayesian inference: by treating exponentiated negative costs as (unnormalized) likelihoods and dynamics as priors, decision-making reduces to inferring a posterior over trajectories conditioned on optimality.


Nearly all planning, optimal control, and reinforcement learning algorithms can be derived as (approximate) solutions to a variational inference problem of the form:

\[\min_Q\;\mathbb{E}_Q[C(\tau)] + \mathrm{KL}(Q(\tau) \,\|\, P(\tau))\]

by varying:

  • the form of the prior $P(\tau)$ — e.g. passive dynamics, old policy, or expert behavior;
  • the expressivity of the variational family $Q(\tau)$ — e.g. deterministic plans, stochastic policies, hierarchical structures (see the worked example after this list);
  • the approximation scheme — e.g. sampling, linearization, or taking a low-temperature limit.
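
For instance, restricting $Q$ to a Dirac delta on a single trajectory (the "deterministic plans" case, over a discrete trajectory space) collapses the objective to

\[\min_{\tau}\; C(\tau) \;-\; \log P(\tau),\]

i.e. ordinary search for the lowest-cost path, with a log-prior bonus for staying close to the default dynamics.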

This objective comes from the free energy functional:

\[\mathcal{F}[Q] = \mathbb{E}_Q[C(\tau)] + \mathrm{KL}(Q(\tau) \,\|\, P(\tau))\]

and serves as a master objective that unifies an entire zoo of methods and algorithms.
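
Its minimizer is available in closed form, which makes the inference framing exact rather than metaphorical:

\[Q^*(\tau) \;=\; \frac{P(\tau)\, e^{-C(\tau)}}{Z}, \qquad Z = \mathbb{E}_{P}\!\left[e^{-C(\tau)}\right], \qquad \mathcal{F}[Q^*] = -\log Z.\]

That is, the optimal $Q^*$ is the posterior obtained by treating $e^{-C(\tau)}$ as the likelihood of an "optimality" observation under the prior $P(\tau)$, and the minimal free energy is the corresponding negative log evidence.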

This conceptual model can be especially helpful in research and theory development, since many disparate algorithms become special cases of the same principle!


| Category | Prior $P(\tau)$ | Variational Family $Q(\tau)$ | Limit / Scaling / Approximation |
| --- | --- | --- | --- |
| Discrete Planning | Random walk or heuristic biases | Dirac (single best path) or rollout policy | $\beta \to \infty$ (zero-temperature), beam cutoff, sampling |
| Continuous Control | Passive (uncontrolled) dynamics | Gaussian around a nominal trajectory | Linearized dynamics & quadratic cost → iLQR / DDP |
| Stochastic Optimal Control | Uncontrolled diffusion | Reweighted trajectory samples | Path-integral importance sampling (exponential reweighting) |
| Modern RL | Uniform or previous policy | Parameterized stochastic policy | Entropy regularization (SAC), KL trust region (PPO, REPS) |
| Imitation & IRL | Expert demonstrations | Soft policy matching expert moments | MaxEnt moment matching (MaxEnt IRL), adversarial cost (GAIL) |
| Hierarchical / Meta-Learning | Priors over skills or tasks $P(z)$ | Latent-augmented policies $Q(z, \tau)$ | Amortized inference, nested KL penalties (options, PEARL) |
| Risk-Sensitive & Robust Control | Adversarial or perturbed dynamics | Risk-averse or worst-case policy | Generalized divergences (Rényi, CVaR), distributional robustness |
| Multi-Agent & Game Solving | Factorized priors over agent behaviors | Joint or factored multi-agent distributions $Q(\tau_1, \dots)$ | Decentralized variational updates → Nash / mean-field equilibria |
| Generative Flow Networks | Unnormalized reward over end states | Policy over construction trajectories | Flow matching via variational objectives |
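
To make one row concrete, here is a minimal sketch of the "Stochastic Optimal Control" recipe: sample trajectories from the uncontrolled dynamics and reweight them exponentially by cost. The toy point-mass dynamics, quadratic cost, sample count, and temperature `lam` are all illustrative assumptions, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D point mass: state = (position, velocity), control = acceleration.
dt, horizon, n_samples, lam = 0.05, 40, 256, 1.0   # lam = temperature (assumed)
goal = 1.0

def rollout(controls):
    """Simulate the point mass under a control sequence and return the total cost."""
    pos, vel, cost = 0.0, 0.0, 0.0
    for t in range(horizon):
        vel += controls[t] * dt
        pos += vel * dt
        cost += (pos - goal) ** 2 * dt     # running state cost
    cost += 10.0 * (pos - goal) ** 2       # terminal cost
    return cost

# Trajectory samples from the prior: zero-mean control noise (uncontrolled dynamics).
noise = rng.normal(0.0, 1.0, size=(n_samples, horizon))
costs = np.array([rollout(noise[i]) for i in range(n_samples)])

# Exponential reweighting: each sampled path gets weight proportional to exp(-cost / lam).
weights = np.exp(-(costs - costs.min()) / lam)
weights /= weights.sum()

# The "controlled" plan is the cost-weighted average of the sampled control sequences.
u_star = weights @ noise                   # shape: (horizon,)
print("mean cost under the prior:", costs.mean())
print("cost of reweighted plan:  ", rollout(u_star))
```

Iterating this loop while re-centering the sampling distribution on the current plan gives the familiar MPPI-style algorithm; the point here is only that the core update is literally the reweighting $Q^*(\tau) \propto P(\tau)\, e^{-C(\tau)}$ from the master objective.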

To make this master objective practical, we must always consider the underlying approximations: linearizations, sampling methods, limited expressivity in $Q$, or surrogate optimization schemes.
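
For instance, the temperature scaling that appears throughout the table is just a rescaled version of the same objective, $\mathbb{E}_Q[C] + \tfrac{1}{\beta}\,\mathrm{KL}(Q \,\|\, P)$, whose solution is the Boltzmann reweighting $Q^*(\tau) \propto P(\tau)\, e^{-\beta C(\tau)}$. The sketch below, with made-up path costs and a uniform prior, shows that solution collapsing onto the single best path as $\beta \to \infty$, recovering ordinary deterministic planning.

```python
import numpy as np

# Toy discrete-planning example: five candidate paths with made-up costs
# and a uniform prior P(tau). Purely illustrative numbers.
costs = np.array([3.0, 1.0, 1.2, 4.0, 2.5])
prior = np.full(len(costs), 1.0 / len(costs))

def tempered_posterior(beta):
    """Minimizer of E_Q[C] + (1/beta) KL(Q || P): a Boltzmann reweighting of the prior."""
    logits = np.log(prior) - beta * costs
    logits -= logits.max()                 # for numerical stability
    q = np.exp(logits)
    return q / q.sum()

for beta in (0.1, 1.0, 10.0, 100.0):
    print(f"beta = {beta:6.1f}   Q = {np.round(tempered_posterior(beta), 3)}")
# As beta grows, Q concentrates on the minimum-cost path (index 1):
# the zero-temperature limit of the soft objective is classical argmin planning.
```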

Overall, I have found the Free-Energy Principle to be a very useful heuristic and unifying lens.

