Planning & Control = Inference
🧠 TL;DR
Planning and control can be framed as Bayesian inference: by treating costs as (unnormalized) likelihoods and dynamics as priors, decision-making reduces to inferring a posterior over optimal trajectories.
Nearly all planning, optimal control, and reinforcement learning algorithms can be derived as variational inference problems of the form:
\[\min_Q\;\mathbb{E}_Q[C(\tau)] + \mathrm{KL}(Q(\tau) \,\|\, P(\tau))\]
by varying:
- the form of the prior $P(\tau)$ — e.g. passive dynamics, old policy, or expert behavior;
- the expressivity of the variational family $Q(\tau)$ — e.g. deterministic plans, stochastic policies, hierarchical structures;
- the approximation scheme — e.g. sampling, linearization, or taking a low-temperature limit.
This objective comes from the free energy functional:
\[\mathcal{F}[Q] = \mathbb{E}_Q[C(\tau)] + \mathrm{KL}(Q(\tau) \,\|\, P(\tau))\]
and serves as a master objective that unifies an entire zoo of methods and algorithms.
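For a concrete sanity check, the minimizer of this functional has a well-known closed form: $Q^*(\tau) \propto P(\tau)\,e^{-C(\tau)}$, with optimal value $\mathcal{F}[Q^*] = -\log \sum_\tau P(\tau)\,e^{-C(\tau)}$. Here is a minimal numerical sketch of that fact (the five trajectories, their costs, and the prior are all made up for illustration):
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete trajectory space: 5 hypothetical trajectories with
# made-up costs C(tau) and a made-up prior P(tau) (e.g. passive dynamics).
C = np.array([3.0, 1.0, 0.5, 2.0, 4.0])    # trajectory costs, lower is better
P = np.array([0.1, 0.2, 0.3, 0.25, 0.15])  # prior over trajectories

def free_energy(Q):
    """F[Q] = E_Q[C] + KL(Q || P)."""
    return np.sum(Q * C) + np.sum(Q * np.log(Q / P))

# Closed-form minimizer: Q*(tau) ∝ P(tau) exp(-C(tau)), with F[Q*] = -log Z.
Z = np.sum(P * np.exp(-C))
Q_star = P * np.exp(-C) / Z

print("F[Q*]  =", free_energy(Q_star))     # matches -log Z
print("-log Z =", -np.log(Z))

# Any other distribution does at least as badly:
for _ in range(3):
    Q = rng.dirichlet(np.ones(len(C)))
    print("F[random Q] =", free_energy(Q), ">=", free_energy(Q_star))
```
This Boltzmann-like reweighting of the prior by $e^{-C(\tau)}$ is exactly the exponential reweighting that reappears throughout the table below.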
This conceptual model can be especially helpful in research and theory development, since many disparate algorithms become special cases of the same principle!
| Category | Prior $P(\tau)$ | Variational Family $Q(\tau)$ | Limit / Scaling / Approximation |
|---|---|---|---|
| Discrete Planning | Random walk or heuristic biases | Dirac (single best path) or rollout policy | $\beta \to \infty$ (zero-temperature; sketched below), beam cutoff, sampling |
| Continuous Control | Passive (uncontrolled) dynamics | Gaussian around nominal trajectory | Linearized dynamics & quadratic cost → iLQR / DDP |
| Stochastic Optimal Control | Uncontrolled diffusion | Reweighted trajectory samples | Path-integral importance sampling (exponential reweighting; sketched below) |
| Modern RL | Uniform or previous policy | Parameterized stochastic policy | Entropy regularization (SAC), KL trust region (PPO, REPS) |
| Imitation & IRL | Expert demonstrations | Soft policy matching expert moments | MaxEnt moment matching (MaxEnt IRL), adversarial cost (GAIL) |
| Hierarchical / Meta-Learning | Priors over skills or tasks $P(z)$ | Latent-augmented policies $Q(z, \tau)$ | Amortized inference, nested KL penalties (options, PEARL) |
| Risk-Sensitive & Robust Control | Adversarial or perturbed dynamics | Risk-averse or worst-case policy | Generalized divergences (Rényi, CVaR), distributional robustness |
| Multi-Agent & Game Solving | Factorized priors over agent behaviors | Joint or factored multi-agent distributions $Q(\tau_1, \tau_2, \dots)$ | Decentralized variational updates → Nash / mean-field equilibria |
| Generative Flow Networks | Unnormalized reward over end states | Policy over construction trajectories | Flow-matching via variational objectives |
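To make the zero-temperature limit in the Discrete Planning row concrete, here is a minimal sketch of soft value iteration on a toy chain world; the environment, the unit step cost, and the temperature schedule are hypothetical choices for illustration, not a canonical benchmark. At finite $\beta$ each backup is a prior-weighted log-sum-exp ("soft min"); as $\beta \to \infty$ it collapses to ordinary shortest-path value iteration.
```python
import numpy as np

# Hypothetical toy problem: a 1-D chain of N states with an absorbing goal at the
# right end, unit step cost, and a uniform random-walk prior over {left, right}.
N = 6
GOAL = N - 1
STEP_COST = 1.0
ACTIONS = (-1, +1)
prior = np.full(len(ACTIONS), 1.0 / len(ACTIONS))

def soft_min(q, beta):
    """Numerically stable -(1/beta) * log( sum_a prior_a * exp(-beta * q_a) )."""
    m = np.min(q)
    return m - (1.0 / beta) * np.log(np.sum(prior * np.exp(-beta * (q - m))))

def soft_value_iteration(beta, iters=200):
    """Soft Bellman backups V(s) <- softmin_a [ c(s, a) + V(s') ] under the prior."""
    V = np.zeros(N)
    for _ in range(iters):
        V_new = np.zeros(N)
        for s in range(N):
            if s == GOAL:
                continue  # zero cost-to-go at the absorbing goal
            q = np.array([STEP_COST + V[np.clip(s + a, 0, N - 1)] for a in ACTIONS])
            V_new[s] = soft_min(q, beta)
        V = V_new
    return V

for beta in (1.0, 10.0, 1000.0):
    print(f"beta={beta:7.1f}  V={np.round(soft_value_iteration(beta), 2)}")
# As beta grows, V approaches the shortest-path costs [5, 4, 3, 2, 1, 0].
```
At small $\beta$ the values also pay for deviating from the random-walk prior; cranking $\beta$ up recovers the familiar deterministic planner.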
To make this master objective practical, we must always consider the underlying approximations: linearizations, sampling methods, limited expressivity in $Q$, or surrogate optimization schemes.
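As one example of the sampling route (the Stochastic Optimal Control row above), the sketch below draws control sequences from a passive Gaussian prior, reweights them by $e^{-C(\tau)/\lambda}$, and averages them, in the spirit of path-integral / MPPI-style methods. The double-integrator dynamics, the quadratic cost terms, and the temperature $\lambda$ are hypothetical choices for illustration, not a faithful reproduction of any particular implementation.
```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 1-D double integrator that should be driven to the origin.
DT, HORIZON, N_SAMPLES = 0.1, 30, 512
SIGMA, LAMBDA = 1.0, 10.0  # prior noise scale and temperature (beta = 1/lambda)

def rollout(u_seq, x0=(2.0, 0.0)):
    """Simulate a control sequence and return its trajectory cost C(tau)."""
    x = np.array(x0)
    cost = 0.0
    for u in u_seq:
        x = x + DT * np.array([x[1], u])   # Euler step of the double integrator
        cost += x[0] ** 2 + 0.1 * x[1] ** 2 + 0.01 * u ** 2
    return cost

# 1) Sample control sequences from the passive prior (zero-mean Gaussian noise).
U = rng.normal(0.0, SIGMA, size=(N_SAMPLES, HORIZON))
costs = np.array([rollout(u) for u in U])

# 2) Exponential reweighting: w(tau) ∝ exp(-C(tau) / lambda), i.e. self-normalized
#    importance weights for Q*(tau) ∝ P(tau) exp(-C(tau) / lambda) from prior samples.
w = np.exp(-(costs - costs.min()) / LAMBDA)   # subtract the min for stability
w /= w.sum()

# 3) The weighted average of the sampled controls is the approximate plan.
u_star = w @ U
print("best sampled cost    :", round(costs.min(), 2))
print("cost of weighted plan:", round(rollout(u_star), 2))
```
In a full MPPI-style loop this reweighting would be repeated receding-horizon around the previous plan; the one-shot version here only illustrates the exponential reweighting itself.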
Overall, I have found the Free-Energy Principle to be a very useful heuristic and unifying lens.