Frequentist vs. Bayesian Approaches to SMC
Introduction
Statistical Model Checking (SMC) seeks to estimate or bound the probability $p$ that a stochastic system violates (or satisfies) a formally specified logical property (e.g., temporal logic, PCTL) by sampling executions of the system through simulation. In high‐assurance domains, where certification and safety are paramount, engineers must balance frequentist guarantees, which provide hard, long‐run coverage properties, against Bayesian interpretability, which offers direct probability statements about model parameters given the observed data. This page explores how each approach frames risk and uncertainty in SMC and when to favor one over the other.
TLDR
Frequentist methods have a key guaranteed coverage property: “Over repeated experiments, the interval covers the true parameter at least $(1-\alpha)$ of the time, regardless of the true value of $p$.”
Bayesian credible intervals don’t naturally have this guarantee because they interpret probability as belief about parameters given the observed data, not repeated sampling behavior.
Even if you design a Bayesian credible interval to be wider (more conservative) than Clopper–Pearson, unless you prove or calibrate its frequentist coverage, you don’t have any formally guaranteed coverage, just a reasonable expectation!
This is why in critical fields (e.g., aerospace, nuclear, medical), regulators often prefer Clopper–Pearson or other frequentist methods, where the coverage guarantees are mathematically provable.
Which to Use When?
A short decision table:
Requirement | Frequentist SMC | Bayesian SMC |
---|---|---|
Hard coverage guarantees | ✅ | ⚠️ |
Direct probability statements on $p$ | ❌ | ✅ |
Incorporate prior knowledge | ❌ | ✅ |
Tool support in PRISM / UPPAAL SMC | ✅ | ⚠️ |
Why frequentist SMC dominates in practice
- No priors required
- Repeatability & guaranteed coverage
- Conservatism for safety/regulation
- Tooling & ecosystem adoption
Comparison of Approaches
Frequentist Interpretation
- Interpretation: Given your observed $k$ failures in $N$ trials and confidence level $1-\alpha$, the CP method yields an interval $[L,U]$ such that, if you repeated the same $N$-trial experiment infinitely often, then a fraction $(1-\alpha)$ of those intervals would contain the true failure probability $p$.
- Example 1: “Based on the observed failure rate $\hat p = \frac{k}{N} = \frac{2}{100} = 0.02,$ the point-estimate expected loss is $\hat p \times \$50\text{M} = \$1\text{M}.$ Furthermore, with 95% confidence, the true failure probability is at most $U \approx 0.0707$, so the maximum expected loss is $0.0707 \times \$50\text{M} \;\approx\; \$3.54\text{M}.$”
- Example 2: “With 95% confidence, the true catastrophic-failure rate is no more than $5\times10^{-10}$ failures per flight-hour, i.e., we can bound the risk at at most one failure in 2 billion flight-hours in 95% of identical test programs.”
- Key Takeaway: You get a worst-case bound on risk that holds in at least 95% of repeated experiments, but you do not get a probability statement about $p$ in this single experiment.
Bayesian Interpretation
- Interpretation: Given your observed $k$ failures in $N$ trials, a prior distribution on $p$ (e.g. $\mathrm{Beta}(1,1)$), and confidence level $1-\alpha$, the Bayesian method yields a posterior $\mathrm{Beta}(1+k,\,1+N-k)$ and constructs an interval $[L,U]$ such that
\(\Pr\bigl(p\in[L,U]\mid\text{data}\bigr)=1-\alpha.\) - Example 1: “Based on the posterior mean $\mathbb{E}[p \mid \text{data}] = \frac{3}{102} \approx 0.0294,$ the point-estimate expected loss is $0.0294 \times \$50\text{M} = \$1.47\text{M}.$ Furthermore, with 95% probability (given the data and a uniform prior), the true failure probability lies in the interval $[0.0034, 0.0617],$ so the expected loss lies between $0.0034 \times \$50\text{M} = \$0.17\text{M}$ and $0.0617 \times \$50\text{M} = \$3.085\text{M}.$”
- Example 2: “Given the test data and a uniform prior, there’s a 95% probability that the catastrophic-failure rate is below $4\times10^{-10}$ per flight-hour, equivalent to fewer than one failure in about 2.5 billion flight-hours.”
- Key Takeaway: You obtain a direct probability statement about $p$ (and hence about risk) given the data, plus both a “best‐estimate” (the posterior mean) and a 95% credible range on the expected loss.
Deep Dive
Frequentist SMC
– Core procedures
- Clopper–Pearson intervals for fixed-sample inference
- Sequential Probability Ratio Test (SPRT) for early stopping
- Wald’s approximations (normal, Wilson) for large-sample bounds
– Guarantees
- Controls Type I error at level $\alpha$
- Coverage ≥ $(1−\alpha)$ across repeated experiments
- Worst-case risk bounds with no reliance on priors
– When & why to apply
- Certification or regulatory scenarios requiring provable error rates
- Quick violation detection via SPRT when failures are rare
- Use in toolchains (PRISM, UPPAAL SMC) that default to frequentist tests
Bayesian SMC
– Choosing priors
- Uniform $\mathrm{Beta}(1,1)$ for non-informative baseline
- Jeffreys $\mathrm{Beta}(\frac{1}{2},\frac{1}{2})$ for objective, near-frequentist coverage
- Robust priors (e.g. $\mathrm{Beta}(\alpha,\alpha)$ with $\alpha<1$) to widen intervals conservatively
– Computing posteriors & credible intervals
- Closed-form Beta–Binomial update for simple counts
- Numerical inversion or Monte Carlo for complex properties
- Posterior predictive checks for model validation
– Hybrid/robust approaches to restore coverage
- Calibrated priors that match Clopper–Pearson bounds
- Worst-case credible intervals over a class of priors
- Mixture or hierarchical priors for sensitivity analysis
– When & why to apply
- Incorporating expert or historical data into SMC
- Sequential updating as new simulation results arrive
- Generating direct probability statements for risk-based decisions
References & Further Reading
- See papers on Clopper-Pearson vs. Jeffreys intervals, confidence posteriors, and robust Bayesian SMC.