Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

SW20 Readme

1. This document

The rest of this document outlines the overall statistical model and focuses on a few issues related to the numerical estimation of the posterior. In order to understand the underlying Bayesian framework, readers should familiarize themselves with Sherwood, Webb, et al. 2020, and Marvel and Webb 2025.

2. Stan syntax primer

Stan requires us to define the parameters (i.e. the primary, unconstrained parameters that will be Monte Carlo sampled), any transformed parameters (i.e. parameters deterministically constrained by the primary parameters), and the model formulation. To understand how to formulate a model in Stan, let’s take as an example a posterior with two independent parameters θ1,2\theta_{1,2} and two independent pieces of observational evidence y1y_{1}, y2y_{2}, which would make the posterior:

p(θy)=p(y1,y2θ1,θ2)p(θ1,θ2)=p(y1θ1,θ2)p(y2θ1,θ2)p(θ1)p(θ2)p(\theta\mid y)=p(y_{1},y_{2}\mid\theta_{1},\theta_{2})\,p(\theta_{1},\theta_{2})=p(y_{1}\mid\theta_{1},\theta_{2})\,p(y_{2}\mid\theta_{1},\theta_{2})\,p(\theta_{1})\,p(\theta_{2})

If, say, the probabilities of y1,y2,θ1,θ2y_{1},y_{2},\theta_{1},\theta_{2} are Gaussian, we could also write this Bayesian model as

y1θ1,θ2N(μy1(θ1,θ2),  σy1(θ1,θ2))y_{1}\mid\theta_{1},\theta_{2}\sim\mathcal{N} \left(\mu_{y_{1}}(\theta_{1},\theta_{2}),\;\sigma_{y_{1}} \left(\theta_{1},\theta_{2}\right)\right)
y2θ1,θ2N(μy2(θ1,θ2),  σy2(θ1,θ2))y_{2}\mid\theta_{1},\theta_{2}\sim\mathcal{N} \left(\mu_{y_{2}}(\theta_{1},\theta_{2}),\;\sigma_{y_{2}} \left(\theta_{1},\theta_{2}\right)\right)
θ1N(μθ1,σθ1)\theta_{1}\sim\mathcal{N} \left(\mu_{\theta_{1}},\sigma_{\theta_{1}}\right)
θ2N(μθ2,σθ2).\theta_{2}\sim\mathcal{N} \left(\mu_{\theta_{2}},\sigma_{\theta_{2}}\right).

Stan uses two possible formulations. The simplest form of the syntax is based on the second, distributional formulation, where we define the model by simply writing out the four distributions above.

In the back-end, each distributional statement adds to the log-likelihood. For example, since θ1\theta_{1} has a Gaussian distribution,

pθ1(θ1)=(2πσθ12)1/2exp(12(θ1μθ1)2σθ12)p_{\theta_{1}}(\theta_{1})=\left(2\pi\sigma_{\theta_{1}}^{2}\right)^{-1/2}\exp \left(-\frac{1}{2}\frac{\left(\theta_{1}-\mu_{\theta_{1}}\right)^{2}}{\sigma_{\theta_{1}}^{2}}\right)
logpθ1(θ1)=12log(2π)logσθ112(θ1μθ1)2σθ12\log p_{\theta_{1}}(\theta_{1})=-\frac{1}{2}\log \left(2\pi\right)-\log\sigma_{\theta_{1}}-\frac{1}{2}\frac{\left(\theta_{1}-\mu_{\theta_{1}}\right)^{2}}{\sigma_{\theta_{1}}^{2}}

The distributional statement

θ1N(μθ1,σθ1)\theta_{1}\sim\mathcal{N} \left(\mu_{\theta_{1}},\sigma_{\theta_{1}}\right)

is equivalent to adding logpθ1(θ1)\log p_{\theta_{1}}(\theta_{1}) to the log-posterior (the target, in Stan language),

logL=logL+logpθ1(θ1).\log\mathcal{L}=\log\mathcal{L}+\log p_{\theta_{1}}(\theta_{1}).

We can also directly add to log-posterior, and we will need to do that to account for a change of variables if we want to go from uniform SS to uniform λ\lambda priors without having to change the model structure.

3. The Sherwood-Webb et al. model

3.1 Lines of evidence

SW20 uses four lines of evidence (sec. 7 of SW20): process, historical, cold past climates (LGM) and warm past climates (Pliocene). Under the independence ansatz (eq. 11), the posterior for climate sensitivity, SS, is:

p(SEproc,Ehist,Epaleo)p(EprocS)p(EhistS)p(ELGMS)p(EplioS)pS(S)p(S\mid E_{\text{proc}},E_{\text{hist}},E_{\text{paleo}})\propto p \left(E_{\text{proc}}\mid S\right)\,p \left(E_{\text{hist}}\mid S\right)\,p \left(E_{\text{LGM}}\mid S\right)\,p \left(E_{\text{plio}}\mid S\right)\,p_{S}(S)
p(SEproc,Ehist,Epaleo)Lproc(S)Lhist(S)LLGM(S)Lplio(S)pS(S)p(S\mid E_{\text{proc}},E_{\text{hist}},E_{\text{paleo}})\propto\mathcal{L}_{\text{proc}} \left(S\right)\mathcal{L}_{\text{hist}} \left(S\right)\mathcal{L}_{\text{LGM}} \left(S\right)\mathcal{L}_{\text{plio}} \left(S\right)p_{S}(S)

The parameters and their coupling equations are as follows:

3.2 Nuisance parameters, primary parameters, and constrained parameters

A Bayesian calculation computes the probability of parameters, θ\theta, given observations yy:

p(θy)p(yθ)p(θ).p(\theta\mid y)\propto p \left(y\mid\theta\right)p(\theta).

The SW20 framework is unusual in that it does not provide data points yjy_{j} that can be evaluated by plugging them into a parametric likelihood. Rather, it provides distributions of the likelihoods for different lines of evidence, p(EiS)p \left(E_{i}\mid S\right), that are written in terms of the distributions of certain variables (e.g. Thist,Fhist,Nhist,ΔλT_{\text{hist}}, F_{\text{hist}}, N_{\text{hist}}, \Delta\lambda for the historical evidence). These distributions need to be interpreted as likelihoods rather than frequentist sample distributions (see also Marvel and Webb 2025). Let’s take the example of the historical data. For a Bayesian calculation we need to write a likelihood as a probability of the evidence conditional on the parameter SS we want to estimate:

Lhist(S)=p(EhistS).\mathcal{L}_{\text{hist}} \left(S\right)=p \left(E_{\text{hist}}\mid S\right).

However, SW20 does not provide data points, but rather distributions for (Thist,Nhist,Fhist,Δλ)\left(T_{\text{hist}},N_{\text{hist}},F_{\text{hist}},\Delta\lambda\right). The likelihood of the historical evidence can be written as the likelihood of a given combination of variables:

Lhist(S)=p(EhistS)=p(Thist,Nhist,Fhist,Δλ).\mathcal{L}_{\text{hist}} \left(S\right)=p \left(E_{\text{hist}}\mid S\right)=p(T_{\text{hist}},N_{\text{hist}},F_{\text{hist}},\Delta\lambda).

Thus (Thist,Nhist,Fhist,Δλ)\left(T_{\text{hist}},N_{\text{hist}},F_{\text{hist}},\Delta\lambda\right) are not data points but should be viewed as parameters — nuisance parameters, since we will integrate them out to get a marginal likelihood or posterior for the parameter of interest SS.

Ostensibly the likelihood written above is not a function of SS (or λ\lambda). However, SS links F,T,N,ΔλF,T,N,\Delta\lambda such that for a given value of SS, the four values are not independent. Thus, if we introduce SS as a parameter, only three of the other four variables are independent, and one has to become dependent on the others. For example, if we choose temperature as the dependent variable, we can write

T(F,N,Δλ,S)=(FN)λΔλ=(FN)F2×/SΔλ.T(F,N,\Delta\lambda,S)=\frac{- \left(F-N\right)}{\lambda-\Delta\lambda}=\frac{- \left(F-N\right)}{-F_{2\times}/S-\Delta\lambda}.

So we can write temperature as a function of λ\lambda or of S,F2×S, F_{2\times} and write the likelihood as:

p(T,N,F,ΔλS,F2×)=p(TS,F2×)p(N)p(F)p(Δλ)p \left(T,N,F,\Delta\lambda\mid S,F_{2\times}\right)=p(T\mid S,F_{2\times})\,p(N)\,p(F)\,p(\Delta\lambda)
L=p(T,N,F,ΔλS,F2×)exp[((FN)F2×/SΔλμT)22σT2]exp[(NμN)22σN2]exp[(ΔλμΔλ)22σΔλ2]exp[(FμF)22σF2].\mathcal{L}=p \left(T,N,F,\Delta\lambda\mid S,F_{2\times}\right)\propto\exp \left[-\frac{\left(\frac{-\left(F-N\right)}{-F_{2\times}/S-\Delta\lambda}-\mu_{T}\right)^{2}}{2\sigma_{T}^{2}}\right]\exp \left[-\frac{\left(N-\mu_{N}\right)^{2}}{2\sigma_{N}^{2}}\right]\exp \left[-\frac{\left(\Delta\lambda-\mu_{\Delta\lambda}\right)^{2}}{2\sigma_{\Delta\lambda}^{2}}\right]\exp \left[-\frac{\left(F-\mu_{F}\right)^{2}}{2\sigma_{F}^{2}}\right].

Thus temperature is now a constrained parameter that is a deterministic function of other parameters. In practice we will sample the primary (unconstrained) parameters (Nhist,Fhist,Δλ,F2×,S)\left(N_{\text{hist}},F_{\text{hist}},\Delta\lambda,F_{2\times},S\right) via a Monte Carlo technique, and we’ll compute the constrained parameter ThistT_{\text{hist}} (transformed parameter in Stan’s language) from the primary parameters, before plugging it into the likelihood distribution for ThistT_{\text{hist}}. Note that we do not necessarily need to make ThistT_{\text{hist}} the constrained parameter, and we could make NhistN_{\text{hist}}, FhistF_{\text{hist}}, or even Δλ\Delta\lambda instead.

3.3 The full model:

p(SE)Lproc(S)Lhist(S)LLGM(S)Lplio(S)p(ζ)p(F2×)pS(S)Lproc=p(λS,F2×)Lhist=p(ThistS,Nhist,Δλ,Fhist)p(Nhist)p(Fhist)p(Δλ)LLGM=p(FLGMS,TLGM,ζ,α)p(TLGM)p(α)Lplio=p(TplioS,Fplio,fESS,fCH4)p(Fplio)p(fESS)p(fCH4)\begin{aligned} p(S\mid E) & \propto\mathcal{L}_{\text{proc}} \left(S\right)\mathcal{L}_{\text{hist}} \left(S\right)\mathcal{L}_{\text{LGM}} \left(S\right)\mathcal{L}_{\text{plio}} \left(S\right)\,p(\zeta)\,p(F_{2\times})\,p_{S}(S)\\ \mathcal{L}_{\text{proc}} & =p(\lambda\mid S,F_{2\times})\\ \mathcal{L}_{\text{hist}} & =p \left(T_{\text{hist}}\mid S,N_{\text{hist}},\Delta\lambda,F_{\text{hist}}\right)\,p \left(N_{\text{hist}}\right)\,p \left(F_{\text{hist}}\right)\,p(\Delta\lambda)\\ \mathcal{L}_{\text{LGM}} & =p \left(F_{\text{LGM}}\mid S,T_{\text{LGM}},\zeta,\alpha\right)\,p \left(T_{\text{LGM}}\right)\,p(\alpha)\\ \mathcal{L}_{\text{plio}} & =p \left(T_{\text{plio}}\mid S,F_{\text{plio}},f_{\text{ESS}},f_{\text{CH}_4}\right)\,p(F_{\text{plio}})\,p \left(f_{\text{ESS}}\right)\,p(f_{\text{CH}_4}) \end{aligned}

4. Implementation Issues

4.1 Choices of constrained parameters

The coupling equations link SS with the nuisance parameters used in the likelihood formulation. However, it is up to us to choose which of the parameters becomes the constrained parameter. For example, the following two formulations for the historical likelihood are equivalent:

Lhist(S)=p(TS,F2×)p(N)p(F)p(Δλ);withT=(FN)λΔλ=(FN)F2×/SΔλ\mathcal{L}_{\text{hist}} \left(S\right)=p(T\mid S,F_{2\times})\,p(N)\,p(F)\,p(\Delta\lambda); \quad\text{with}\quad T=\frac{-\left(F-N\right)}{\lambda-\Delta\lambda}=\frac{-\left(F-N\right)}{-F_{2\times}/S-\Delta\lambda}
Lhist(S)=p(NS,F2×)p(T)p(F)p(Δλ);withN=FT(F2×/SΔλ).\mathcal{L}_{\text{hist}} \left(S\right)=p(N\mid S,F_{2\times})\,p(T)\,p(F)\,p(\Delta\lambda); \quad\text{with}\quad N=F-T \left(-F_{2\times}/S-\Delta\lambda\right).

However, there are a couple of considerations that can help us decide which parameter to choose as the constrained parameter:

1. Pick the formulation that doesn’t require solving a non-trivial inverse. The LGM equation (eq. 22) is quadratic in TLGMT_{\text{LGM}} because of the state-dependence term α2(ΔT)2\tfrac{\alpha}{2}(\Delta T)^{2}:

α2TLGM2+λ1+ζTLGM[0.57F2×FLGM]=0.\frac{\alpha}{2}T_{\text{LGM}}^{\,2}+\frac{\lambda}{1+\zeta}T_{\text{LGM}}-\bigl[0.57\,F_{2\times}-F_{\text{LGM}}\bigr]=0.

Treating TLGMT_{\text{LGM}} as the S-coupling would require solving this for TLGMT_{\text{LGM}} via the quadratic formula, which has (i) two roots and a branch-picking problem, (ii) a 1/α1/\alpha singularity that is hit constantly because αN(0.1,0.1)\alpha\sim\mathcal{N}(0.1,0.1) passes through zero, and (iii) a discriminant that can go negative near the boundary of physical configurations, producing NaN in autodiff. Going the other way — sampling TLGMT_{\text{LGM}} as primary and computing FLGMF_{\text{LGM}} forward — is a single evaluation of a smooth polynomial, with no branches and no singularities. So FLGMF_{\text{LGM}} is forced to be the S-coupling for LGM.

2. Don’t put a kinked / discontinuous likelihood on a transformed parameter. FhistF_{\text{hist}} has Sherwood’s asymmetric Gaussian observation (different σ\sigma above and below μF\mu_{F}); the log-density jumps by log(σlo/σhi)0.747\log(\sigma_{\text{lo}}/\sigma_{\text{hi}})\approx0.747 at Fhist=μFF_{\text{hist}}=\mu_{F}. In the Stan Hamiltonian Monte Carlo (HMC) sampler, if FhistF_{\text{hist}} were the historical S-coupling (transformed parameter receiving the asymmetric likelihood), that discontinuity would propagate through the Hamiltonian energy-balance equation into the joint log-density in primary-parameter space, biasing HMC. So we do not want to use FhistF_{\text{hist}}. We could choose either NhistN_{\text{hist}} or Δλ\Delta\lambda instead.

The primary unconstrained parameters define Stan’s sampling space, and they are declared in the parameters block. We will use the following unconstrained parameters:

θ=[S,F2×,ζ,Fhist,Nhist,Δλ,TLGM,α,CO2,plio,fCH4,fESS].\theta=\bigl[S,\,F_{2\times},\,\zeta,\,F_{\text{hist}},\,N_{\text{hist}},\,\Delta\lambda,\,T_{\text{LGM}},\,\alpha,\,\mathrm{CO}_{2,\text{plio}},\,f_{\mathrm{CH}_4},\,f_{\text{ESS}}\bigr].

Stan allows us to define any number of constrained parameters, which are declared in the transformed parameters block. These will consist not only of the three parameters defined by the three coupling equations needed to write the likelihood, but also any other constrained parameter that might make the model formulation easier, or that we want to output. For example, we may want to define λ=F2×/S\lambda=-F_{2\times}/S or Fplio=log2(CO2,plio/284)F2×F_{\text{plio}}=\log_2 \left(\mathrm{CO}_{2,\text{plio}}/284\right)\cdot F_{2\times}. In this model formulation, I’ve chosen ThistT_{\text{hist}}, FLGMF_{\text{LGM}}, and TplioT_{\text{plio}}. Any other formulation that does not violate rules 1 and 2 should work.

4.2 Avoiding biases from the historical aerosol forcing formulation

tl;dr: FhistF_{hist} has an asymmetric Gaussian likelihood. Sampling FhistF_{hist} directly through Stan’s creates a finite log-density jump at F=μFF=\mu_F that biases HMC. We sample an unconstrained z-score F_z ~ N(0,1) and define F_hist = mu_F + F_z * sigma(sign(F_z)) as a transformed parameter; this induces exactly the same asymmetric distribution but is smooth in sampling space.

SW20- encode the 5/50/95 percentiles of FhistF_{\text{hist}} via two half-Gaussians glued at μF\mu_{F},

pF(F)  =  {1σlo2πexp(12(FμFσlo)2),F<μF,1σhi2πexp(12(FμFσhi)2),FμF,σlo=1.131,  σhi=0.535.p_{F}(F)\;=\;\begin{cases} \dfrac{1}{\sigma_{\text{lo}}\sqrt{2\pi}}\exp \Big(-\tfrac{1}{2}\big(\tfrac{F-\mu_{F}}{\sigma_{\text{lo}}}\big)^{2}\Big), & F<\mu_{F},\\[8pt] \dfrac{1}{\sigma_{\text{hi}}\sqrt{2\pi}}\exp \Big(-\tfrac{1}{2}\big(\tfrac{F-\mu_{F}}{\sigma_{\text{hi}}}\big)^{2}\Big), & F\ge\mu_{F}, \end{cases}\qquad\sigma_{\text{lo}}=1.131,\;\sigma_{\text{hi}}=0.535.

Each half integrates to 12\tfrac{1}{2}, so pFp_{F} is a proper PDF, but it is discontinuous at F=μFF=\mu_F, where the log-density jumps by

ΔlogpF  =  log(σlo/σhi)    0.747.\Delta\log p_{F}\;=\;\log \bigl(\sigma_{\text{lo}}/\sigma_{\text{hi}}\bigr)\;\approx\;0.747.

HMC’s leapfrog integrator assumes a continuously differentiable target, and the jump in (log-)density leads to a slight bias in the posterior.

Fix. Introduce zN(0,1)z\sim\mathcal{N}(0,1) and define FF deterministically:

F(z)  =  μF  +  {zσlo,z<0,zσhi,z0.F(z)\;=\;\mu_{F}\;+\;\begin{cases} z\,\sigma_{\text{lo}}, & z<0,\\ z\,\sigma_{\text{hi}}, & z\ge0. \end{cases}

Under this map, FF has exactly the asymmetric Gaussian density of Section 4.2: for F<μFF<\mu_{F}, we have z=(FμF)/σloz=(F-\mu_{F})/\sigma_{\text{lo}} and dF/dz=σlo\lvert\mathrm{d}F/\mathrm{d}z\rvert=\sigma_{\text{lo}}, so

p(F)  =  pz(z)dF/dz  =  1σlo2πexp(12(FμFσlo)2)p(F)\;=\;\frac{p_{z}(z)}{\lvert\mathrm{d}F/\mathrm{d}z\rvert}\;=\;\frac{1}{\sigma_{\text{lo}}\sqrt{2\pi}}\exp \Bigl(-\tfrac{1}{2}\bigl(\tfrac{F-\mu_{F}}{\sigma_{\text{lo}}}\bigr)^{2}\Bigr)

and analogously on the other side. Critically, in the sampling space (zz-space) the prior is N(0,1)\mathcal{N}(0,1), which is everywhere smooth. The map F(z)F(z) is continuous (both branches give F=μFF=\mu_{F} at z=0z=0); only its derivative has a kink at z=0z=0, which HMC handles correctly. The discontinuous jump in log-density that confused leapfrog in FF-space has been moved to a kink in the transformation, where it is benign.

4.3 Switching between uniform-λ\lambda and uniform-SS priors

SW20 presents two choices of priors: uniform in λ\lambda and uniform in SS. If we write the model in SS-space, with SS as a primary parameter, declaring S with <lower=A, upper=B> is equivalent to a prior on SS that is uniform on (A,B)(A,B).

Setting λ=F2×/S\lambda=-F_{2\times}/S at fixed F2×F_{2\times}, the change of variable gives

SλF2×=F2×λ2=S2F2×,λSF2×=F2×S2.\frac{\partial S}{\partial\lambda}\bigg|_{F_{2\times}}=\frac{F_{2\times}}{\lambda^{2}}=\frac{S^{2}}{F_{2\times}},\qquad\frac{\partial\lambda}{\partial S}\bigg|_{F_{2\times}}=\frac{F_{2\times}}{S^{2}}.

The implicit prior on (λ,F2×)(\lambda,F_{2\times}) is then

p(λ,F2×)=pS(S(λ,F2×))SλF2×pF2×(F2×)  =  pS(S(λ,F2×))F2×λ2pF2×(F2×)    F2×λ2pF2×(F2×),p(\lambda,F_{2\times})=p_{S} \left(S(\lambda,F_{2\times})\right)\,\frac{\partial S}{\partial\lambda}\bigg|_{F_{2\times}}\,p_{F_{2\times}}(F_{2\times})\;=\;p_{S} \left(S(\lambda,F_{2\times})\right)\cdot\frac{F_{2\times}}{\lambda^{2}}\cdot p_{F_{2\times}}(F_{2\times})\;\propto\;\frac{F_{2\times}}{\lambda^{2}}\,p_{F_{2\times}}(F_{2\times}),

i.e. non-uniform in λ\lambda at fixed F2×F_{2\times}. This is the “USU_{S}” prior of Sherwood (sec. 7.2): uniform on SS.

To recover the paper’s UλU_{\lambda} prior (uniform on each component feedback λi\lambda_{i}, which by convolution is approximately uniform on λ\lambda) we can either rewrite the model to use λ\lambda as a primary parameter and give it a uniform prior, or make the change of variables by adding

target += log(F_2xCO2) - 2 * log(S);

This multiplies the joint by S2/F2×S^{2}/F_{2\times}, giving an implicit prior on (λ,F2×)(\lambda,F_{2\times}) proportional to pF2×(F2×)p_{F_{2\times}}(F_{2\times}) — i.e. uniform in λ\lambda at every F2×F_{2\times} slice, with F2×F_{2\times} retaining its Gaussian prior. This matches Sherwood’s UλU_{\lambda} exactly (uniform-λ\lambda ×\times Gaussian-F2×F_{2\times}).

References
  1. Sherwood, S. C., Webb, M. J., Annan, J. D., Armour, K. C., Forster, P. M., Hargreaves, J. C., Hegerl, G., Klein, S. A., Marvel, K. D., Rohling, E. J., Watanabe, M., Andrews, T., Braconnot, P., Bretherton, C. S., Foster, G. L., Hausfather, Z., von der Heydt, A. S., Knutti, R., Mauritsen, T., … Zelinka, M. D. (2020). An Assessment of Earth’s Climate Sensitivity Using Multiple Lines of Evidence. Reviews of Geophysics, 58(4). 10.1029/2019rg000678
  2. Marvel, K., & Webb, M. (2025). Towards robust community assessments of the Earth’s climate sensitivity. Earth System Dynamics, 16(1), 317–332. 10.5194/esd-16-317-2025