MCMC sampling
In this example, we demonstrate how to perform uncertainty quantification (UQ) using parallel tempered MCMC (PTMCMC). We use a Stillinger-Weber (SW) potential for silicon that is archived in OpenKIM.
For simplicity, we only set the energy-scaling parameters, i.e., $A$ and $\lambda$, as the tunable parameters. Furthermore, these parameters are physically constrained to be positive, so we will work in a log parameterization, i.e., $\log(A)$ and $\log(\lambda)$.
These parameters will be calibrated to the energies and forces of a small dataset, consisting of 4 compressed and stretched configurations of the diamond silicon structure.
To start, let’s first install the SW model:
$ kim-api-collections-management install user SW_StillingerWeber_1985_Si__MO_405512056662_006
See also
This installs the model and its driver into the User Collection. See Install a model for more information about installing KIM models.
from multiprocessing import Pool
import numpy as np
from corner import corner
from kliff.calculators import Calculator
from kliff.dataset import Dataset
from kliff.dataset.weight import MagnitudeInverseWeight
from kliff.loss import Loss
from kliff.models import KIMModel
from kliff.models.parameter_transform import LogParameterTransform
from kliff.uq import MCMC, autocorr, mser, rhat
from kliff.utils import download_dataset
Before running MCMC, we need to define a loss function and train the model. More detailed information about this step can be found in Train a Stillinger-Weber potential and Parameter transformation for the Stillinger-Weber potential.
# Instantiate a transformation class to do the log parameter transform
param_names = ["A", "lambda"]
params_transform = LogParameterTransform(param_names)
# Create the model
model = KIMModel(
    model_name="SW_StillingerWeber_1985_Si__MO_405512056662_006",
    params_transform=params_transform,
)
# Set the tunable parameters and the initial guess
opt_params = {
    "A": [["default", -8.0, 8.0]],
    "lambda": [["default", -8.0, 8.0]],
}
model.set_opt_params(**opt_params)
model.echo_opt_params()
# Get the dataset and set the weights
dataset_path = download_dataset(dataset_name="Si_training_set_4_configs")
# Instantiate the weight class
weight = MagnitudeInverseWeight(
    weight_params={
        "energy_weight_params": [0.0, 0.1],
        "forces_weight_params": [0.0, 0.1],
    }
)
# Read the dataset and compute the weight
tset = Dataset(dataset_path, weight=weight)
configs = tset.get_configs()
# Create calculator
calc = Calculator(model)
ca = calc.create(configs)
# Instantiate the loss function
residual_data = {"normalize_by_natoms": False}
loss = Loss(calc, residual_data=residual_data)
# Train the model
loss.minimize(method="L-BFGS-B", options={"disp": True})
model.echo_opt_params()
Out:
#================================================================================
# Model parameters that are optimized.
# Note that the parameters are in the transformed space if
# `params_transform` is provided when instantiating the model.
#================================================================================
A 1
2.7268620056558381e+00 -8.0000000000000000e+00 8.0000000000000000e+00
lambda 1
3.8184197679684773e+00 -8.0000000000000000e+00 8.0000000000000000e+00
/home/yonatank/.local/lib/python3.8/site-packages/numpy/linalg/linalg.py:2500: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
x = asarray(x)
2022-08-10 17:00:23.489 | INFO | kliff.dataset.dataset:_read:398 - 4 configurations read from /home/yonatank/modules/kliff/examples/Si_training_set_4_configs
2022-08-10 17:00:23.493 | INFO | kliff.calculators.calculator:create:107 - Create calculator for 4 configurations.
2022-08-10 17:00:23.494 | INFO | kliff.loss:minimize:290 - Start minimization using method: L-BFGS-B.
2022-08-10 17:00:23.496 | INFO | kliff.loss:_scipy_optimize:404 - Running in serial mode.
2022-08-10 17:00:23.675 | INFO | kliff.loss:minimize:292 - Finish minimization using method: L-BFGS-B.
#================================================================================
# Model parameters that are optimized.
# Note that the parameters are in the transformed space if
# `params_transform` is provided when instantiating the model.
#================================================================================
A 1
2.7269268430321811e+00 -8.0000000000000000e+00 8.0000000000000000e+00
lambda 1
3.8183682461406869e+00 -8.0000000000000000e+00 8.0000000000000000e+00
To perform the MCMC simulation, we use the MCMC class. It interfaces with the ptemcee Python package to run PTMCMC, which utilizes the affine-invariance property of MCMC sampling. We simulate MCMC sampling at several different temperatures to explore the effect of the scale of bias and overall error bars.
# Define some variables that correspond to the dimensionality of the problem
ntemps = 4 # Number of temperatures to simulate
ndim = calc.get_num_opt_params() # Number of parameters
nwalkers = 2 * ndim # Number of parallel walkers to simulate
We start by instantiating MCMC. This requires a Loss instance to construct the likelihood function. Additionally, we can specify the prior (or, more precisely, the log-prior) via the logprior_fn argument; the default is a uniform prior bounded over a finite range, which we specify via the logprior_args argument. A sketch of a custom log-prior is given after the note below.
Note
When the user uses the default uniform prior but does not specify the bounds, the sampler will retrieve the bounds from the model (see set_opt_params()). Note that an error will be raised when the uniform prior extends to infinity in any parameter direction.
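As an illustration of a non-default prior, the following sketch defines a Gaussian log-prior over the log-transformed parameters. The function name and the assumption that extra arguments are forwarded via logprior_args are for illustration only and should be checked against the MCMC documentation.
# Sketch of a custom log-prior (hypothetical example; verify the exact signature
# expected by `logprior_fn` in the KLIFF documentation).
def gaussian_logprior(x, mean, sigma):
    """Independent Gaussian log-prior over the (log-transformed) parameters."""
    return -0.5 * np.sum(((np.asarray(x) - mean) / sigma) ** 2)

# Hypothetical usage, assuming `logprior_args` is passed on to `logprior_fn`:
# sampler = MCMC(loss, ntemps=ntemps, logprior_fn=gaussian_logprior,
#                logprior_args=(calc.get_opt_params(), 1.0))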
To specify the sampling temperatures, we use the arguments ntemps and Tmax_ratio to set how many temperatures to simulate and the ratio of the highest temperature to the natural temperature $T_0$, respectively. The default values of ntemps and Tmax_ratio are 10 and 1.0, respectively. An internal function then creates a list of logarithmically spaced points from 1.0 to $T_{\mathrm{max\_ratio}} \times T_0$. Alternatively, we can give a list of temperatures via the Tladder argument, which overrides ntemps and Tmax_ratio (see the sketch after the following note).
Note
It has been shown that including temperatures higher than $T_0$ helps the convergence of walkers sampled at $T_0$.
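For illustration, a custom ladder could be constructed and passed in explicitly. The sketch below only builds the list and leaves the sampler call commented out, since the actual sampler for this example is created further down (and bounds is only defined there).
# Sketch: a custom, logarithmically spaced temperature ladder passed via the
# `Tladder` argument (this overrides `ntemps` and `Tmax_ratio`).
custom_ladder = list(np.logspace(0.0, 1.0, 4))  # temperatures from 1.0 to 10.0
# sampler = MCMC(loss, logprior_args=(bounds,), Tladder=custom_ladder)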
The sampling process can be parallelized by specifying a pool. Note that the pool needs to be declared after instantiating MCMC, since the posterior function is defined during that process.
# Set the boundaries of the uniform prior
bounds = np.tile([-8.0, 8.0], (ndim, 1))
# It is a good practice to specify the random seed to use in the calculation to generate
# a reproducible simulation.
seed = 1717
np.random.seed(seed)
# Create a sampler
sampler = MCMC(
    loss,
    ntemps=ntemps,
    logprior_args=(bounds,),
    random=np.random.RandomState(seed),
)
# Declare a pool to use parallelization
sampler.pool = Pool(nwalkers)
Note
By default, the algorithm sets the number of walkers for each sampling temperature to twice the number of parameters, but we can also specify it via the nwalkers argument, as sketched below.
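A minimal sketch of passing nwalkers explicitly is shown below; it is left commented out because the sampler above already uses the default.
# Sketch: request more walkers per temperature via the `nwalkers` argument.
# sampler = MCMC(
#     loss,
#     ntemps=ntemps,
#     nwalkers=4 * ndim,
#     logprior_args=(bounds,),
# )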
To run the MCMC sampling, we use run_mcmc(). This function requires us to provide initial states for each temperature and walker. We also need to specify the number of steps or iterations to take.
Note
The initial states need to be an array of shape (K, L, N), where K, L, and N are the number of temperatures, walkers, and parameters, respectively.
# Initial starting point. This should be provided by the user.
p0 = np.empty((ntemps, nwalkers, ndim))
for ii, bound in enumerate(bounds):
    p0[:, :, ii] = np.random.uniform(*bound, (ntemps, nwalkers))
# Run MCMC
sampler.run_mcmc(p0, 5000)
sampler.pool.close()
# Retrieve the chain
chain = sampler.chain
The resulting chains still need to be processed. First, we need to discard the first few iterations at the beginning of each chain as burn-in time. This is similar to the equilibration time in a molecular dynamics simulation before we can start taking measurements. KLIFF provides a function to estimate the burn-in time, based on the Marginal Standard Error Rule (MSER). This can be accessed via mser().
# Estimate equilibration time using MSER for each temperature, walker, and dimension.
mser_array = np.empty((ntemps, nwalkers, ndim))
for tidx in range(ntemps):
    for widx in range(nwalkers):
        for pidx in range(ndim):
            mser_array[tidx, widx, pidx] = mser(
                chain[tidx, widx, :, pidx], dmin=0, dstep=10, dmax=-1
            )
burnin = int(np.max(mser_array))
print(f"Estimated burn-in time: {burnin}")
Out:
Estimated burn-in time: 480
Note
mser() only computes the estimate of the burn-in time for a single temperature, walker, and parameter. Thus, we need to calculate the burn-in time for each temperature, walker, and parameter separately.
After discarding the first few iterations as the burn-in time, we only want to keep every $\tau$-th iteration from the remaining chain, where $\tau$ is the autocorrelation length, to ensure uncorrelated samples. This calculation can be done using autocorr().
# Estimate the autocorrelation length for each temperature
chain_no_burnin = chain[:, :, burnin:]
acorr_array = np.empty((ntemps, nwalkers, ndim))
for tidx in range(ntemps):
    acorr_array[tidx] = autocorr(chain_no_burnin[tidx], c=1, quiet=True)
thin = int(np.ceil(np.max(acorr_array)))
print(f"Estimated autocorrelation length: {thin}")
Out:
Estimated autocorrelation length: 15
Note
autocorr() is a wrapper for emcee.autocorr.integrated_time. As such, the shape of the input array for this function needs to be (L, M, N), where L, M, and N are the number of walkers, steps, and parameters, respectively. This also implies that we need to perform the calculation for each temperature separately.
Finally, after obtaining the independent samples, we need to assess whether the resulting samples have converged to a stationary distribution, and thus are a good representation of the actual posterior. This is done by computing the potential scale reduction factor (PSRF), denoted by $\hat{r}^p$. The value of $\hat{r}^p$ declines to 1 as the number of iterations goes to infinity. A common threshold is about 1.1, but higher thresholds have also been used.
# Assess the convergence for each temperature
samples = chain_no_burnin[:, :, ::thin]
threshold = 1.1 # Threshold for rhat
rhat_array = np.empty(ntemps)
for tidx in range(ntemps):
    rhat_array[tidx] = rhat(samples[tidx])
print(rf"$\hat{{r}}^p$ values: {rhat_array}")
Out:
$\hat{r}^p$ values: [0.99832965 1.07774255 1.10594832 1.07248159]
Note
rhat() only computes the PSRF for one temperature, so the calculation needs to be carried out for each temperature separately.
Notice that in this case, $\hat{r}^p$ is at or near the 1.1 threshold for all temperatures. When this criterion is not satisfied, the sampling process should be continued. Note that some sampling temperatures might converge more slowly than others.
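If the threshold were not met for a temperature of interest, one option is to draw more samples. The sketch below assumes that run_mcmc can simply be called again with starting states taken from the end of the existing chain; this should be verified against the KLIFF documentation before use.
# Sketch: continue sampling from the last recorded positions (assumption:
# `run_mcmc` accepts a new set of starting states).
# p0_continue = chain[:, :, -1]  # shape (ntemps, nwalkers, ndim)
# sampler.pool = Pool(nwalkers)
# sampler.run_mcmc(p0_continue, 5000)
# sampler.pool.close()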
After obtaining the independent samples from the MCMC sampling, the uncertainty of the parameters can be obtained by inspecting the distribution of the samples. As an example, we use the corner Python package to present the MCMC results at sampling temperature 1.0 as a corner plot.
# Plot samples at T=1.0
corner(samples[0].reshape((-1, ndim)), labels=[r"$\log(A)$", r"$\log(\lambda)$"])

Out:
<Figure size 550x550 with 4 Axes>
Note
As an alternative, KLIFF also provides a wrapper to emcee. This can be accessed by setting sampler="emcee" when instantiating MCMC, as sketched below. For further documentation, see EmceeSampler.
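A minimal sketch of that alternative, reusing the loss and prior bounds defined above, is given below; the exact arguments accepted by the emcee wrapper are assumptions and should be checked against the EmceeSampler documentation.
# Sketch: use the emcee backend (single temperature) instead of ptemcee.
# emcee_sampler = MCMC(loss, sampler="emcee", logprior_args=(bounds,))
# emcee_sampler.pool = Pool(nwalkers)
# emcee_sampler.run_mcmc(p0[0], 5000)  # initial states of shape (nwalkers, ndim)
# emcee_sampler.pool.close()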
Total running time of the script: ( 3 minutes 14.012 seconds)