MCMC¶
The pyRSD package wraps the emcee
package to provide functionality
for running MCMC chains to sample the posterior distributions of the
model parameters.
Command-line Options¶
MCMC chains are run by passing the mcmc
sub-command
to the rsdfit
executable. The calling sequence for the mcmc
command is
$ rsdfit mcmc -h
usage: rsdfit [-h] [--version] {mcmc,nlopt,restart,analyze} ...
From more help on each of the subcommands, type:
rsdfit mcmc -h
rsdfit nlopt -h
rsdfit restart -h
rsdfit analyze -h mcmc
[-h] [-m MODEL] [-p PARAMS] [--silent] -w WALKERS -i ITERATIONS
[-n NCHAINS] -o FOLDER [--debug] [--no-save-model]
optional arguments:
-h, --help show this help message and exit
-m MODEL, --model MODEL
file name holding the model path
-p PARAMS, --params PARAMS
file name holding the driver, theory, and data
parameters
--silent silence the standard output to the console
-w WALKERS number of emcee walkers to run the MCMC chain
(required)
-i ITERATIONS number of steps to run in the MCMC chain (required)
-n NCHAINS, --nchains NCHAINS
number of MCMC chains to run concurrently
-o FOLDER, --output FOLDER
the folder where the results will be written
(required)
--debug whether to print more info about the mpi4py.Pool
object
--no-save-model do not save the model instance
The main options are the parameter file, passed by the -p
option,
the directory to save results, passed by the -o
option, and the
name of a model to load, passed by the -m
file. In addition, there
are some MCMC-specific options:
-w, —walkers
The number of
emcee
walkers to use. These walkers are responsible for exploring the relevant parameter space simulataneously.Note
The number of walkers must be at least twice the number of model parameters, and generally, the more walkers used the better. However, a trade-off exists since more walkers can become computationally infeasible when model evaluation is slow. In the case of the default model with 13 parameters, we recommend using 30-40 walkers.
-i, —iterations
The number of iterations to run in the MCMC chain.
Note
Generally, the models in the pyRSD package require >1000 iterations to achieve convergence, depending on how the MCMC sampler is initialized.
–nchains
If the
rsdfit
is executed in parallel with multiple processes using MPI, it is possible to run multiple chains with MCMC concurrently with this option. This is ideal for testing the convergence of the sampler, as the chains will be compared statistically to determine if they have converged to a similar point in parameter space.
Initializing the MCMC Chains¶
The method used to initialize either the MCMC chains can be configured
using the driver.init_from
parameter. The allowed values of this parameter
when using the MCMC solver are:
fiducial :
Initialize the parameters in a small ball around the fiducial parameter values, specified in the parameter file via the
fiducial
keyword for each free parameterprior :
Initialize the free parameters by drawing a value from the prior probability specified for each parameter, which will be either a uniform or normal distribution
result :
Initialize the free parameters in a small ball around the best-fit parameters loaded from a previous result. In this case, the
driver.start_from
parameter should give the name of a.npz
, which can be loaded into either apyRSD.rsdfit.results.LBFGSResults
orpyRSD.rsdfit.results.EmceeResults
object.
Testing for Convergence¶
The convergence of the MCMC chain being run will be tested if the
driver.test_convergence
parameter is set to True
. This test is performed
using the Gelman and Rubin diagnostic, and the tolerance
level for the test is specified via the driver.epsilon
parameter.
The convergence for MCMC chains is performed as
- Remove the first half of the current chains
- Calculate the within chain and between chain variances
- Estimate the variance from the within chain and between chain variance
- Calculate the potential scale reduction parameter and compare to
epsilon
Note
Convergence testing is disabled for MCMC chains by default, as it is often
easier to run a pre-defined number of iterations (passed by the -i
flag),
which will be specific to the user’s problem at hand.
Recommended Practices¶
The emcee
package recommends initializing its MCMC sampler in a small
ball around what the user believes to be the area in parameter space close
to the maximum probability. As such, we recommend either of the following
initialization methods for best results:
1. Run the NLOPT solver, starting from a fiducial set of values, and then inititialize the MCMC solver with the result of that NLOPT run.
2. Initialize the MCMC solver with values drawn from the parameters’ prior distribution and run a set of burn-in iterations (typically ~1000 or so). Then, initialize a second MCMC chain from the best-fit result of that burn-in period.
For more recommended practices regarding the emcee
package, please
see the FAQ on the emcee documentation here.
Finally, it is also useful for convergence reasons to run multiple chains at once in parallel. Often running two chains, independently initialized, with half the number of iterations is useful to asses if the chains have truly found the best-fit parameters.