Bayesian inference
When we make inference on the parameter \(\theta\) given the observed data \(y\), we obtain the posterior distribution \(P(\theta | y)\) from Bayes' formula \[ P(\theta|y)=\frac{f(y|\theta)\cdot P(\theta)}{f(y)} \] where \(f(y|\theta)\) and \(P(\theta)\) are the likelihood and the prior distribution, respectively. The denominator \(f(y)\) is a normalizing constant, \[ f(y)= \int_{\theta}{f(y| \theta)\cdot P(\theta)\,d\theta} \] which ensures that \(P(\theta | y)\) integrates to 1 (which is required for it to be a probability distribution). A closed-form expression for \(P(\theta | y)\) can be obtained when the posterior belongs to the same probability distribution family as the prior \(P(\theta)\). For example, in an experiment of repeated Bernoulli trials, the likelihood function for the number of “successes” is a Binomial distribution, and if we then use a Beta distribution as a prior for the probability of “success”, the resulting posterior will also be a Beta distribution. When we specify the analysis in such a way, it is called a conjugate analysis.
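As a small illustration, here is a minimal sketch in R of the conjugate Beta–Binomial analysis (the numbers are made up: a Beta(2, 2) prior and 7 successes in 10 trials). The posterior parameters follow directly from the conjugacy result above.
# Conjugate Beta-Binomial analysis: a Beta(a, b) prior combined with a
# Binomial likelihood gives a Beta(a + y, b + n - y) posterior in closed form.
a <- 2; b <- 2              # prior Beta(2, 2), weakly informative around 0.5
n <- 10; y <- 7             # made-up data: 7 successes in 10 trials
a_post <- a + y             # posterior shape parameters
b_post <- b + n - y
# Plot the prior and posterior densities for the probability of success
theta <- seq(0, 1, length.out = 200)
plot(theta, dbeta(theta, a_post, b_post), type = "l",
     xlab = expression(theta), ylab = "Density")
lines(theta, dbeta(theta, a, b), lty = 2)   # dashed line: the prior
legend("topleft", c("Posterior", "Prior"), lty = c(1, 2))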
Non-conjugate analysis and complex experiments
When we want to use non-conjugate priors for simple problems, or when the experiment gives rise to a complex likelihood function, for example in study designs with hierarchical structures where \(\theta\) is a vector of many parameters, a conjugate analysis can't be performed and \(P(\theta | y)\) can't be expressed in closed form. The reason is that even though the likelihood function (conditional on \(\theta\)) can often be expressed, at least approximately, so that we can compute the value of the numerator \(f(y|\theta)\cdot P(\theta)\), an analytic solution to the integral \(f(y)\) will in general not be obtainable.
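To make this concrete, here is a minimal sketch with an invented non-conjugate variant of the example above: the same Binomial likelihood, but a Normal(0, 1.5) prior on the log-odds of success. The numerator is easy to evaluate at any point, and in one dimension we can still integrate numerically, but that road closes when \(\theta\) has many dimensions.
# Non-conjugate example: Binomial likelihood with a Normal(0, 1.5) prior on
# the log-odds. The numerator f(y | theta) * P(theta) is easy to evaluate
# pointwise, but the integral f(y) has no closed-form solution.
n <- 10; y <- 7                      # same made-up data as above
unnorm_post <- function(logit_theta) {
  theta <- plogis(logit_theta)       # transform log-odds to a probability
  dbinom(y, n, theta) * dnorm(logit_theta, 0, 1.5)
}
unnorm_post(0.5)                     # any single point can be evaluated...
integrate(unnorm_post, -Inf, Inf)    # ...and in 1-D numerical integration
# still works, but it breaks down when theta is a high-dimensional vector.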
If we can't express \(P(\theta | y)\) in closed form, we can try to approximate it empirically, that is, by drawing values of \(\theta\) such that their histogram is very similar to \(P(\theta | y)\). The histogram will approximate \(P(\theta | y)\) if we select appropriate values of \(\theta\) in the simulation. But how can we know which values are appropriate? Markov Chain Monte Carlo (MCMC) is an algorithm that selects the values for us, so that once enough choices (a few thousand) have been made, the histogram will automatically mimic \(P(\theta | y)\).
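The histogram idea is easy to check in a case where we already know the answer. Here is a small sketch reusing the Beta–Binomial example above: draw values from the known Beta(9, 5) posterior and compare their histogram to the exact density.
# In the conjugate example the posterior is known exactly, so we can check
# that a histogram of draws approximates the true density.
set.seed(1)
draws <- rbeta(5000, 2 + 7, 2 + 10 - 7)    # 5000 draws from Beta(9, 5)
hist(draws, breaks = 40, freq = FALSE, xlab = expression(theta),
     main = "Histogram of draws vs exact posterior")
curve(dbeta(x, 9, 5), add = TRUE, lwd = 2) # exact posterior density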
Markov Chain Monte Carlo (MCMC)
How does MCMC select values of \(\theta\)? I had a hard time getting an intuitive feeling for how MCMC works, at least in principle. We want to select values of \(\theta\), but some values should be chosen many times and others only a few times (we want the number of times we choose specific values of \(\theta\) to be approximately proportional to the probability \(P(\theta | y)\)). The criterion for which value of \(\theta\) one chooses depends on the value of the numerator \(f(y|\theta)\cdot P(\theta)\) for a randomly proposed value of \(\theta\), which is a potential candidate to be selected. If this product has a larger value than it had for the previous value, we accept the proposed \(\theta\). If the product has a smaller value, we accept the proposal only with a probability equal to the ratio of the two products; otherwise we discard the new value and select the previous value again. Since \(f(y|\theta)\cdot P(\theta)\) has the same shape as \(P(\theta | y)\), if we do this thousands of times, we will have chosen different values of \(\theta\) a different number of times each. The histogram of these values will approximate \(P(\theta | y)\).
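The rule described above is (a simplified version of) the Metropolis algorithm. Here is a bare-bones sketch for the non-conjugate log-odds model from before; the proposal standard deviation of 1 is an arbitrary tuning choice.
# A bare-bones Metropolis sampler for the log-odds model sketched earlier.
n <- 10; y <- 7
unnorm_post <- function(lo) dbinom(y, n, plogis(lo)) * dnorm(lo, 0, 1.5)
set.seed(1)
n_iter  <- 5000
chain   <- numeric(n_iter)
current <- 0                                # arbitrary starting value
for (i in seq_len(n_iter)) {
  proposal <- rnorm(1, current, 1)          # propose a value near the current one
  ratio <- unnorm_post(proposal) / unnorm_post(current)
  if (runif(1) < ratio)                     # accept with probability min(1, ratio)
    current <- proposal                     # ...otherwise keep (and re-count) the old value
  chain[i] <- current
}
# The histogram of the sampled probabilities approximates P(theta | y)
hist(plogis(chain), breaks = 40, freq = FALSE, xlab = expression(theta))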
This is my intuitive feeling of how it works in principle. Of course, things can't be that simple for such an extremely complex task as sampling from a distribution that you don't know anything about (and that doesn't exist physically), and a lot of work has been devoted to improving how MCMC operates. Algorithms have been developed to improve the chain in terms of faster computations and smarter proposals, for example to avoid the Markov chain getting stuck in (or avoiding) local regions of the parameter space, such as local maxima or minima in multimodal distributions. This is described in Richard McElreath's book Statistical Rethinking (see also the video lectures here). For instance, the Hamiltonian Monte Carlo (HMC) algorithm does MCMC much more cleverly than earlier ones based on Gibbs sampling and the Metropolis algorithm. Stan uses HMC and its adaptive variant, the No-U-Turn Sampler (NUTS). The animation below illustrates how Hamiltonian Monte Carlo picks parameter values. More animations are available here.
knitr::include_url("http://chi-feng.github.io/mcmc-demo/app.html?algorithm=HamiltonianMC&target=banana")
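To see NUTS in action, here is a minimal sketch fitting the Beta–Binomial example with Stan, assuming the rstan package is installed (the prior and the made-up data are the same as above).
# Fitting the Beta-Binomial model with Stan's NUTS sampler via rstan.
library(rstan)
model_code <- "
data {
  int<lower=0> N;
  int<lower=0, upper=N> y;
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  theta ~ beta(2, 2);       // same Beta(2, 2) prior as before
  y ~ binomial(N, theta);   // Binomial likelihood
}
"
fit <- stan(model_code = model_code, data = list(N = 10, y = 7),
            chains = 4, iter = 2000)
print(fit)    # posterior summary for theta, sampled with NUTS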
As an applied Bayesian I personally have no ambition to be an expert on MCMC, but I think at least some intuitive understanding of it is important to have. What it does is very fascinating: it helps you approximate something unknown that in fact doesn't exist in the physical world (a probability distribution only has a physical meaning in the frequentist framework). Besides McElreath's book, I can recommend this video as well.