Sunday, July 12, 2015

Should I junk JAGS? Is Stan the man?

There are a number of analytical tools that enable statisticians to fit Bayesian hierarchical models.

For some time I have been using JAGS, which uses Gibbs sampling in its MCMC algorithm.

Because it is so bloody cold outside, I thought I would give Stan a try. Stan uses Hamiltonian Monte Carlo sampling in its MCMC algorithm. I am using Stan version 2.6.3.0 with Python's pystan interface.

Proponents of Stan claim that it has replaced JAGS as the state-of-the-art, black-box MCMC method. Their criticism of JAGS is that Gibbs sampling mixes poorly when the posterior is highly correlated. They argue that CPU time and memory usage under Stan scale much better with model complexity. As models become larger, JAGS chokes.

The first challenge I faced was getting pystan to work. In the end I deleted my entire python set-up, and reinstalled Anaconda 2.3.0 for OSX from Continuum Analytics. From there it was a quick step at the shell prompt:

conda install pystan


Using Stan required me to refactor the model a little. There was the obvious: Stan uses semicolons to end statements. Some of the functions have been renamed (for example, JAGS' dnorm() becomes Stan's normal()). Some of the functions take different arguments (dnorm() is passed a precision, while normal() is passed the standard deviation). Comments in Stan begin with a double slash, whereas in JAGS they begin with a hash (although Stan will accept hashed comments as well). Stan also required me to divide the model into a number of distinct code blocks.

A bigger challenge was how to code the sum-to-zero constraint I place on house effects. In JAGS, I encode it as follows:

    for (i in 2:n_houses) {
        houseEffect[i] ~ dnorm(0, pow(0.1, -2))
    }
    houseEffect[1] <- -sum( houseEffect[2:n_houses] )


In Stan, I needed a different approach using transformed parameters. But I did not need the for-loop, as Stan allows vector-arithmetic in its statements.

    pHouseEffects ~ normal(0, 0.1);                     // weakly informative parameter
    houseEffect <- pHouseEffects - mean(pHouseEffects); // sum-to-zero transformation
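Both constructions do the same job, and it is easy to check numerically that each produces house effects summing to zero. A minimal Python sketch (the raw values below are invented for illustration; in the models they are sampled parameters):

```python
# Hypothetical raw house effects; in the models these are sampled parameters.
raw = [0.012, -0.005, 0.020, -0.003]

# JAGS approach: houses 2..n are free; house 1 is minus the sum of the rest.
jags_style = [-sum(raw[1:])] + raw[1:]

# Stan approach: all houses are free; the mean is subtracted from each.
stan_style = [x - sum(raw) / len(raw) for x in raw]

# Either way, the house effects now sum to (numerically) zero.
assert abs(sum(jags_style)) < 1e-12
assert abs(sum(stan_style)) < 1e-12
```

One subtle difference: under the JAGS construction the first house effect is the sum of n-1 draws, so it carries a larger implied prior variance than the others, whereas mean-centring treats every house symmetrically.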


I also ran into all sorts of trouble with uniform distributions generating run-time errors. I am not sure whether this was a result of my poor model specification, something else, or something I should just ignore. In the end, I replaced the problematic uniform distributions with weakly informative (fairly dispersed) normal distributions. Anyway, after a few hours I had a working Stan model for aggregating two-party-preferred poll results.

The results for the hidden voting intention generated with Stan are quite similar to those with JAGS. In the next two charts, we have the Stan model first and then the JAGS model. The JAGS model was changed to mirror Stan as much as possible (i.e. I replaced the same uniform distributions in JAGS that I had replaced in Stan).

We also have similar results with the relative house effects. Again, the Stan result precedes JAGS in the following charts. But those of you with a close eye will note that these results differ slightly from the earlier results that came from the JAGS model which used a uniform distribution as a prior for house effects. There is a useful lesson here on the importance of model specification. Also, it is far too easy to read too much into results that differ by a few tenths of a percentage point. The reversal of Newspoll and ReachTEL in these charts comes down to hundredths of a percentage point (which is bugger all).

So what does the Stan model look like?

data {
    // data size
    int<lower=1> n_polls;
    int<lower=1> n_span;
    int<lower=1> n_houses;

    // poll data
    real<lower=0,upper=1> y[n_polls];
    real<lower=0> sampleSigma[n_polls];
    int<lower=1> house[n_polls];
    int<lower=1> day[n_polls];
}
parameters {
    real<lower=0,upper=1> hidden_voting_intention[n_span];
    vector[n_houses] pHouseEffects;
    real<lower=0,upper=0.01> sigma;
}
transformed parameters {
    vector[n_houses] houseEffect;
    houseEffect <- pHouseEffects - mean(pHouseEffects); // sum to zero
}
model {
    // -- house effects model
    pHouseEffects ~ normal(0, 0.1); // weakly informative

    // -- temporal model
    sigma ~ uniform(0, 0.01);
    hidden_voting_intention[1] ~ normal(0.5, 0.1);
    for (i in 2:n_span) {
        hidden_voting_intention[i] ~ normal(hidden_voting_intention[i-1], sigma);
    }

    // -- observational model
    for (poll in 1:n_polls) {
        y[poll] ~ normal(houseEffect[house[poll]] +
            hidden_voting_intention[day[poll]], sampleSigma[poll]);
    }
}
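To see the generative process this model assumes, here is a quick Python simulation of it: a daily Gaussian random walk for voting intention, mean-centred house effects, and polls observed with sampling noise. All the sizes and scales below are invented for illustration, not taken from the real poll data:

```python
import random

random.seed(1)

# Illustrative sizes and scales only -- not the real poll data.
n_span, n_houses, n_polls = 100, 4, 40
sigma_walk = 0.005   # daily drift, inside the model's (0, 0.01) bound

# Temporal model: hidden voting intention as a Gaussian random walk.
hidden = [0.5]
for _ in range(1, n_span):
    hidden.append(hidden[-1] + random.gauss(0, sigma_walk))

# House effects: drawn loosely, then mean-centred so they sum to zero.
raw = [random.gauss(0, 0.01) for _ in range(n_houses)]
house_effect = [h - sum(raw) / n_houses for h in raw]

# Observational model: each poll sees the day's hidden intention,
# shifted by its house effect, plus noise at the poll's standard error.
polls = []
for _ in range(n_polls):
    day = random.randrange(n_span)
    house = random.randrange(n_houses)
    sample_sigma = 0.015  # roughly a 1000-respondent poll near 50 per cent
    y = hidden[day] + house_effect[house] + random.gauss(0, sample_sigma)
    polls.append((day, house, y))
```

Fitting the Stan model to data simulated this way is a useful check that the sampler can recover known parameters before pointing it at real polls.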


The revised JAGS code is as follows. The code I have changed for this post is commented out.

model {
    ## developed from Simon Jackman's hidden Markov model
    ## - note: poll results are analysed as a value between 0.0 and 1.0

    ## -- observational model
    for (poll in 1:n_polls) { # for each observed poll result ...
        yhat[poll] <- houseEffect[house[poll]] + hidden_voting_intention[day[poll]]
        y[poll] ~ dnorm(yhat[poll], samplePrecision[poll]) # distribution
    }

    ## -- temporal model
    for (i in 2:n_span) { # for each day under analysis, except the first ...
        # today's national TPP voting intention looks much like yesterday's
        hidden_voting_intention[i] ~ dnorm(hidden_voting_intention[i-1], walkPrecision)
    }
    # day 1 estimate of TPP between 20% and 80% - a weakly informative prior
    #hidden_voting_intention[1] ~ dunif(0.2, 0.8)
    hidden_voting_intention[1] ~ dnorm(0.5, pow(0.1, -2))
    # day-to-day change in TPP has a standard deviation between 0 and 1
    # percentage points
    sigmaWalk ~ dunif(0, 0.01)
    walkPrecision <- pow(sigmaWalk, -2)

    ## -- house effects model
    #for (i in 2:n_houses) { # for each polling house, except the first ...
    #    # assume house effect is somewhere in the range -15 to +15 percentage points.
    #    houseEffect[i] ~ dunif(-0.15, 0.15)
    #}
    for (i in 2:n_houses) {
        houseEffect[i] ~ dnorm(0, pow(0.1, -2))
    }
    # sum to zero constraint applied to the first polling house ...
    houseEffect[1] <- -sum( houseEffect[2:n_houses] )
}


As for the question: will I now junk JAGS and move to Stan? I will need to think about that a whole lot more before I make any changes. I love the progress feedback that Stan gives as it processes the samples. I think Stan might be marginally faster. But model specification in Stan is far more fiddly.

Update

I have now timed them. Stan is slower.

3 comments:

1. Have you calculated effective sample sizes per second? Stan is slower on samples/second but often much faster on effective samples per second, except for conditionally conjugate Gibbs models.

2. Have you tried this out with anchors?

3. Hi Mark, I really like the graded shading of your figures. Would it be possible for you to share your figure plotting code? Thanks.