Sunday, December 9, 2012

More on house effects over time

Early last decade, Simon Jackman published his Bayesian approach to poll aggregation. It allows the house effect (systematic bias) of each polling house to be calibrated, either absolutely against a known election outcome or relatively against the average of all the other polling houses.

Jackman's approach rests on two ideas. First, he theorised that voting intention typically does not change much from day to day (although his model allows for occasional larger movements in public opinion): on most days, the voting intention of the public is much the same as it was on the previous day. His model identifies the most likely path that voting intention took, day by day, through the period under analysis. This day-to-day track of likely voting intention must then line up (as best it can) with the published polls as they occurred during this period. Second, to help the modeled day-to-day walk of public opinion line up with the published polls, Jackman's approach assumes that each polling house has a systematic bias that is normally distributed around a constant number of percentage points above or below the actual population's voting intention.
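
To make those two assumptions concrete, here is a minimal Python sketch of the data-generating story (it is not Jackman's code, and it is not the JAGS model used for the charts in this post). The house names, the biases, and the noise levels are all made up for illustration. It simulates a slow daily random walk, draws polls that carry a constant house bias plus sampling noise, and then shows that averaging each house's residual against the true walk roughly recovers that bias.

```python
import random

def simulate_polls(days, houses, walk_sd, sample_sd, polls_per_house, seed=0):
    """Simulate a latent daily random walk and biased, noisy polls of it."""
    rng = random.Random(seed)
    # Latent voting intention: today's value is yesterday's plus a small step.
    vi = [0.5]
    for _ in range(1, days):
        vi.append(vi[-1] + rng.gauss(0.0, walk_sd))
    polls = []
    for name, bias in houses.items():
        for _ in range(polls_per_house):
            day = rng.randrange(days)
            # Observed poll = truth on that day + constant house bias + sampling noise.
            polls.append((day, name, vi[day] + bias + rng.gauss(0.0, sample_sd)))
    polls.sort()  # date order
    return vi, polls

# Hypothetical houses and biases (as proportions, so 0.01 is one percentage point).
houses = {"A": 0.01, "B": -0.01, "C": 0.0}
vi, polls = simulate_polls(days=180, houses=houses, walk_sd=0.002,
                           sample_sd=0.015, polls_per_house=200)

# Averaging each house's (poll - truth) residual approximately recovers its bias.
for name in houses:
    resid = [y - vi[d] for d, h, y in polls if h == name]
    print(name, round(sum(resid) / len(resid), 3))
```

In the real problem the true walk is unknown, of course, which is why the walk and the house effects have to be estimated jointly.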

Jackman's approach works brilliantly over the short run. In the next chart, which is based on a 100,000-run simulation of possible walks that satisfy the constraints in the model, we pick out the median pathway for each day over the last six months. The result is a reasonably smooth curve. While I have not labeled the end point in the median series, it was 47.8 per cent.

However, over longer periods, Jackman's model is less effective. The problem is the assumption that the distribution of house effects remains constant over time. This is not the case. In the next chart, we apply the same 100,000-run simulation approach as above, but to the data since the last election. The end point for this chart is 47.7 per cent.

The estimated population voting intention line looks choppier (because the constantly distributed house effects element of the model contributes less to the analysis over the longer run). Previously I noted that over the last three years, Essential's house effect has moved around somewhat in comparison to the other polling houses.

All of this got me wondering whether it was possible to design a model that identified this movement in house effects over time, on (say) a six-month rolling average basis. My idea was to take the median line from Jackman's model and use it to benchmark the polling houses. I also wondered whether I could then use the newly identified time-varying house effects to better identify the underlying population voting intention.

The first step, taking a six-month rolling average against the original Jackman line, was simple, as can be seen in the next chart (noting this is a 10,000-run simulation).
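
As a rough illustration of that first step, here is a Python sketch of a centred rolling house effect. It is not the code behind the chart: the function name, the 91-day half-window (about six months in total), and the toy data are my assumptions for the example. For each poll it averages the gap between poll and benchmark over that house's polls in the window, taking the benchmark on each poll's own day (the JAGS code further down instead uses the benchmark on the centre poll's day).

```python
def rolling_house_effect(polls, benchmark, half_window=91):
    """For each poll, average (poll - benchmark) over same-house polls
    whose day falls within +/- half_window days of that poll."""
    effects = []
    for day_i, house_i, _ in polls:
        resid = [y - benchmark[d] for d, h, y in polls
                 if h == house_i and abs(d - day_i) <= half_window]
        # resid is never empty: poll i always sits inside its own window.
        effects.append(sum(resid) / len(resid))
    return effects

# Toy data: a flat benchmark, and one hypothetical house "X" whose bias
# flips from +2 points to -2 points halfway through the period.
benchmark = [0.5] * 400
polls = [(d, "X", 0.5 + (0.02 if d < 200 else -0.02)) for d in range(0, 400, 50)]

for (day, _, _), effect in zip(polls, rolling_house_effect(polls, benchmark)):
    print(day, round(effect, 4))
```

A single constant house effect would average this house out to zero, whereas the rolling version tracks the shift, with a blended value for polls whose window straddles the change.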

However, designing a model where the fixed and variable sides of the model informed each other proved more challenging than I had anticipated (in part because the JAGS program requires the model to be specified as a directed acyclic graph). At first, I could not find an easy way for the fixed effects side of the model to inform the variable effects side, and for the variable effects side to inform the fixed effects side, without the whole model becoming a cyclic graph.

When I finally solved the problem, a really nice chart for population voting intention popped out the other end (after 2.5 hours of computer time for the 100,000-run simulation).

Also, the six-monthly moving average for the house effects (which is measured against the line) looked a touch smoother (but this may be the result of using a 100,000-run simulation versus a 10,000-run simulation for the earlier chart).

This leads me to another observation. A number of other blogs interested in poll aggregation ignore or down-weight the Morgan face-to-face poll series. I have been asked why I use it.

I use the Morgan face-to-face series because it is fairly consistent relative to the other polls. It is a bit like comparing a watch that is consistently five minutes slow with a watch that is sometimes a minute or two fast and at other times a minute or two slow, but which moves randomly between these two states. A watch that is consistently slow is more informative once it has been benchmarked than a watch that might be closer to the actual time, but whose behaviour around the actual time is random. In short, I think the people who ignore or down-play this Morgan series are not taking advantage of really useful information.
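
The watch analogy can be put into numbers with a small Python sketch. The two watches and their error patterns are invented for the example and are not fitted to any polling data. Before benchmarking, the steady watch has the larger average error; once each watch's average error is subtracted, the steady watch leaves far less unexplained spread.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical watches: errors in minutes relative to the true time.
steady = [-5.0 + rng.gauss(0.0, 0.2) for _ in range(1000)]   # consistently about 5 minutes slow
erratic = [rng.choice([-1.5, 1.5]) for _ in range(1000)]      # randomly 1.5 minutes fast or slow

def raw_error(errors):
    """Average absolute error before any benchmarking."""
    return statistics.mean([abs(e) for e in errors])

def benchmarked_error(errors):
    """Spread that remains once the average error has been subtracted."""
    return statistics.pstdev(errors)

print("raw:        ", round(raw_error(steady), 2), round(raw_error(erratic), 2))
print("benchmarked:", round(benchmarked_error(steady), 2), round(benchmarked_error(erratic), 2))
```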

Back to the model: All of the volatility ended up in the variable effects daily walk, which is substantially influenced by the outliers.

For the nerds: My JAGS code for this is a bit more complicated than for earlier models. The variables y and y2 are the polling observations over the period (the two series are identical; duplicating them is how I kept the graph acyclic). The observations are sorted by date. The lower and upper variables give, for each poll, the index range of the six-month centred window used to estimate the variable effects against the fixed effects (these are calculated in R before being handed to JAGS for the MCMC simulation). The lines marked with a triple dollar sign ($$$) are the lines that allow the fixed and variable elements of the model to inform each other.

    model {
        ## -- temporal model for voting intention (VI)
        for(i in 2:PERIOD) { # for each day under analysis ...
            VI[i] ~ dnorm(VI[i-1], walkVIPrecision)     # fixed effects walk
            VI2[i] ~ dnorm(VI2[i-1], walkVIPrecision2)  # $$$
        }

        ## -- initial fixed house-effects observational model
        for(i in 1:NUMPOLLS) { # for each poll result ...
            roundingEffect[i] ~ dunif(-houseRounding[i], houseRounding[i])
            yhat[i] <- houseEffects[ house[i] ] + VI[ day[i] ] + roundingEffect[i]  ## system
            y[i] ~ dnorm(yhat[i], samplePrecision[i])                               ## distribution
        }

        ## -- variable effects 6-month window adjusted observational model
        for(i in 1:NUMPOLLS) { # for each poll result ...
            ## number of polls by the same house inside poll i's window
            count[i] <- sum(house[ lower[i]:upper[i] ] == house[i])
            ## mean gap between that house's polls in the window and the fixed effects walk
            adjHouseEffects[i] <- sum( (y[ lower[i]:upper[i] ] - VI[ day[i] ]) *
                (house[ lower[i]:upper[i] ] == house[i]) ) / count[i]
            roundingEffect2[i] ~ dunif(-houseRounding[i], houseRounding[i])     # $$$
            yhat2[i] <- adjHouseEffects[i] + VI2[ day[i] ] + roundingEffect2[i] # $$$
            y2[i] ~ dnorm(yhat2[i], samplePrecision[i])                         # $$$
        }

        ## -- point-in-time sum-to-zero constraint on constant house effects
        houseEffects[1] <- -sum( houseEffects[2:HOUSECOUNT] )

        ## -- priors
        for(i in 2:HOUSECOUNT) { ## vague normal priors for house effects
            houseEffects[i] ~ dnorm(0, pow(0.1, -2))
        }

        sigmaWalkVI ~ dunif(0, 0.01)            ## uniform prior on std. dev.
        walkVIPrecision <- pow(sigmaWalkVI, -2) ##   for the day-to-day random walk
        VI[1] ~ dunif(0.4, 0.6)                 ## initialisation of the voting intention daily walk

        sigmaWalkVI2 ~ dunif(0, 0.01)             ## $$$
        walkVIPrecision2 <- pow(sigmaWalkVI2, -2) ## $$$
        VI2[1] ~ dunif(0.4, 0.6)                  ## $$$
    }
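
For anyone wanting to reproduce the lower and upper vectors, here is a sketch of one way to compute them. I did this step in R; the version below is an illustrative Python equivalent, and the function name, the 91-day half-window, and the sample days are assumptions made for the example. It relies on the polls already being in date order, and it returns 1-based positions because that is how JAGS indexes vectors.

```python
from bisect import bisect_left, bisect_right

def window_bounds(days, half_window=91):
    """Return 1-based (lower, upper) index vectors for a centred window.

    `days` must be sorted ascending, matching the date-ordered polls.
    """
    lower, upper = [], []
    for d in days:
        # First poll taken on or after day d - half_window (1-based position).
        lower.append(bisect_left(days, d - half_window) + 1)
        # Last poll taken on or before day d + half_window (1-based position).
        upper.append(bisect_right(days, d + half_window))
    return lower, upper

days = [1, 40, 95, 100, 250, 300]   # hypothetical poll days, already sorted
lower, upper = window_bounds(days)
print(lower)   # [1, 1, 2, 2, 5, 5]
print(upper)   # [2, 4, 4, 4, 6, 6]
```

Because each poll always falls inside its own window, lower[i] <= i <= upper[i] for every poll, so count[i] in the model can never be zero.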

I suspect this is more complicated than it needs to be; any help in simplifying the approach would be appreciated.
