The broad description for the method of poll aggregation I use on this site is the hidden Markov model (HMM). These models are analogous to the state-space models developed in the 1960s (beginning with the Kalman Filter). Hidden Markov models are one subset of Bayesian hierarchical models.
Using a HMM, I model the population voting intention (which cannot be observed directly - it is "hidden") as a series of states (either daily or weekly, depending on the model). Each state is directly dependent on the previous state and a probability distribution linking the states. Collectively, these links form a Markov chain or process. The model is informed by irregular and noisy data from the selected polling houses.
Solving the model necessitates integration over a series of complex multidimensional probability distributions. The definite integral is typically impossible to solve algebraically. But it can be solved using a numerical method based on Markov chains and random numbers known as Markov Chain Monte Carlo (MCMC) integration. I use a free software product called JAGS to solve the model.
The two poll aggregation models I had developed previously were dynamic linear models. In both models, the probability distribution linking the hidden states in the Markov model was the normal or Gaussian distribution: the bell-curve of high school statistics. Similarly, the probability distribution linking the hidden states with the polling observations was also the normal distribution. It is this dual independent use of the normal distribution that makes these models: dynamic linear models.
For the model based on the univariate Coalition two-party preferred poll results, the dynamic linear model works a treat. However, the multivariate primary vote model was far more difficult to construct. Quite some effort went into constraining the primary vote shares so that they always summed to one. It resulted in a large (and very slow) directed acyclic graph.
To address this complexity problem, I wondered whether a Dirichlet distribution could be used. Named for Peter Gustav Lejeune Dirichlet, the distribution is pronounced dirik-lay. The Dirichlet distribution has a number of useful features: First it is a multivariate distribution. Second, the output from the distribution always sums to one. It sounded ideal for modelling a time series of dynamically changing, continuous primary vote proportions (all on the unit interval).
In 2013, Emil Aas Stoltenberg wrote a paper on Bayesian Forecasting of Election Results in Multiparty Systems. In that paper he explored a Dirichlet-Multinomial process for the hidden Markov model. His model was coded in Python.
The Coalition TPP estimate from the this model over 6-months, and over the period since the previous election is not dissimilar to the output from the two dynamic linear models.
Turning to the model, rather than estimate the poll result, I use the multinomial distribution to estimate the number of people in each poll sample who expressed a preference for each of the parties. This is a very different approach to my other models. So that you can see it, I will include the R-code where I set up the input data for the model.
It should go without saying that the usual caveats apply. This is new code, and may include bugs. You will note that I am testing a few options in different places (see comments).
df <- read.csv(args[6], header=TRUE) df <- df[order(df$Week), ] NUMPOLLS <- nrow(df) PERIOD <- max(df$Week) HOUSECOUNT <- length(levels(df$House)) # a factor HOUSENAMES <- levels(df$House) PARTYNAMES <- c('Coalition', 'Labor', 'Greens', 'Other') PARTIES <- length(PARTYNAMES) primaryVotes <- df[PARTYNAMES] * df$Sample primaryVotes <- sapply(primaryVotes, function(x) round(x,0)) day0 <- min(as.Date(df$Date)) - 1 ## Assume Coalition gets preferences as follows: ## - 16.97% of the Green vote [was 16.97 in 2013 and 21.16 in 2010] ## - 53.3% of the Other vote [was 53.3 in 2013 and 58.26 in 2010] ## See: Antony Green - http://blogs.abc.net.au/antonygreen/2013/11/ ## preference-flows-at-the-2013-federal-election.html preference_flows <- c(1.0, 0.0, 0.1697, 0.533) data = list(PERIOD = PERIOD, HOUSECOUNT = HOUSECOUNT, NUMPOLLS = NUMPOLLS, PARTIES = PARTIES, primaryVotes = primaryVotes, pollWeek = df$Week, house = as.integer(df$House), # manage rounding issues with df$Sample ... n = rowSums(primaryVotes), preference_flows = preference_flows ) print(data) # ----- JAGS model ... library(rjags) model <- " model { #### -- observational model for(poll in 1:NUMPOLLS) { # for each poll result - rows adjusted_poll[poll, 1:PARTIES] <- walk[pollWeek[poll], 1:PARTIES] + houseEffect[house[poll], 1:PARTIES] primaryVotes[poll, 1:PARTIES] ~ dmulti(adjusted_poll[poll, 1:PARTIES], n[poll]) } #### -- temporal model (a weekly walk where this week is much like last week) tightness <- 10000 # KLUDGE: tightness of fit parameter selected by trial and error for(week in 2:PERIOD) { # Note: use math not a distribution to generate the multinomial ... multinomial[week, 1:PARTIES] <- walk[week-1, 1:PARTIES] * tightness walk[week, 1:PARTIES] ~ ddirch(multinomial[week, 1:PARTIES]) } ## -- weakly informative priors for first week in the temporal model for (party in 1:2) { # for each major party alpha[party] ~ dunif(250, 600) # majors between 25% and 60% } for (party in 3:PARTIES) { # for each minor party alpha[party] ~ dunif(10, 250) # minors between 1% and 25% } walk[1, 1:PARTIES] ~ ddirch(alpha[]) ## -- estimate a Coalition TPP from the primary votes for(week in 1:PERIOD) { CoalitionTPP[week] <- sum(walk[week, 1:PARTIES] * preference_flows[1:PARTIES]) } #### -- sum-to-zero constraints on house effects for (party in 2:PARTIES) { # for each party ... # house effects across houses sum to zero # NOTE: ALL MUST SUM TO ZERO houseEffect[1, party] <- -sum( houseEffect[2:HOUSECOUNT, party] ) } for(house in 1:HOUSECOUNT) { # for each house ... # house effects across the parties sum to zero houseEffect[house, 1] <- -sum( houseEffect[house, 2:PARTIES] ) } # but note, we do not apply a double constraint to houseEffect[1, 1] monitorHouseEffectOneSumParties <- sum(houseEffect[1, 1:PARTIES]) monitorHouseEffectOneSumHouses <- sum(houseEffect[1:HOUSECOUNT, 1]) ## -- vague normal priors for house effects - centred on zero for (party in 2:PARTIES) { # for each party (cols) for(house in 2:HOUSECOUNT) { # (rows) houseEffect[house, party] ~ dnorm(0, pow(0.1, -2)) } } } " jags <- jags.model(textConnection(model), data = data, n.chains=4, n.adapt=n_adapt )
The input for the 6-month model was as follows:
$PERIOD [1] 26 $HOUSECOUNT [1] 5 $NUMPOLLS [1] 35 $PARTIES [1] 4 $primaryVotes Coalition Labor Greens Other [1,] 532 574 154 140 [2,] 560 518 168 154 [3,] 350 410 115 125 [4,] 439 450 139 127 [5,] 385 385 95 135 [6,] 375 395 120 110 [7,] 1465 1483 417 325 [8,] 504 602 154 140 [9,] 532 560 154 154 [10,] 504 602 154 140 [11,] 355 415 120 110 [12,] 412 483 141 141 [13,] 1345 1450 392 312 [14,] 375 405 100 120 [15,] 448 448 142 142 [16,] 588 504 168 140 [17,] 390 380 115 115 [18,] 441 453 139 128 [19,] 380 400 110 110 [20,] 471 425 126 126 [21,] 957 979 278 205 [22,] 405 360 125 110 [23,] 546 532 182 126 [24,] 471 413 126 138 [25,] 385 380 120 115 [26,] 1008 995 301 228 [27,] 400 375 115 110 [28,] 457 410 141 164 [29,] 690 656 185 151 [30,] 603 491 182 126 [31,] 415 355 125 105 [32,] 464 429 139 128 [33,] 1307 1218 385 273 [34,] 410 370 130 90 [35,] 479 433 152 105 $pollWeek [1] 1 1 2 2 6 8 8 9 9 10 10 10 10 12 12 13 14 14 16 16 17 18 19 19 20 [26] 21 22 22 24 24 24 24 24 26 26 $house [1] 1 2 3 4 3 3 5 1 2 1 3 4 5 3 4 2 3 4 3 4 5 3 2 4 3 5 3 4 1 2 3 4 5 3 4 $n [1] 1400 1400 1000 1155 1000 1000 3690 1400 1400 1400 1000 1177 3499 1000 1180 [16] 1400 1000 1161 1000 1148 2419 1000 1386 1148 1000 2532 1000 1172 1682 1402 [31] 1000 1160 3183 1000 1169 $preference_flows [1] 1.0000 0.0000 0.1697 0.5330
Update
This page has been updated. Updates were the result of further analysis, which identified glitches with the original model. The tightness of fit parameter in the temporal model was not resolving. I have returned to specifying the tightness of fit parameter in the model.Second, I have removed the dmulti() - the multinomial distribution step - in the temporal model and replaced it with simple arithmetic to calculate the multinomial. This makes the model a simpler Dirichlet-walk.
No comments:
Post a Comment