Bayesian Aggregation

This page provides some technical background on the Bayesian poll aggregation models used on this site for the 2022 Federal election.

The data aggregation or data fusion models I use are probably best described as state space models or dynamic linear models. They are sometimes incorrectly called hidden Markov models (HMMs). Although these models are very similar to HMMs, their hidden state variables are continuous rather than discrete (as they are in HMMs). The models are analogous to the Kalman filter, which is an estimation method for linear state space models.
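
To make the Kalman filter analogy concrete, here is a minimal sketch of a one-dimensional Kalman filter for a "local level" model, in which a continuous hidden state drifts from day to day and is observed only occasionally. It is an illustration only, not the model used on this site. The function name, the variances q and r, and the poll values are all assumptions made up for the example.

```python
def kalman_local_level(observations, q, r, x0, p0):
    """Filter a 1-D local-level model: x_t = x_{t-1} + w_t, y_t = x_t + v_t.

    q is the variance of the daily state innovation w_t and r is the
    measurement variance of v_t. An observation of None means no poll
    was published that day. Returns the filtered means and variances.
    """
    x, p = x0, p0
    means, variances = [], []
    for y in observations:
        # predict step: the state carries over and its uncertainty grows
        p = p + q
        if y is not None:
            # update step: blend the prediction with the new observation
            k = p / (p + r)  # Kalman gain
            x = x + k * (y - x)
            p = (1.0 - k) * p
        means.append(x)
        variances.append(p)
    return means, variances


# hypothetical poll readings (per cent), with gaps on days without a poll
polls = [51.0, None, None, 52.0, None, 50.5]
means, variances = kalman_local_level(polls, q=0.15 ** 2, r=1.5 ** 2, x0=50.0, p0=4.0)
```

On days without a poll the estimate stays where it was and its variance grows; each new poll pulls the estimate towards the observation and shrinks the variance.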

The models are implemented as a Bayesian hierarchical model. Originally I used JAGS to fit these models. For the 2019 Australian election, I moved to Stan (accessed from Python using PyStan). For the 2022 election I am using PyMC, a probabilistic programming package in the Python ecosystem.

I model the national voting intention (which cannot be observed directly; it is "hidden") for each and every day of the period under analysis. The only time the national voting intention is not hidden is at an election. In some models (known as anchored models), we use the election result to anchor the daily model.

In the language of modelling, our estimates of the national voting intention for each day being modelled are known as states. These "states" link together to form a process in which each state depends directly on the previous state, through a probability distribution that links the two. In plain English, the models assume that the national voting intention today is much like it was yesterday.
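
The idea that today is much like yesterday can be sketched as a Gaussian random walk, where each day's state is the previous day's state plus a small random step. The sketch below only simulates such a path; the function name, the starting value of 50 per cent and the seed are assumptions chosen for illustration.

```python
import random


def simulate_voting_intention(n_days, start, innovation_sd, seed=0):
    """Simulate a Gaussian random walk of daily voting intention (per cent)."""
    rng = random.Random(seed)
    states = [start]
    for _ in range(n_days - 1):
        # today's state is yesterday's state plus a small normal step
        states.append(states[-1] + rng.gauss(0.0, innovation_sd))
    return states


path = simulate_voting_intention(n_days=100, start=50.0, innovation_sd=0.15)
largest_step = max(abs(b - a) for a, b in zip(path, path[1:]))
```

With a small daily standard deviation, the simulated path moves only a fraction of a percentage point from one day to the next, yet can still drift a long way over many months.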

The model is informed by irregular and noisy data from the selected polling houses. The challenge for the model is to ignore the noise and find the underlying signal. In effect, the model is solved by finding the day-to-day pathway with the maximum likelihood given the known poll results.
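
To give a feel for how noisy a single poll is, the sketch below computes the textbook standard error of a poll result under simple random sampling. This is an assumption for illustration: real polls are weighted and their effective sample sizes are smaller, and this is not necessarily how the measurement error is set in my models. The function name is hypothetical.

```python
import math


def poll_standard_error(share_pct, sample_size):
    """Standard error, in percentage points, of a poll share under
    simple random sampling."""
    p = share_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / sample_size)


# a poll of 1000 respondents reporting a 50 per cent share
se = poll_standard_error(50.0, 1000)
```

For a poll of about 1000 respondents the standard error is roughly one and a half percentage points, which is large relative to the day-to-day movement the model is trying to detect.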

To improve the robustness of the model, we make provision for the long-run tendency of each polling house to systematically favour either the Coalition or Labor. We call this small tendency to favour one side or the other a "house effect". The model assumes that the results from each pollster diverge (on average) from the real population voting intention by a small, constant number of percentage points. We use the calculated house effect to adjust the raw polling data from each polling house.
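
The adjustment itself is simple arithmetic, as the following sketch shows. It assumes the convention that a house effect is the pollster's average reading minus the true value, so the effect is subtracted from the raw result. The pollster names and all the numbers are hypothetical.

```python
def adjust_for_house_effects(polls, house_effects):
    """Subtract each pollster's house effect (percentage points) from its polls."""
    return [(house, result - house_effects[house]) for house, result in polls]


# hypothetical pollsters and numbers, for illustration only
house_effects = {"Pollster A": 1.0, "Pollster B": -0.5}
polls = [("Pollster A", 52.0), ("Pollster B", 50.0), ("Pollster A", 51.5)]
adjusted = adjust_for_house_effects(polls, house_effects)
```

Here a pollster that reads one point high has one point taken off each of its results, and a pollster that reads half a point low has half a point added.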

In estimating the house effects, we can take one of a number of approaches. We could:

  • anchor the model to an election result on a particular day, and use that anchoring to establish the house effects;
  • anchor the model to a particular polling house or houses; or
  • assume that collectively the polling houses are unbiased, and that collectively their house effects sum to zero.

Currently, I tend to favour the third approach in my models.
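
The sum-to-zero assumption amounts to subtracting the mean house effect from every house effect, which mirrors the `unanchored_house_bias - unanchored_house_bias.mean()` expression in the model code further down this page. A minimal sketch, with made-up numbers and a hypothetical function name:

```python
def centre_to_zero_sum(raw_effects):
    """Shift house effects so that they sum to zero across pollsters."""
    mean = sum(raw_effects) / len(raw_effects)
    return [effect - mean for effect in raw_effects]


# hypothetical unanchored house effects, in percentage points
raw = [1.5, -0.5, 2.0, 1.0]
centred = centre_to_zero_sum(raw)
```

Centring leaves the differences between pollsters unchanged. It only fixes the overall level, which is exactly the part the polls cannot determine on their own.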

The problem with anchoring the model to an election outcome (or to a particular polling house) is that pollsters are constantly reviewing and, from time to time, changing their polling practice. Over time these changes affect the reliability of anchored models. On the other hand, the sum-to-zero assumption is rarely correct. Nonetheless, in some previous elections, people whose models were anchored to the previous election did worse than people whose models averaged the bias across all polling houses.

Solving a model necessitates integration over a series of complex multidimensional probability distributions. The definite integral is typically impossible to solve algebraically. But it can be solved using a numerical method based on Markov chains and random numbers known as Markov Chain Monte Carlo (MCMC) integration. As noted above, I use PyMC to solve these models.
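
To show the idea behind MCMC on a toy problem, here is a minimal random-walk Metropolis sampler that estimates the mean of a handful of made-up poll results, assuming a flat prior and a known measurement standard deviation. It is a sketch of the general technique only. PyMC's own samplers are considerably more sophisticated, and the function names, data and tuning values here are assumptions.

```python
import math
import random


def log_posterior(mu, data, sd):
    """Log posterior of mu, up to a constant: flat prior, normal likelihood."""
    return -sum((y - mu) ** 2 for y in data) / (2.0 * sd ** 2)


def metropolis(data, sd, n_samples, step, start, seed=0):
    """Random-walk Metropolis sampler for the mean mu."""
    rng = random.Random(seed)
    current = start
    current_lp = log_posterior(current, data, sd)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, data, sd)
        # accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


data = [51.0, 52.5, 50.0, 51.5]  # hypothetical poll results (per cent)
samples = metropolis(data, sd=1.5, n_samples=20000, step=1.0, start=50.0)
kept = samples[2000:]  # discard the early burn-in draws
posterior_mean = sum(kept) / len(kept)
```

The average of the retained draws approximates the integral that defines the posterior mean, without that integral ever being solved algebraically. The full model works the same way, but over many hundreds of parameters at once.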

The specific model I use, as coded in PyMC, is set out in the following code block.

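import pymc as pm  # assumption: on older installs this would be "import pymc3 as pm"

innovation = 0.15 # from experience ...
model = pm.Model()
with model:
    # priors: one house effect per polling house
    unanchored_house_bias = pm.Cauchy("unanchored_house_bias",
        alpha=0, beta=10, shape=n_brands)
    # centre the house effects so that they sum to zero
    zero_sum_house_bias = pm.Deterministic('zero_sum_house_bias',
        var=(unanchored_house_bias - unanchored_house_bias.mean()))

    # temporal model: the hidden daily voting intention
    grw = pm.GaussianRandomWalk("grw", mu=0, sigma=innovation, shape=n_days)

    # the observational model
    # NOTE: the first term of mu is missing from this listing; grw[poll_day.to_list()]
    # is a reconstruction, and the name poll_day is an assumption
    observed = pm.Normal("observed",
        mu=grw[poll_day.to_list()]
            + zero_sum_house_bias[poll_brand.to_list()],
        sigma=measurement_error_sd, observed=zero_centered_y)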

Graphically the primary model is as follows. On the day I took this image, there were 4 pollsters in my data set, 108 polls, and 861 days from the first to last poll.

This modelling is based on the work of Simon Jackman in Bayesian Analysis for the Social Sciences (2009).

The complete code base is available on my GitHub site.