### General overview

The aggregation or data fusion models I use are probably best described as state space models or latent process models. They are also known as hidden Markov models. I model the national voting intention (which cannot be observed directly; it is "hidden") for every day of the period under analysis. The only time the national voting intention is not hidden is at an election. In some models (known as anchored models), we use the election result to anchor the daily model.

In the language of modelling, our estimates of the national voting intention for each day being modelled are known as states. These "states" link together to form a process in which each state depends directly on the previous state, via a probability distribution linking the two. In plain English, the models assume that the national voting intention today is much like it was yesterday.
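This random-walk assumption can be sketched in a few lines of Python (the value of sigma here is purely illustrative, not one estimated from real polls):

```python
# Illustrative sketch: simulate the day-to-day random walk assumed by
# the temporal model. Each day's state is the previous day's state plus
# a small normal shock, so today is "much like yesterday".
import numpy as np

rng = np.random.default_rng(42)
n_days = 100
sigma = 0.001                # illustrative daily std dev on the 0-1 scale

shocks = rng.normal(0.0, sigma, size=n_days - 1)
voting_intention = 0.5 + np.concatenate([[0.0], np.cumsum(shocks)])

# the day-to-day moves are tiny relative to typical polling noise
daily_moves = np.abs(np.diff(voting_intention))
print(round(daily_moves.max(), 4))
```

Over weeks and months, the small daily shocks accumulate, which is what lets the hidden series drift to follow genuine shifts in voting intention.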

The model is informed by irregular and noisy data from the selected polling houses. The challenge for the model is to ignore the noise and find the underlying signal. In effect, the model is solved by finding the day-to-day pathway with the maximum likelihood given the known poll results.

To improve the robustness of the model, we make provision for the long-run tendency of each polling house to systematically favour either the Coalition or Labor. We call this small tendency to favour one side or the other a "house effect". The model assumes that the results from each pollster diverge (on average) from the real population voting intention by a small, constant number of percentage points. We use the calculated house effect to adjust the raw polling data from each polling house.
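As a sketch of that adjustment (the pollster names and house effects below are invented for illustration, not estimates from any model):

```python
# Hypothetical sketch: correct raw poll results for each pollster's
# long-run house effect. Names and numbers are made up.
house_effects = {           # positive = tends to favour the Coalition
    'Pollster A':  0.01,    # +1 percentage point on the 0-1 scale
    'Pollster B': -0.01,    # -1 percentage point
}

polls = [('Pollster A', 0.49), ('Pollster B', 0.47)]

# subtracting the house effect recovers a bias-adjusted estimate
adjusted = [(house, round(share - house_effects[house], 4))
            for house, share in polls]
print(adjusted)  # both pollsters now agree at 0.48
```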

In estimating the house effects, we can take one of a number of approaches. We could:

- anchor the model to an election result on a particular day, and use that anchoring to establish the house effects;
- anchor the model to a particular polling house or houses; or
- assume that collectively the polling houses are unbiased, and that collectively their house effects sum to zero.

Currently, I tend to favour the third approach in my models.
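A minimal sketch of the third approach: however the raw house effects come out, subtracting their mean forces them to sum to zero (the raw values below are invented for illustration):

```python
# Sketch of the sum-to-zero assumption: centre the raw house effects
# so that, collectively, the pollsters are treated as unbiased.
import numpy as np

raw_effects = np.array([0.012, -0.005, 0.003, -0.008, 0.006, 0.001])
house_effects = raw_effects - raw_effects.mean()

# the centred effects sum to zero (up to floating-point error)
print(np.isclose(house_effects.sum(), 0.0))  # True
```

This mean-centring step is exactly what the Stan model later performs on its raw `pHouseEffects` parameters.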

The problem with anchoring the model to an election outcome (or to a particular polling house) is that pollsters constantly review and, from time to time, change their polling practices. Over time these changes degrade the reliability of anchored models. On the other hand, the sum-to-zero assumption is rarely exactly correct. Nonetheless, at previous elections, models anchored to the preceding election result fared worse than models that averaged the bias across all polling houses.

Solving a model of this kind requires integration over a series of complex, multidimensional probability distributions. The definite integral is typically impossible to solve analytically, but it can be approximated using a numerical method based on Markov chains and random numbers: Markov chain Monte Carlo (MCMC) integration. I use a free software package called Stan to solve these models.
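To give a feel for the idea, here is a toy Metropolis random-walk sampler (this is not what Stan does internally; Stan uses the more sophisticated Hamiltonian Monte Carlo approach, but the Markov-chains-plus-random-numbers principle is the same):

```python
# Toy Metropolis sampler: draw from a normal(0.5, 0.01) "posterior"
# using only random numbers and an unnormalised log density.
import numpy as np

rng = np.random.default_rng(0)

def log_post(theta):
    # unnormalised log density of normal(mean=0.5, sd=0.01)
    return -0.5 * ((theta - 0.5) / 0.01) ** 2

theta = 0.5
samples = []
for _ in range(20000):
    proposal = theta + rng.normal(0.0, 0.01)   # random-walk proposal
    # accept with probability min(1, post(proposal) / post(theta))
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal
    samples.append(theta)

samples = np.array(samples[5000:])             # discard warm-up
print(round(samples.mean(), 2))                # close to 0.5
```

The chain wanders around the high-density region, so the retained samples approximate the target distribution; averages over those samples stand in for the intractable integrals.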

### Model for TPP voting intention with house effects summed to zero

This is the simplest model. It has three parts:

- The *observed data model* or *measurement model* assumes two factors explain the difference between published poll results (what we observe/measure) and the national voting intention on a particular day (which, with the exception of elections, is hidden):
  - The first factor is the margin of error from classical statistics. This is the random error associated with selecting a sample; however, because I have not collected sample-size information, I have assumed all surveys are of the same size.
  - The second factor is the systematic bias (house effect) that affects each pollster's published estimate of the population voting intention.
- The *temporal* part of the model assumes that the actual population voting intention on any day is much the same as it was on the previous day. The model estimates the (hidden) population voting intention for every day under analysis.
- The *house effects* part of the model assumes that the house effects from a core set of pollsters sum to zero. Typically I place all pollsters in this core set. The polling data from any houses not in the core set affects the shape of the aggregate poll estimate, but not its vertical position on the chart.
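The first of these factors can be made concrete. Treating every poll as a simple random sample of an assumed common size n, the worst-case standard error of an estimated proportion is sqrt(0.5 × 0.5 / n), which is the `pseudoSampleSigma` passed to the model:

```python
# Compute the assumed common standard deviation for all polls,
# using the worst-case (p = 0.5) standard error of a proportion.
import math

sample_size = 1000                  # assumed size for every poll
pseudo_sample_sigma = math.sqrt(0.5 * 0.5 / sample_size)

print(round(pseudo_sample_sigma, 4))         # 0.0158 on the unit scale
print(round(1.96 * pseudo_sample_sigma, 3))  # 0.031, the familiar +/- 3%
```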

This model is based on original work by Professor Simon Jackman. It takes advantage of Stan's vectorised operations, and Stan runs the five chains concurrently in under 90 seconds on my machine (a Linux virtual machine on a Windows-based Ryzen 1800X). I use a virtual Linux machine because Stan does not run its chains in parallel under Windows.

```stan
// STAN: Two-Party Preferred (TPP) Vote Intention Model

data {
    // data size
    int<lower=1> n_polls;
    int<lower=1> n_days;
    int<lower=1> n_houses;

    // assumed standard deviation for all polls
    real<lower=0> pseudoSampleSigma;

    // we are only going to normalise house effects from first n houses
    int<lower=1,upper=n_houses> n_core_set;

    // poll data
    vector<lower=0,upper=1>[n_polls] y; // TPP vote share
    int<lower=1> house[n_polls];
    int<lower=1> day[n_polls];
}

transformed data {
    // specify sigma in this block if you do not
    // want the model to derive a value for
    // the standard deviation on day-to-day
    // changes in the value of hidden_vote_share
    // note: smaller values of sigma are more smoothed
    // real sigma = 0.001;
}

parameters {
    vector[n_days] hidden_vote_share;
    vector[n_houses] pHouseEffects;
    // specify sigma here for a model-derived value
    real<lower=0> sigma; // SD of day-to-day change
}

transformed parameters {
    vector[n_houses] houseEffect;

    // house effects sum to zero over the first n_core_set houses
    // this allows you to specify a "trusted" set of pollsters
    houseEffect[1:n_core_set] = pHouseEffects[1:n_core_set]
        - mean(pHouseEffects[1:n_core_set]);
    if (n_core_set < n_houses)
        houseEffect[(n_core_set+1):n_houses] = pHouseEffects[(n_core_set+1):n_houses];
}

model {
    // -- temporal model [this is the hidden state-space model]
    // comment out the next line if sigma is not model-derived
    sigma ~ cauchy(0, 0.0025); // half-Cauchy prior
    hidden_vote_share[1] ~ normal(0.5, 0.0333); // prior: TPP between 40% and 60%
    hidden_vote_share[2:n_days] ~ normal(hidden_vote_share[1:(n_days-1)], sigma);

    // -- house effects model
    pHouseEffects ~ normal(0, 0.025); // prior: expect up to +/- 5 percentage points

    // -- observed data / measurement model
    y ~ normal(houseEffect[house] + hidden_vote_share[day], pseudoSampleSigma);
}
```

For those of you worried that sigma in the temporal model might be overly constrained (sigma is the standard deviation of the day-to-day change in voting intention), the sampled values sit comfortably within the prior. The chart below shows the samples for sigma typically lie between 0.0002 and 0.002 on the unit scale (0.02 to 0.2 percentage points). Our prior for sigma was a half-Cauchy with scale 0.0025, which places half of its mass between 0 and 0.0025 (0 to 0.25 percentage points) while its heavy tail still allows much larger values.
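A quick check of that prior with scipy (assuming the half-Cauchy with scale 0.0025 used in the Stan model above):

```python
# Inspect the half-Cauchy prior on sigma: cauchy(0, 0.0025)
# with a lower bound of zero is a half-Cauchy with scale 0.0025.
from scipy.stats import halfcauchy

prior = halfcauchy(scale=0.0025)

# half the prior mass sits below the scale parameter ...
print(round(prior.median(), 4))      # 0.0025
# ... but the heavy tail still leaves real mass above 0.02
print(round(1 - prior.cdf(0.02), 3))
```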

The supporting Python code for running this model follows. Note: I have a further Python program that generates the charts from the saved analysis.

```python
# PYTHON: analyse TPP poll data
import pandas as pd
import numpy as np
import pystan
import pickle
import sys
sys.path.append('../bin')
from stan_cache import stan_cache

# --- version information
print('Python version: {}'.format(sys.version))
print('pystan version: {}'.format(pystan.__version__))

# --- key inputs to model
sampleSize = 1000 # treat all polls as being of this size
pseudoSampleSigma = np.sqrt((0.5 * 0.5) / sampleSize)
chains = 5
iterations = 1000 # note: half of the iterations will be warm-up

# --- collect the model data
# the Excel data file was extracted from the Wikipedia
# page on the next Australian Federal Election
workbook = pd.ExcelFile('./Data/poll-data.xlsx')
df = workbook.parse('Data')

# drop pre-2016 election data
df['MidDate'] = [pd.Period(date, freq='D') for date in df['MidDate']]
df = df[df['MidDate'] > pd.Period('2016-07-04', freq='D')]

# convert dates to days from start
start = df['MidDate'].min() - 1 # day zero
df['Day'] = df['MidDate'] - start # day number for each poll
n_days = df['Day'].max()
n_polls = len(df)

# get houses - specify the core set that is
# used for the houseEffect sum-to-zero constraint
core_set = ['Essential', 'ReachTEL', 'Newspoll', 'YouGov', 'Ipsos', 'Roy Morgan']
n_core_set = len(core_set)
for h in df['Firm'].unique():
    if h not in core_set:
        core_set.append(h)
house_map = {}
i = 1
for h in core_set:
    house_map[h] = i
    i += 1
df['House'] = df['Firm'].map(house_map)
n_houses = len(df['House'].unique())

# batch up
data = {
    'n_days': n_days,
    'n_polls': n_polls,
    'n_houses': n_houses,
    'pseudoSampleSigma': pseudoSampleSigma,
    'n_core_set': n_core_set,
    'y': (df['TPP L/NP'] / 100.0).values,
    'day': df['Day'].astype(int).values,
    'house': df['House'].astype(int).values
}

# --- get the STAN model
with open('./Models/TPP model.stan', 'r') as f:
    model = f.read()

# --- compile/retrieve model and run samples
sm = stan_cache(model_code=model)
fit = sm.sampling(data=data, iter=iterations, chains=chains)
results = fit.extract()

# --- save analysis
intermediate_data_dir = './Intermediate/'
with open(intermediate_data_dir + 'output-TPP-zero-sum.pkl', 'wb') as f:
    pickle.dump([results, df, data], f)
```

An example of the output from this model (as at 30 March 2018) follows. This chart shows the Coalition's estimated two-party preferred (TPP) hidden vote share for each day since early July 2016.

The next chart shows the estimated house effect for each pollster, given the assumption that the effects across the core set (in this case, all six pollsters) sum to zero.

### Model for primary voting intention with house effects summed to zero

The model for primary voting intention is conceptually similar to the TPP model. Here we estimate the hidden vote shares for the four key party groups simultaneously. To do this, the model was changed as follows:

- The polling data in the *observed data* or *measurement model* is expressed as a two-dimensional array, with one row for each opinion poll. Each row comprises four integers (a multinomial) that sum to the (assumed) sample size for the poll. The columns of this array are the party groups: Coalition, Labor, Greens and Others. The model uses the multinomial distribution to fit the estimated hidden vote shares to the polling data. In addition to the raw poll data, two further arrays provide information on each poll: the house array indicates which pollster conducted the poll, and the poll_day array holds the mid-point of the polling period.
- Because the hidden vote shares across the four party groups sum to one, there is redundant information. Consequently, the *temporal* part of the model is only developed for the first three parties. The vote share for the final party follows simply as $$partyVoteShare_{n}=1-\sum_{i=1}^{n-1}partyVoteShare_i$$ Because some parties have a small vote share, close to the zero boundary, we undertake the temporal analysis using centred log-ratios or centred logits. Converting vote shares to logits removes the boundary condition; centring the logits means that the day-to-day changes for each series can share the same standard deviation. As in the TPP model, the day-to-day change in hidden vote shares is assumed to be normally distributed around zero. In the transformed parameters section of the Stan program (below), the \(n-1\) centred logits are converted to an \(n\)-simplex of hidden vote-share intentions. A simplex is a mathematical grouping in which all values lie between 0 and 1 and collectively sum to 1. This simplex is then used in the measurement model (as noted above).
- The *house effects* part of the model is doubly constrained: for each polling house, the effects sum to zero across the four parties; and the house effects for each party sum to zero across the (currently) six houses. This double constraint is necessary to consistently anchor the result.
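The centred-logit step can be sketched outside Stan (the centreing factors and starting logits below are illustrative placeholders, not model output):

```python
# Sketch: map n-1 unconstrained centred logits back to an n-simplex
# of vote shares, mirroring the transformed parameters block.
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

centreing_factors = np.array([-0.10, -0.14, -0.40])  # per-party offsets
logits = np.array([0.0, 0.0, 0.0])                   # centred starting point

shares = inv_logit(logits) + centreing_factors       # first n-1 parties
shares = np.append(shares, 1.0 - shares.sum())       # final party is the remainder

print(shares)        # all values in (0, 1) ...
print(shares.sum())  # ... and they sum to 1: a simplex
```

Working on the unconstrained logit scale lets the random walk run freely, while the inverse-logit mapping keeps every implied vote share inside its natural bounds.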

```stan
// STAN: Primary Vote Intention Model using Centred Logits

data {
    // data size
    int<lower=1> n_polls;
    int<lower=1> n_days;
    int<lower=1> n_houses;
    int<lower=1> n_parties;
    int<lower=1> pseudoSampleSize;

    // centreing factors
    real centreing_factors[n_parties-1];

    // poll data
    int<lower=1,upper=pseudoSampleSize> y[n_polls, n_parties]; // poll data multinomials
    int<lower=1,upper=n_houses> house[n_polls]; // polling house
    int<lower=1,upper=n_days> poll_day[n_polls]; // day on which polling occurred

    // TPP preference flows
    row_vector<lower=0,upper=1>[n_parties] preference_flows_2010;
    row_vector<lower=0,upper=1>[n_parties] preference_flows_2013;
    row_vector<lower=0,upper=1>[n_parties] preference_flows_2016;
}

transformed data {
    // specify sigma in this block if you do not
    // want the model to derive a value for
    // the standard deviation on day-to-day
    // changes in the value of hidden_vote_share
    // note: smaller values of sigma are more smoothed
    // real sigma = 0.003; // note: on the logit scale near the origin
}

parameters {
    row_vector[n_days] centredLogits[n_parties-1];
    matrix[n_houses-1, n_parties-1] houseAdjustment;
    // comment out the next line if you do not want the model to find sigma
    real<lower=0> sigma;
}

transformed parameters {
    matrix[n_parties, n_days] hidden_voting_intention;
    vector<lower=-0.2,upper=0.2>[n_parties] tHouseAdjustment[n_houses];
    row_vector[n_days] tmp;

    // -- house effects - two-direction sum-to-zero constraints
    for (h in 1:(n_houses-1))
        for (p in 1:(n_parties-1))
            tHouseAdjustment[h][p] = houseAdjustment[h][p];
    for (p in 1:(n_parties-1))
        tHouseAdjustment[n_houses][p] = -sum(col(houseAdjustment, p));
    for (h in 1:n_houses) {
        tHouseAdjustment[h][n_parties] = 0; // get rid of the NAN
        tHouseAdjustment[h][n_parties] = -sum(tHouseAdjustment[h]);
    }

    // -- convert centred logits to a simplex of hidden voting intentions
    tmp = rep_row_vector(0, n_days);
    for (p in 1:(n_parties-1)) {
        hidden_voting_intention[p] = inv_logit(centredLogits[p]) + centreing_factors[p];
        tmp = tmp + hidden_voting_intention[p];
    }
    hidden_voting_intention[n_parties] = 1.0 - tmp;
}

model {
    matrix[n_parties, n_polls] hvi_on_poll_day;

    // -- house effects model
    for (h in 1:(n_houses-1))
        houseAdjustment[h] ~ normal(0, 0.015);

    // -- temporal model [AKA the hidden state-space model]
    // note: 0.02 near the centre --> roughly a std dev of half a per cent
    // comment out the next line if you do not want the model to find sigma
    sigma ~ normal(0, 0.02); // half-normal prior - note: on the logit scale
    for (p in 1:(n_parties-1)) {
        centredLogits[p][1] ~ normal(0, 0.15); // centred starting point 50% +/- 5%
        centredLogits[p][2:n_days] ~ normal(centredLogits[p][1:(n_days-1)], sigma);
    }

    // -- observed data model [AKA the measurement model]
    for (p in 1:n_parties)
        hvi_on_poll_day[p] = hidden_voting_intention[p][poll_day];
    for (poll in 1:n_polls)
        // note the matrix transpose in the next statement ...
        y[poll] ~ multinomial(to_vector(hvi_on_poll_day'[poll])
            + tHouseAdjustment[house[poll]]);
}

generated quantities {
    // aggregated TPP estimates based on past preference flows
    vector[n_days] tpp2010;
    vector[n_days] tpp2013;
    vector[n_days] tpp2016;
    for (d in 1:n_days) {
        // note the matrix transpose in the next three lines
        tpp2010[d] = sum(hidden_voting_intention'[d] .* preference_flows_2010);
        tpp2013[d] = sum(hidden_voting_intention'[d] .* preference_flows_2013);
        tpp2016[d] = sum(hidden_voting_intention'[d] .* preference_flows_2016);
    }
}
```

The Python program to run this Stan model follows. The key thing to note is that we initialise a number of the key variables in Stan to zero; the model does not run without this step. Even with this step, it throws quite a few warning messages early in the warm-up stage.

```python
# PYTHON: analyse primary logit poll data
import pandas as pd
import numpy as np
import pystan
import pickle
import sys
sys.path.append('../bin')
from stan_cache import stan_cache

# --- check version information
print('Python version: {}'.format(sys.version))
print('pystan version: {}'.format(pystan.__version__))

# --- curate the data for the model
# key settings
intermediate_data_dir = "./Intermediate/" # analysis saved here
pseudoSampleSize = 1000

# preference flows
parties = ['L/NP', 'ALP', 'GRN', 'OTH']
preference_flows_2010 = [0.9975, 0.0, 0.2116, 0.5826]
preference_flows_2013 = [0.9975, 0.0, 0.1697, 0.5330]
preference_flows_2016 = [0.9975, 0.0, 0.1806, 0.5075]

# polling data
workbook = pd.ExcelFile('./Data/poll-data.xlsx')
df = workbook.parse('Data')

# drop pre-2016 election data
df['MidDate'] = [pd.Period(d, freq='D') for d in df['MidDate']]
df = df[df['MidDate'] > pd.Period('2016-07-04', freq='D')]

# push One Nation into Other
df['ONP'] = df['ONP'].fillna(0)
df['OTH'] = df['OTH'] + df['ONP']

# set start date
start = df['MidDate'].min() - 1 # the first date is day 1
df['Day'] = df['MidDate'] - start # day number for each poll
n_days = df['Day'].max() # maximum days
n_polls = len(df)

# convert poll results to multinomials
y = df[parties].div(100.0)
y = y.div(y.sum(axis=1), axis=0) # normalise rows
y = y.mul(pseudoSampleSize).round(0).astype(int)
edit = pseudoSampleSize - y.sum(axis=1)
y[y.columns[-1]] += edit # ensure row-sum integrity
n_parties = len(y.columns)

centreing_factors = np.zeros(n_parties-1)
for i in range(n_parties-1):
    centreing_factors[i] = (y[y.columns[i]].mean() / pseudoSampleSize) - 0.5
print('Centreing factors: {}'.format(centreing_factors))

# add polling house data to the mix
df['House'] = pd.Categorical(df['Firm']).codes + 1
n_houses = len(df['House'].unique())

# batch up into a dictionary
data = {
    'n_days': n_days,
    'n_polls': n_polls,
    'n_houses': n_houses,
    'n_parties': n_parties,
    'pseudoSampleSize': pseudoSampleSize,
    'centreing_factors': centreing_factors,
    'y': y.astype(int).values,
    'poll_day': df['Day'].values.tolist(),
    'house': df['House'].values.tolist(),
    'preference_flows_2010': preference_flows_2010,
    'preference_flows_2013': preference_flows_2013,
    'preference_flows_2016': preference_flows_2016
}

# --- compile the model
# get the STAN model
with open('./Models/primary logit model.stan', 'r') as f:
    model = f.read()

# encode the STAN model in C++
sm = stan_cache(model_code=model, model_name='primary_logit')

# --- fit the model to the data
chains = 5
iterations = 1000

# initialise the centred logits and house adjustments to zero;
# the model does not run without this step
cl_init = np.zeros((n_parties-1, n_days))
ha_init = np.zeros((n_houses-1, (n_parties-1)))
def initfun():
    return dict(centredLogits=cl_init, houseAdjustment=ha_init)

fit = sm.sampling(data=data, iter=iterations, chains=chains, init=initfun)
results = fit.extract()

# --- save the analysis
with open(intermediate_data_dir + 'output-primary-logit-zero-sum.pkl', 'wb') as f:
    pickle.dump([results, df, data, parties], f)
```

Final note: I explored a Dirichlet process in Stan, similar to the one I developed in JAGS for the 2016 election. I am not using that model because it takes some 30 minutes to run (i.e. it is more than three times slower than the centred-logit model above). You can see the Stan version of the Dirichlet model here.