The model I currently run is very simple. It is based on four independent Gaussian processes, one for each primary vote series: Coalition, Labor, Greens and Others. In this regard it is very similar to the two-party-preferred (TPP) model.
The model has few internal constraints and it runs reasonably fast (in about three and a half minutes). However, because the four party tracks are sampled independently, the estimated vote shares only sum to 100 per cent in terms of the posterior mean/median for each parameter, not in each individual draw. For the improved speed (and the absence of pesky diagnostics) this was a worthwhile compromise.
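As a rough illustration of that compromise, the snippet below is a minimal sketch (assuming the results dictionary produced by fit.extract() in the Python script further down): individual posterior draws of the four primary vote shares scatter around 100 per cent, while the day-by-day totals of the posterior means sit very close to 100.

import numpy as np

# hidden_vote_share has shape (draws, n_days, n_parties)
share = results['hidden_vote_share']

# per-draw totals across the four parties: scattered around 100, not exactly 100
draw_totals = share.sum(axis=2)
print(np.round(draw_totals.min(), 2), np.round(draw_totals.max(), 2))

# daily totals of the posterior means: very close to 100
mean_totals = share.mean(axis=0).sum(axis=1)
print(np.round(mean_totals, 2))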
The Stan program includes a generated quantities code block in which we convert primary vote intentions to an estimated TPP vote share, based on preference flows at previous elections.
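For concreteness, the preference-flow arithmetic used in that block can be sketched in a few lines of Python. The primary vote shares below are hypothetical; the 2016 flow rates are the ones passed to the model in the script further down.

import numpy as np

# hypothetical primary vote shares (per cent): L/NP, ALP, GRN, OTH
primaries = np.array([39.0, 36.0, 10.0, 15.0])

# share of each party's primary vote flowing to the Coalition at the 2016 election
flows_2016 = np.array([0.9975, 0.0, 0.1806, 0.5075])

coalition_tpp = (primaries * flows_2016).sum()
print(round(coalition_tpp, 1))  # about 48.3 per cent to the Coalition; Labor gets the remainder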
While the TPP model operates in the range 0 to 1 (with vote share values typically between 0.45 and 0.55), the new model centers all of the primary vote observations around 100. This ensures that for each party the analysis of their vote share is well away from the edge of valid values for all model parameters. If we did not do this, the Greens vote share (often around 0.1) would be too close to the zero parameter boundary. Stan can get grumpy if it is asked to estimate a parameter close to a boundary.
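The centering step itself is simple. A minimal sketch of what the Python script below does before passing the data to Stan, using made-up poll numbers:

import pandas as pd

# two illustrative polls (per cent)
y = pd.DataFrame({'L/NP': [38.0, 39.0], 'ALP': [37.0, 36.0],
                  'GRN': [10.0, 10.5], 'OTH': [15.0, 14.5]})

center = 100
centreing_factors = center - y.mean()  # per-party shift so each series has a mean of 100
centered = y + centreing_factors       # a 10 per cent Greens poll becomes roughly 99.75

# the same factors are subtracted again in generated quantities to recover vote shares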
The key outputs from the new model follow. We will start with the primary vote share aggregation and an estimate of house effects for each party.
We can compare the primary vote shares for the major and minor parties.
And we can estimate the TPP vote shares for the Coalition and Labor, based on the preference flows from previous elections.
The code for the latest model is as follows.
// STAN: Primary Vote Intention Model
// Essentially a set of independent Gaussian processes from day-to-day
// for each party's primary vote, centered around a mean of 100

data {
    // data size
    int<lower=1> n_polls;
    int<lower=1> n_days;
    int<lower=1> n_houses;
    int<lower=1> n_parties;
    real<lower=0> pseudoSampleSigma;

    // centreing factors
    real<lower=0> center;
    real centreing_factors[n_parties];

    // poll data
    real<lower=0> centered_obs_y[n_parties, n_polls]; // poll data
    int<lower=1,upper=n_houses> house[n_polls];       // polling house
    int<lower=1,upper=n_days> poll_day[n_polls];      // day on which polling occurred

    // exclude the final n houses from the sum-to-zero constraint for houseEffects
    int<lower=0> n_exclude;

    // period of discontinuity and subsequent increased volatility event
    int<lower=1,upper=n_days> discontinuity; // start with a discontinuity
    int<lower=1,upper=n_days> stability;     // end - stability restored

    // day-to-day change
    real<lower=0> sigma;
    real<lower=0> sigma_volatile;

    // TPP preference flows
    vector<lower=0,upper=1>[n_parties] preference_flows_2010;
    vector<lower=0,upper=1>[n_parties] preference_flows_2013;
    vector<lower=0,upper=1>[n_parties] preference_flows_2016;
}

transformed data {
    int<lower=1> n_include = (n_houses - n_exclude);
}

parameters {
    matrix[n_days, n_parties] centre_track;
    matrix[n_houses, n_parties] pHouseEffects;
}

transformed parameters {
    matrix[n_houses, n_parties] houseEffects;
    for (p in 1:n_parties) {
        houseEffects[1:n_houses, p] =
            pHouseEffects[1:n_houses, p] - mean(pHouseEffects[1:n_include, p]);
    }
}

model {
    for (p in 1:n_parties) {
        // -- house effects model
        pHouseEffects[, p] ~ normal(0, 8.0); // weakly informative PRIOR

        // -- temporal model - with a discontinuity followed by increased volatility
        centre_track[1, p] ~ normal(center, 15); // weakly informative PRIOR
        centre_track[2:(discontinuity-1), p] ~
            normal(centre_track[1:(discontinuity-2), p], sigma);
        centre_track[discontinuity, p] ~ normal(center, 15); // weakly informative PRIOR
        centre_track[(discontinuity+1):stability, p] ~
            normal(centre_track[discontinuity:(stability-1), p], sigma_volatile);
        centre_track[(stability+1):n_days, p] ~
            normal(centre_track[stability:(n_days-1), p], sigma);

        // -- observational model
        centered_obs_y[p,] ~ normal(houseEffects[house, p] +
            centre_track[poll_day, p], pseudoSampleSigma);
    }
}

generated quantities {
    matrix[n_days, n_parties] hidden_vote_share;
    vector[n_days] tpp2010;
    vector[n_days] tpp2013;
    vector[n_days] tpp2016;

    for (p in 1:n_parties) {
        hidden_vote_share[, p] = centre_track[, p] - centreing_factors[p];
    }

    // aggregated TPP estimates based on past preference flows
    for (d in 1:n_days) {
        // note matrix transpose in next three lines
        tpp2010[d] = sum(hidden_vote_share'[, d] .* preference_flows_2010);
        tpp2013[d] = sum(hidden_vote_share'[, d] .* preference_flows_2013);
        tpp2016[d] = sum(hidden_vote_share'[, d] .* preference_flows_2016);
    }
}
# PYTHON: analyse primary poll data
import pandas as pd
import numpy as np
import pystan
import pickle
import sys
sys.path.append('../bin')
from stan_cache import stan_cache

# --- check version information
print('Python version: {}'.format(sys.version))
print('pystan version: {}'.format(pystan.__version__))

# --- curate the data for the model
# key settings
intermediate_data_dir = "./Intermediate/"  # analysis saved here

# preference flows
parties = ['L/NP', 'ALP', 'GRN', 'OTH']
preference_flows_2010 = [0.9975, 0.0, 0.2116, 0.5826]
preference_flows_2013 = [0.9975, 0.0, 0.1697, 0.5330]
preference_flows_2016 = [0.9975, 0.0, 0.1806, 0.5075]
n_parties = len(parties)

# polling data
workbook = pd.ExcelFile('./Data/poll-data.xlsx')
df = workbook.parse('Data')

# drop pre-2016 election data
df['MidDate'] = [pd.Period(d, freq='D') for d in df['MidDate']]
df = df[df['MidDate'] > pd.Period('2016-07-04', freq='D')]

# push One Nation into Other
df['ONP'] = df['ONP'].fillna(0)
df['OTH'] = df['OTH'] + df['ONP']

# set start date
start = df['MidDate'].min() - 1  # the first date is day 1
df['Day'] = df['MidDate'] - start  # day number for each poll
n_days = df['Day'].max()  # maximum days
n_polls = len(df)

# set discontinuity date - Turnbull's last day in office
discontinuity = pd.Period('2018-08-23', freq='D') - start  # UPDATE
stability = pd.Period('2018-10-01', freq='D') - start  # UPDATE

# manipulate polling data ...
y = df[parties]
center = 100
centreing_factors = center - y.mean()
y = y + centreing_factors

# add polling house data to the mix
# make sure the "sum to zero" exclusions are last in the list
houses = df['Firm'].unique().tolist()
exclusions = ['YouGov', 'Ipsos']
# Note: we are excluding YouGov and Ipsos from the sum to zero constraint
# because they have unusual poll results compared with other pollsters
for e in exclusions:
    assert(e in houses)
    houses.remove(e)
houses = houses + exclusions
map = dict(zip(houses, range(1, len(houses)+1)))
df['House'] = df['Firm'].map(map)
n_houses = len(df['House'].unique())
n_exclude = len(exclusions)

# sample metrics
sampleSize = 1000  # treat all polls as being of this size
pseudoSampleSigma = np.sqrt((50 * 50) / sampleSize)

# --- compile model
# get the STAN model
with open("./Models/primary simultaneous model.stan", "r") as f:
    model = f.read()

# encode the STAN model in C++
sm = stan_cache(model_code=model)

# --- fit the model to the data
ct_init = np.full([n_days, n_parties], center*1.0)
def initfun():
    return dict(centre_track=ct_init)

chains = 5
iterations = 2000
data = {
    'n_days': n_days,
    'n_polls': n_polls,
    'n_houses': n_houses,
    'n_parties': n_parties,
    'pseudoSampleSigma': pseudoSampleSigma,
    'centreing_factors': centreing_factors,
    'centered_obs_y': y.T,
    'poll_day': df['Day'].values.tolist(),
    'house': df['House'].values.tolist(),
    'n_exclude': n_exclude,
    'center': center,
    'discontinuity': discontinuity,
    'stability': stability,

    # let's set the day-to-day smoothing
    'sigma': 0.15,
    'sigma_volatile': 0.4,

    # preference flows at past elections
    'preference_flows_2010': preference_flows_2010,
    'preference_flows_2013': preference_flows_2013,
    'preference_flows_2016': preference_flows_2016
}

fit = sm.sampling(data=data, iter=iterations, chains=chains,
                  init=initfun, control={'max_treedepth': 13})
results = fit.extract()

# --- check diagnostics
print('Stan Finished ...')
import pystan.diagnostics as psd
print(psd.check_hmc_diagnostics(fit))

# --- save the analysis
with open(intermediate_data_dir + 'cat-' + 'output-primary-zero-sum.pkl', 'wb') as f:
    pickle.dump([df, sm, fit, results, data, centreing_factors, exclusions], f)
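To work with the saved analysis later (for example, when charting the aggregation), the pickle can be read back in. A minimal sketch, assuming the same pystan version is available in the new session:

import pickle

with open('./Intermediate/cat-output-primary-zero-sum.pkl', 'rb') as f:
    (df, sm, fit, results, data,
     centreing_factors, exclusions) = pickle.load(f)

print(results['hidden_vote_share'].shape)  # (draws, n_days, n_parties)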