Saturday, November 17, 2018

Updated primary vote share model

For the 2019 election, I have explored a number of models for aggregating the primary vote shares, including models based on Dirichlet processes and centered logits. These older models were complicated and slow. They included constraints to ensure vote shares always summed to 100 per cent for every sample from the posterior distribution.

The model I currently run is very simple. It is based on four independent Gaussian processes for each primary vote series: Coalition, Labor, Greens and Others. In this regard it is very similar to the two-party-preferred (TPP) model.

The model has few internal constraints and it runs reasonably fast (in about 3 and half minutes). However the parameters are sampled independently, and only sum to 100 per cent in terms of the mean/median for each parameter. For the improved speed (and the absence of pesky diagnostics) this was a worthwhile compromise.

The Stan program includes a generated quantities code block in which we convert primary vote intentions to an estimated TPP vote share, based on preference flows at previous elections.

While the TPP model operates in the range 0 to 1 (with vote share values typically between 0.45 and 0.55), the new model centers all of the primary vote observations around 100. This ensures that for each party the analysis of their vote shares are well away from the edge of valid values for all model parameters. If we did not do this, the Greens vote share (often around 0.1) would be too close to the zero parameter boundary. Stan can get grumpy if it is being asked to estimate a parameter close to a boundary. 

The key outputs from the new model follow. We will start with the primary vote share aggregation and an estimate of house effects for each party.









We can compare the primary vote shares for the major and minor parties.



And we can estimate the TPP vote shares for the Coalition and Labor, based on the preference flows from previous elections





The code for the latest model is as follows.
// STAN: Primary Vote Intention Model
// Essentially a set of independent Gaussian processes from day-to-day
// for each party's primary vote, centered around a mean of 100

data {
    // data size
    int<lower=1> n_polls;
    int<lower=1> n_days;
    int<lower=1> n_houses;
    int<lower=1> n_parties;
    real<lower=0> pseudoSampleSigma;
    
    // Centreing factors 
    real<lower=0> center;
    real centreing_factors[n_parties];
    
    // poll data
    real<lower=0> centered_obs_y[n_parties, n_polls]; // poll data
    int<lower=1,upper=n_houses> house[n_polls]; // polling house
    int<lower=1,upper=n_days> poll_day[n_polls]; // day on which polling occurred

    //exclude final n parties from the sum-to-zero constraint for houseEffects
    int<lower=0> n_exclude;
    
    // period of discontinuity and subsequent increased volatility event
    int<lower=1,upper=n_days> discontinuity; // start with a discontinuity
    int<lower=1,upper=n_days> stability; // end - stability restored
    
    // day-to-day change
    real<lower=0> sigma;
    real<lower=0> sigma_volatile;

    // TPP preference flows
    vector<lower=0,upper=1>[n_parties] preference_flows_2010;
    vector<lower=0,upper=1>[n_parties] preference_flows_2013;
    vector<lower=0,upper=1>[n_parties] preference_flows_2016;
}

transformed data {
    int<lower=1> n_include = (n_houses - n_exclude);
}

parameters {
    matrix[n_days, n_parties] centre_track;
    matrix[n_houses, n_parties] pHouseEffects;
}

transformed parameters {
    matrix[n_houses, n_parties] houseEffects;
    for(p in 1:n_parties) {
        houseEffects[1:n_houses, p] = pHouseEffects[1:n_houses, p] - 
            mean(pHouseEffects[1:n_include, p]);
    }
}

model{
    for (p in 1:n_parties) {
        // -- house effects model
        pHouseEffects[, p] ~ normal(0, 8.0); // weakly informative PRIOR
        
        // -- temporal model - with a discontinuity followed by increased volatility
        centre_track[1, p] ~ normal(center, 15); // weakly informative PRIOR
        centre_track[2:(discontinuity-1), p] ~ 
            normal(centre_track[1:(discontinuity-2), p], sigma);
        centre_track[discontinuity, p] ~ normal(center, 15); // weakly informative PRIOR
        centre_track[(discontinuity+1):stability, p] ~ 
            normal(centre_track[discontinuity:(stability-1), p], sigma_volatile);
        centre_track[(stability+1):n_days, p] ~ 
            normal(centre_track[stability:(n_days-1), p], sigma);

        // -- observational model
        centered_obs_y[p,] ~ normal(houseEffects[house, p] + 
            centre_track[poll_day, p], pseudoSampleSigma);
    }
}

generated quantities {
    matrix[n_days, n_parties]  hidden_vote_share;
    vector [n_days] tpp2010;
    vector [n_days] tpp2013;
    vector [n_days] tpp2016;
    
    for (p in 1:n_parties) {
        hidden_vote_share[,p] = centre_track[,p] - centreing_factors[p];
    }
    
    // aggregated TPP estimates based on past preference flows
    for (d in 1:n_days){
        // note matrix transpose in next three lines
        tpp2010[d] = sum(hidden_vote_share'[,d] .* preference_flows_2010);
        tpp2013[d] = sum(hidden_vote_share'[,d] .* preference_flows_2013);
        tpp2016[d] = sum(hidden_vote_share'[,d] .* preference_flows_2016);
    }
} 
The Python program to run this Stan model follows. 
# PYTHON: analyse primary poll data

import pandas as pd
import numpy as np
import pystan
import pickle

import sys
sys.path.append( '../bin' )
from stan_cache import stan_cache

# --- check version information
print('Python version: {}'.format(sys.version))
print('pystan version: {}'.format(pystan.__version__))

# --- curate the data for the model
# key settings
intermediate_data_dir = "./Intermediate/" # analysis saved here

# preference flows
parties  =              ['L/NP', 'ALP', 'GRN', 'OTH']
preference_flows_2010 = [0.9975, 0.0, 0.2116, 0.5826]
preference_flows_2013 = [0.9975, 0.0, 0.1697, 0.5330]
preference_flows_2016 = [0.9975, 0.0, 0.1806, 0.5075]
n_parties = len(parties)

# polling data
workbook = pd.ExcelFile('./Data/poll-data.xlsx')
df = workbook.parse('Data')

# drop pre-2016 election data
df['MidDate'] = [pd.Period(d, freq='D') for d in df['MidDate']]
df = df[df['MidDate'] > pd.Period('2016-07-04', freq='D')] 

# push One Nation into Other 
df['ONP'] = df['ONP'].fillna(0)
df['OTH'] = df['OTH'] + df['ONP']

# set start date
start = df['MidDate'].min() - 1 # the first date is day 1
df['Day'] = df['MidDate'] - start # day number for each poll
n_days = df['Day'].max() # maximum days 
n_polls = len(df)

# set discontinuity date - Turnbull's last day in office
discontinuity = pd.Period('2018-08-23', freq='D') - start # UPDATE
stability = pd.Period('2018-10-01', freq='D') - start # UPDATE


# manipulate polling data ... 
y = df[parties]
center = 100
centreing_factors = center - y.mean()
y = y + centreing_factors

# add polling house data to the mix
# make sure the "sum to zero" exclusions are 
# last in the list
houses = df['Firm'].unique().tolist()
exclusions = ['YouGov', 'Ipsos']
# Note: we are excluding YouGov and Ipsos 
# from the sum to zero constraint because 
# they have unusual poll results compared 
# with other pollsters
for e in exclusions:
    assert(e in houses)
    houses.remove(e)
houses = houses + exclusions
map = dict(zip(houses, range(1, len(houses)+1)))
df['House'] = df['Firm'].map(map)
n_houses = len(df['House'].unique())
n_exclude = len(exclusions)

# sample metrics
sampleSize = 1000 # treat all polls as being of this size
pseudoSampleSigma = np.sqrt((50 * 50) / sampleSize) 

# --- compile model

# get the STAN model 
with open ("./Models/primary simultaneous model.stan", "r") as f:
    model = f.read()
    f.close()

# encode the STAN model in C++ 
sm = stan_cache(model_code=model)


# --- fit the model to the data
ct_init = np.full([n_days, n_parties], center*1.0)
def initfun():
    return dict(centre_track=ct_init)

chains = 5
iterations = 2000
data = {
        'n_days': n_days,
        'n_polls': n_polls,
        'n_houses': n_houses,
        'n_parties': n_parties,
        'pseudoSampleSigma': pseudoSampleSigma,
        'centreing_factors': centreing_factors,
    
        'centered_obs_y': y.T, 
        'poll_day': df['Day'].values.tolist(),
        'house': df['House'].values.tolist(), 
        'n_exclude': n_exclude,
        'center': center,
        'discontinuity': discontinuity,
        'stability': stability,
        
        # let's set the day-to-day smoothing 
        'sigma': 0.15,
        'sigma_volatile': 0.4,
        
        # preference flows at past elections
        'preference_flows_2010': preference_flows_2010,
        'preference_flows_2013': preference_flows_2013,
        'preference_flows_2016': preference_flows_2016
}
    
fit = sm.sampling(data=data, iter=iterations, chains=chains, 
    init=initfun, control={'max_treedepth':13})
results = fit.extract()

# --- check diagnostics
print('Stan Finished ...')
import pystan.diagnostics as psd
print(psd.check_hmc_diagnostics(fit))

# --- save the analysis
with open(intermediate_data_dir + 'cat-' +
    'output-primary-zero-sum.pkl', 'wb') as f:
    pickle.dump([df,sm,fit,results,data,
        centreing_factors, exclusions], f)
    f.close()

No comments:

Post a Comment