### General overview

The aggregation or data fusion models I use are probably best described as state space models or latent process models. They are also known as hidden Markov models. I model the national voting intention (which cannot be observed directly; it is "hidden") for every day of the period under analysis. The only time the national voting intention is not hidden is at an election. In some models (known as anchored models), we use the election result to anchor the daily estimates.

In the language of modelling, our estimates of the national voting intention for each day being modelled are known as states. These states link together to form a process in which each state depends directly on the previous state, through a probability distribution linking the two. In plain English, the models assume that the national voting intention today is much like it was yesterday.
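The day-to-day linkage can be sketched as a Gaussian random walk. This is a toy simulation (the daily standard deviation of 0.0015 matches the value hard-coded in the Stan program later in this post; everything else here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

# toy simulation: each day's (hidden) voting intention is the previous
# day's value plus a small Gaussian disturbance
n_days = 365
sigma = 0.0015            # day-to-day standard deviation
states = np.empty(n_days)
states[0] = 0.5           # start at 50-50
for day in range(1, n_days):
    states[day] = states[day - 1] + rng.normal(0, sigma)

# with sigma this small, today is always much like yesterday,
# but the series can still drift a long way over a year
```

With a daily standard deviation of 0.15 percentage points, the walk typically drifts by a few percentage points over a year: small steps, substantial cumulative movement.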

The model is informed by irregular and noisy data from the selected polling houses. The challenge for the model is to ignore the noise and find the underlying signal. In effect, the model is solved by finding the day-to-day pathway with the maximum likelihood, given the known poll results.

To improve the robustness of the model, we make provision for the long-run tendency of each polling house to systematically favour either the Coalition or Labor. We call this small tendency to favour one side or the other a "house effect". The model assumes that the results from each pollster diverge (on average) from the real population voting intention by a small, constant number of percentage points. We use the calculated house effect to adjust the raw polling data from each polling house.
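As a stylised example (pollster names, house effects and poll results here are all made up), removing a house effect is just a constant subtraction from each pollster's published figures:

```python
# hypothetical house effects, in percentage points of Coalition TPP;
# positive means the pollster leans to the Coalition on average
house_effects = {'Pollster A': 0.75, 'Pollster B': -0.5}

# hypothetical raw poll results (Coalition TPP, per cent)
raw_polls = [
    ('Pollster A', 48.25),
    ('Pollster B', 47.0),
]

# subtract each house's systematic lean from its raw numbers
adjusted = [(house, tpp - house_effects[house]) for house, tpp in raw_polls]
print(adjusted)  # [('Pollster A', 47.5), ('Pollster B', 47.5)]
```

In this contrived case, two pollsters that appear 1.25 percentage points apart are actually telling the same story once their respective leans are removed.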

In estimating the house effects, we can take one of a number of approaches. We could:

- anchor the model to an election result on a particular day, and use that anchoring to establish the house effects;
- anchor the model to a particular polling house or houses; or
- assume that collectively the polling houses are unbiased, and that collectively their house effects sum to zero.
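The third approach amounts to centring the estimated house effects: subtract the mean so that the effects sum to zero. A minimal sketch with invented numbers, including the partial-core variant used in the Stan code later (where only a core subset of houses is centred):

```python
import numpy as np

# hypothetical raw (unidentified) house-effect estimates, in percentage points
p_house_effects = np.array([1.2, -0.3, 0.6, -0.9])

# impose the sum-to-zero constraint by subtracting the mean
house_effects = p_house_effects - p_house_effects.mean()
# house_effects now sums to zero (up to floating point)

# the Stan programs below subtract the mean of only a "core" subset,
# so the core houses sum to zero while the excluded houses float freely
n_include = 3
core_centred = p_house_effects - p_house_effects[:n_include].mean()
# core_centred[:n_include] sums to zero; the last house does not
```

Either way, the centring resolves an identification problem: without some such constraint, the model could shift every house effect up and the hidden voting intention down by the same amount, with no change in fit.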

The problem with anchoring the model to an election outcome (or to a particular polling house) is that pollsters are constantly reviewing and, from time to time, changing their polling practices. Over time these changes degrade the reliability of anchored models. On the other hand, the sum-to-zero assumption is rarely correct. Nonetheless, at previous elections, those people who used models anchored to the previous election fared worse than those whose models averaged the bias across all polling houses.

Solving a model necessitates integration over a series of complex multidimensional probability distributions. The definite integral is typically impossible to solve algebraically, but it can be estimated with a numerical method based on Markov chains and random numbers, known as Markov chain Monte Carlo (MCMC) integration. I use a free software product called Stan to solve these models.
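The core idea of Monte Carlo integration can be seen in miniature: approximate a definite integral by averaging the integrand over random draws. This toy example is plain Monte Carlo, not the full Markov-chain machinery Stan uses, but the principle of replacing algebra with random sampling is the same:

```python
import numpy as np

rng = np.random.default_rng(42)

# estimate the integral of x**2 over [0, 1] (true value: 1/3)
# by averaging the integrand over uniform random draws
draws = rng.uniform(0.0, 1.0, size=1_000_000)
estimate = (draws ** 2).mean()

print(estimate)  # close to 0.3333
```

Stan's MCMC samplers do something analogous in many dimensions at once: rather than drawing independently, they walk a Markov chain through the posterior distribution so that averages over the chain approximate the integrals we need.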

### Gaussian process model for TPP voting intention

This is the simplest model. It has three parts:

- The *observed data model* (or *measurement model*) assumes three factors explain the difference between published poll results (what we observe or measure) and the national voting intention on a particular day (which, with the exception of elections, is hidden):
    - the margin of error from classical statistics: the random error associated with selecting a sample. However, because I have not collected sample-size information, I have assumed all surveys are of the same size;
    - the systemic biases (house effects) that affect each pollster's published estimate of the population voting intention; and
    - an adjustment for the pollsters that are noisier than their peers, which makes those pollsters less influential on the aggregation.
- The *temporal* part of the model assumes that the actual population voting intention on any day is much the same as it was on the previous day. The model estimates the (hidden) population voting intention for every day under analysis.
- The *house effects* part of the model assumes that the house effects from a core set of pollsters sum to zero. Typically I exclude some pollsters from this core set. The polling data from houses not in the core set affects the shape of the aggregate poll estimate, but not its vertical positioning on the chart.
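The first of those factors can be quantified directly. With every poll treated as a sample of 1,000, the classical standard error for a proportion near 50 per cent is about 1.6 percentage points (this is the same calculation as the `pseudoSampleSigma` in the Python program below):

```python
import math

# classical standard error for a proportion p from a simple random sample;
# the worst case is p = 0.5, which is also the assumption behind
# pseudoSampleSigma in the accompanying Python program
sample_size = 1000
p = 0.5
standard_error = math.sqrt(p * (1 - p) / sample_size)

# conventional 95 per cent margin of error
margin_of_error = 1.96 * standard_error

print(round(standard_error * 100, 2))   # 1.58 percentage points
print(round(margin_of_error * 100, 1))  # 3.1 percentage points
```

This is why a single poll moving by a point or two is rarely news: such movements sit comfortably inside the sampling noise the measurement model expects.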

This model is based on original work by Professor Simon Jackman. It takes advantage of Stan's vectorised operations, and Stan runs the five chains concurrently in under 32 seconds on my machine (a virtual Linux machine on a Windows-based Ryzen 1800X).

```stan
// STAN: Two-Party Preferred (TPP) Vote Intention Model
// - Updated to allow for a discontinuity event (Turnbull -> Morrison)
// - Updated to exclude some houses from the sum to zero constraint
// - Updated to reduce the influence of the noisier polling houses

data {
    // data size
    int<lower=1> n_polls;
    int<lower=1> n_days;
    int<lower=1> n_houses;

    // assumed standard deviation for all polls
    real<lower=0> pseudoSampleSigma;

    // poll data
    vector<lower=0,upper=1>[n_polls] y; // TPP vote share
    int<lower=1> house[n_polls];
    int<lower=1> day[n_polls];
    vector<lower=0>[n_polls] poll_qual_adj; // poll quality adjustment

    // period of discontinuity event
    int<lower=1,upper=n_days> discontinuity;
    int<lower=1,upper=n_days> stability;

    // exclude final n houses from the house
    // effects sum to zero constraint.
    int<lower=0> n_exclude;
}

transformed data {
    // fixed day-to-day standard deviation
    real sigma = 0.0015;
    real sigma_volatile = 0.0045;
    int<lower=1> n_include = (n_houses - n_exclude);
}

parameters {
    vector[n_days] hidden_vote_share;
    vector[n_houses] pHouseEffects;
}

transformed parameters {
    vector[n_houses] houseEffect;
    houseEffect[1:n_houses] = pHouseEffects[1:n_houses] -
        mean(pHouseEffects[1:n_include]);
}

model {
    // -- temporal model [this is the hidden state-space model]
    hidden_vote_share[1] ~ normal(0.5, 0.15); // PRIOR
    hidden_vote_share[discontinuity] ~ normal(0.5, 0.15); // PRIOR
    hidden_vote_share[2:(discontinuity-1)] ~
        normal(hidden_vote_share[1:(discontinuity-2)], sigma);
    hidden_vote_share[(discontinuity+1):stability] ~
        normal(hidden_vote_share[discontinuity:(stability-1)], sigma_volatile);
    hidden_vote_share[(stability+1):n_days] ~
        normal(hidden_vote_share[stability:(n_days-1)], sigma);

    // -- house effects model
    pHouseEffects ~ normal(0, 0.08); // PRIOR

    // -- observed data / measurement model
    y ~ normal(houseEffect[house] + hidden_vote_share[day],
               pseudoSampleSigma + poll_qual_adj);
}
```

The supporting Python code for running this model is as follows. Note: I have a further Python program for generating the charts from the saved analysis.

```python
# PYTHON: analyse TPP poll data
import pandas as pd
import numpy as np
import pystan
import pickle
import sys
sys.path.append('../bin')
from stan_cache import stan_cache

# --- version information
print('Python version: {}'.format(sys.version))
print('pystan version: {}'.format(pystan.__version__))

# --- key inputs to model
sampleSize = 1000  # treat all polls as being of this size
pseudoSampleSigma = np.sqrt((0.5 * 0.5) / sampleSize)
chains = 5
iterations = 2000  # note: half of the iterations will be warm-up

# --- collect the model data
# the XL data file was extracted from the Wikipedia
# page on the next Australian Federal Election
workbook = pd.ExcelFile('./Data/poll-data.xlsx')
df = workbook.parse('Data')

# drop pre-2016 election data
df['MidDate'] = [pd.Period(date, freq='D') for date in df['MidDate']]
df = df[df['MidDate'] > pd.Period('2016-07-04', freq='D')]

# convert dates to days from start
start = df['MidDate'].min() - 1  # day zero
df['Day'] = df['MidDate'] - start  # day number for each poll
n_days = df['Day'].max()
n_polls = len(df)

# update for discontinuity - chosen as Turnbull's last day in office,
# with six weeks of higher day-to-day variability following
discontinuity = pd.Period('2018-08-23', freq='D') - start  # UPDATE
stability = pd.Period('2018-10-01', freq='D') - start  # UPDATE

# treat later Newspoll as a separate series
# [because Newspoll changed its preference allocation methodology]
df['Firm'] = df['Firm'].where(
    (df['MidDate'] < pd.Period('2017-12-01', freq='D')) |
    (df['Firm'] != 'Newspoll'), other='Newspoll2')

# add polling house data to the mix;
# make sure the sum-to-zero exclusions are last in the list
# (excluded from sum-to-zero on the basis of being outliers to the other houses)
houses = df['Firm'].unique().tolist()
exclusions = ['Roy Morgan', 'YouGov']
for e in exclusions:
    assert e in houses
    houses.remove(e)
houses = houses + exclusions
house_map = dict(zip(houses, range(1, len(houses) + 1)))
df['House'] = df['Firm'].map(house_map)
n_houses = len(df['House'].unique())
n_exclude = len(exclusions)

# quality adjustment: extra standard deviation for the noisier houses
# (the Stan model expects this vector; 0.02 - two percentage points on
# the proportion scale - is an assumed value)
df['poll_qual_adj'] = pd.Series(0.02, index=df.index).where(
    df['Firm'].str.contains('Ipsos|YouGov|Roy Morgan'), other=0.0)

# batch up
data = {
    'n_days': n_days,
    'n_polls': n_polls,
    'n_houses': n_houses,
    'pseudoSampleSigma': pseudoSampleSigma,
    'y': (df['TPP L/NP'] / 100.0).values,
    'day': df['Day'].astype(int).values,
    'house': df['House'].astype(int).values,
    'poll_qual_adj': df['poll_qual_adj'].values,
    'discontinuity': discontinuity,
    'stability': stability,
    'n_exclude': n_exclude
}

# --- get the Stan model
with open('./Models/TPP model.stan', 'r') as f:
    model = f.read()

# --- compile/retrieve model and run samples
sm = stan_cache(model_code=model)
fit = sm.sampling(data=data, iter=iterations, chains=chains,
                  control={'max_treedepth': 12})
results = fit.extract()

# --- check diagnostics
print(fit.stansummary())
import pystan.diagnostics as psd
print(psd.check_hmc_diagnostics(fit))

# --- save analysis
intermediate_data_dir = './Intermediate/'
with open(intermediate_data_dir + 'output-TPP-zero-sum.pkl', 'wb') as f:
    pickle.dump([results, df, data, exclusions], f)
```

### Gaussian process model for primary voting intention

The primary vote model I usually run is very simple. It is based on four independent Gaussian processes, one for each primary vote series: Coalition, Labor, Greens and Others. In this regard it is very similar to the TPP model above. The model has few internal constraints and it runs reasonably fast (in about three and a half minutes). However, the parameters are sampled independently, and they only sum to 100 per cent in terms of the mean/median for each parameter. Given the speed (compare the Dirichlet-multinomial model below), this was a worthwhile compromise.

The Stan program includes a generated quantities code block in which we convert primary vote intentions to an estimated TPP vote share, based on preference flows at previous elections.
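Outside Stan, that conversion is simply a dot product of the primary vote shares with the preference-flow proportions. A sketch using the 2016 flow figures from the Python program below (the primary vote shares here are invented for illustration):

```python
import numpy as np

# preference flows to the Coalition at the 2016 election, by party
# (L/NP, ALP, GRN, OTH) - figures from the accompanying Python program
preference_flows_2016 = np.array([0.9975, 0.0, 0.1806, 0.5075])

# hypothetical primary vote shares (proportions summing to 1)
primaries = np.array([0.38, 0.36, 0.10, 0.16])

# estimated Coalition TPP: primary shares weighted by the proportion of
# each party's vote that ends up with the Coalition after preferences
tpp = float(np.dot(primaries, preference_flows_2016))
print(round(tpp * 100, 1))  # 47.8 per cent
```

The generated quantities block does exactly this for every day and for each of the 2010, 2013 and 2016 flow patterns, which gives a sense of how sensitive the TPP estimate is to the assumed preference flows.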

```stan
// STAN: Primary Vote Intention Model
// Essentially a set of independent Gaussian processes from day-to-day
// for each party's primary vote, centered around a mean of 100

data {
    // data size
    int<lower=1> n_polls;
    int<lower=1> n_days;
    int<lower=1> n_houses;
    int<lower=1> n_parties;
    real<lower=0> pseudoSampleSigma;

    // centreing factors
    real<lower=0> center;
    real centreing_factors[n_parties];

    // poll data
    real<lower=0> centered_obs_y[n_parties, n_polls]; // poll data
    int<lower=1,upper=n_houses> house[n_polls]; // polling house
    int<lower=1,upper=n_days> poll_day[n_polls]; // day on which polling occurred
    vector<lower=0>[n_polls] poll_qual_adj; // poll quality adjustment

    // exclude final n houses from the sum-to-zero constraint for houseEffects
    int<lower=0> n_exclude;

    // period of discontinuity and subsequent increased volatility event
    int<lower=1,upper=n_days> discontinuity; // start with a discontinuity
    int<lower=1,upper=n_days> stability;     // end - stability restored

    // day-to-day change
    real<lower=0> sigma;
    real<lower=0> sigma_volatile;

    // TPP preference flows
    vector<lower=0,upper=1>[n_parties] preference_flows_2010;
    vector<lower=0,upper=1>[n_parties] preference_flows_2013;
    vector<lower=0,upper=1>[n_parties] preference_flows_2016;
}

transformed data {
    int<lower=1> n_include = (n_houses - n_exclude);
}

parameters {
    matrix[n_days, n_parties] centre_track;
    matrix[n_houses, n_parties] pHouseEffects;
}

transformed parameters {
    matrix[n_houses, n_parties] houseEffects;
    for (p in 1:n_parties) {
        houseEffects[1:n_houses, p] = pHouseEffects[1:n_houses, p] -
            mean(pHouseEffects[1:n_include, p]);
    }
}

model {
    for (p in 1:n_parties) {
        // -- house effects model
        pHouseEffects[, p] ~ normal(0, 8.0); // weakly informative PRIOR

        // -- temporal model - with a discontinuity followed by increased volatility
        centre_track[1, p] ~ normal(center, 15); // weakly informative PRIOR
        centre_track[2:(discontinuity-1), p] ~
            normal(centre_track[1:(discontinuity-2), p], sigma);
        centre_track[discontinuity, p] ~ normal(center, 15); // weakly informative PRIOR
        centre_track[(discontinuity+1):stability, p] ~
            normal(centre_track[discontinuity:(stability-1), p], sigma_volatile);
        centre_track[(stability+1):n_days, p] ~
            normal(centre_track[stability:(n_days-1), p], sigma);

        // -- observational model
        centered_obs_y[p,] ~ normal(houseEffects[house, p] + centre_track[poll_day, p],
                                    pseudoSampleSigma + poll_qual_adj);
    }
}

generated quantities {
    matrix[n_days, n_parties] hidden_vote_share;
    vector[n_days] tpp2010;
    vector[n_days] tpp2013;
    vector[n_days] tpp2016;

    for (p in 1:n_parties) {
        hidden_vote_share[, p] = centre_track[, p] - centreing_factors[p];
    }

    // aggregated TPP estimates based on past preference flows
    for (d in 1:n_days) {
        // note matrix transpose in next three lines
        tpp2010[d] = sum(hidden_vote_share'[, d] .* preference_flows_2010);
        tpp2013[d] = sum(hidden_vote_share'[, d] .* preference_flows_2013);
        tpp2016[d] = sum(hidden_vote_share'[, d] .* preference_flows_2016);
    }
}
```

The key thing to note above is that the day-to-day pathway for each party's primary vote share has been centred on 100. This ensures the analysis occurs well away from a parameter constraint. For example, if we did the analysis on the simplex (between 0 and 1), the Greens vote of typically 0.1 would sit very close to the boundary at 0. Analysis close to a boundary can give Stan indigestion - typically in the form of constraint diagnostics, BFMI/energy diagnostics, Rhat diagnostics, and the bane of every Stan programmer: divergences.

The Python program to run this Stan model follows.

```python
# PYTHON: analyse primary poll data
import pandas as pd
import numpy as np
import pystan
import pickle
import sys
sys.path.append('../bin')
from stan_cache import stan_cache

# --- check version information
print('Python version: {}'.format(sys.version))
print('pystan version: {}'.format(pystan.__version__))

# --- curate the data for the model
# key settings
intermediate_data_dir = './Intermediate/'  # analysis saved here

# preference flows
parties = ['L/NP', 'ALP', 'GRN', 'OTH']
preference_flows_2010 = [0.9975, 0.0, 0.2116, 0.5826]
preference_flows_2013 = [0.9975, 0.0, 0.1697, 0.5330]
preference_flows_2016 = [0.9975, 0.0, 0.1806, 0.5075]
n_parties = len(parties)

# polling data
workbook = pd.ExcelFile('./Data/poll-data.xlsx')
df = workbook.parse('Data')

# drop pre-2016 election data
df['MidDate'] = [pd.Period(d, freq='D') for d in df['MidDate']]
df = df[df['MidDate'] > pd.Period('2016-07-04', freq='D')]

# push One Nation into Other
df['ONP'] = df['ONP'].fillna(0)
df['OTH'] = df['OTH'] + df['ONP']

# set start date
start = df['MidDate'].min() - 1  # the first date is day 1
df['Day'] = df['MidDate'] - start  # day number for each poll
n_days = df['Day'].max()  # maximum days
n_polls = len(df)

# set discontinuity date - chosen as Turnbull's last day in office,
# with stability restored some weeks later
discontinuity = pd.Period('2018-08-23', freq='D') - start  # UPDATE
stability = pd.Period('2018-10-01', freq='D') - start  # UPDATE

# centre the polling data on 100
y = df[parties]
center = 100
centreing_factors = center - y.mean()
y = y + centreing_factors
print('centreing_factors: ', centreing_factors.values)

# add polling house data to the mix;
# make sure the "sum to zero" exclusions are last in the list
houses = df['Firm'].unique().tolist()
exclusions = ['YouGov', 'Ipsos']
# note: we are excluding YouGov and Ipsos from the sum-to-zero constraint
# because they have unusual poll results compared with other pollsters
for e in exclusions:
    assert e in houses
    houses.remove(e)
houses = houses + exclusions
house_map = dict(zip(houses, range(1, len(houses) + 1)))
df['House'] = df['Firm'].map(house_map)
n_houses = len(df['House'].unique())
n_exclude = len(exclusions)

# quality adjustment for polls
df['poll_qual_adj'] = pd.Series(2.0, index=df.index).where(
    df['Firm'].str.contains('Ipsos|YouGov|Roy Morgan'), other=0.0)

# sample metrics
sampleSize = 1000  # treat all polls as being of this size
pseudoSampleSigma = np.sqrt((50 * 50) / sampleSize)

# --- compile model
# get the Stan model
with open('./Models/primary simultaneous model.stan', 'r') as f:
    model = f.read()

# encode the Stan model in C++
sm = stan_cache(model_code=model)

# --- fit the model to the data
ct_init = np.full([n_days, n_parties], center * 1.0)
def initfun():
    return dict(centre_track=ct_init)

chains = 5
iterations = 2000
data = {
    'n_days': n_days,
    'n_polls': n_polls,
    'n_houses': n_houses,
    'n_parties': n_parties,
    'pseudoSampleSigma': pseudoSampleSigma,
    'centreing_factors': centreing_factors,
    'centered_obs_y': y.T,
    'poll_day': df['Day'].values.tolist(),
    'house': df['House'].values.tolist(),
    'poll_qual_adj': df['poll_qual_adj'].values,
    'n_exclude': n_exclude,
    'center': center,
    'discontinuity': discontinuity,
    'stability': stability,

    # let's set the day-to-day smoothing
    'sigma': 0.15,
    'sigma_volatile': 0.4,

    # preference flows at past elections
    'preference_flows_2010': preference_flows_2010,
    'preference_flows_2013': preference_flows_2013,
    'preference_flows_2016': preference_flows_2016
}

fit = sm.sampling(data=data, iter=iterations, chains=chains, init=initfun,
                  control={'max_treedepth': 13})
results = fit.extract()

# --- check diagnostics
print('Stan Finished ...')
# import pystan.diagnostics as psd
# print(psd.check_hmc_diagnostics(fit))

# --- save the analysis
with open(intermediate_data_dir + 'cat-' + 'output-primary-zero-sum.pkl', 'wb') as f:
    pickle.dump([df, sm, fit, results, data, centreing_factors, exclusions], f)
```

### Dirichlet-multinomial process model for primary vote share aggregation

In addition to the Gaussian model above, I also maintain a Dirichlet-multinomial process model for primary vote share. This model ensures every sample sums to a vote share of 100 per cent. Because it is a more integrated model, the house effects are more tightly constrained: they sum to zero across houses for each party, and across parties for each house. Without this two-way constraint, there would be too many degrees of freedom in the model, and it would yield different results every time it was run. The double constraint also means that, unlike the above two models, which are centred against a core set of pollsters, the Dirichlet-multinomial process model is centred against all pollsters. As a consequence, its results differ a little from those models.
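The two-way constraint is equivalent to double-centring the house-effect matrix: subtract the column means, then the row means. A sketch with arbitrary random numbers (after both steps, every row *and* every column sums to approximately zero, mirroring the `aHouseAdjustment`/`tHouseAdjustment` steps in the Stan code below):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical raw house effects: houses in rows, parties in columns
raw = rng.normal(0, 1, size=(5, 4))

# centre each party's effects across houses (columns now sum to zero) ...
step1 = raw - raw.mean(axis=0)

# ... then centre each house's effects across parties (rows now sum to
# zero; column sums remain zero, because the grand mean is already zero)
step2 = step1 - step1.mean(axis=1, keepdims=True)
```

A useful detail: the second centring does not undo the first. Once the grand mean of the matrix is zero, subtracting row means leaves every column sum unchanged at zero.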

While the Dirichlet-multinomial process model runs without raising diagnostic warnings, it takes about an hour and a half to run. For this reason, it is not a model I use often.

```stan
// STAN: Primary Vote Intention Model
// using a Dirichlet-multinomial process

data {
    // data size
    int<lower=1> n_polls;
    int<lower=1> n_days;
    int<lower=1> n_houses;
    int<lower=1> n_parties;
    int<lower=1,upper=n_days> discontinuity;
    int<lower=1,upper=n_days> stability;

    // key variables
    int<lower=1> pseudoSampleSize; // maximum sample size for y
    real<lower=1> transmissionStrength;
    real<lower=1> transmissionStrengthPostDiscontinuity;

    // give a rough idea of a starting point ...
    simplex[n_parties] startingPoint; // rough guess at series starting point
    int<lower=1> startingPointCertainty; // strength of guess - small number is vague

    // poll data
    int<lower=0,upper=pseudoSampleSize> y[n_polls, n_parties]; // a multinomial
    int<lower=1,upper=n_houses> house[n_polls]; // polling house
    int<lower=1,upper=n_days> poll_day[n_polls]; // day polling occurred

    // TPP preference flows
    vector<lower=0,upper=1>[n_parties] preference_flows_2010;
    vector<lower=0,upper=1>[n_parties] preference_flows_2013;
    vector<lower=0,upper=1>[n_parties] preference_flows_2016;
}

parameters {
    simplex[n_parties] hidden_voting_intention[n_days];
    matrix[n_houses, n_parties] houseAdjustment;
}

transformed parameters {
    matrix[n_houses, n_parties] aHouseAdjustment;
    matrix[n_houses, n_parties] tHouseAdjustment;
    for (p in 1:n_parties) // for each party, effects sum to zero across houses
        aHouseAdjustment[, p] = houseAdjustment[, p] - mean(houseAdjustment[, p]);
    for (h in 1:n_houses)  // for each house, effects sum to zero across parties
        tHouseAdjustment[h, ] = aHouseAdjustment[h, ] - mean(aHouseAdjustment[h, ]);
}

model {
    // -- house effects model
    for (h in 1:n_houses)
        houseAdjustment[h] ~ normal(0, 0.05);

    // -- temporal model
    hidden_voting_intention[1] ~ dirichlet(startingPoint * startingPointCertainty);
    hidden_voting_intention[discontinuity] ~
        dirichlet(startingPoint * startingPointCertainty);
    for (day in 2:(discontinuity-1))
        hidden_voting_intention[day] ~
            dirichlet(hidden_voting_intention[day-1] * transmissionStrength);
    for (day in (discontinuity+1):stability)
        hidden_voting_intention[day] ~
            dirichlet(hidden_voting_intention[day-1] *
                transmissionStrengthPostDiscontinuity);
    for (day in (stability+1):n_days)
        hidden_voting_intention[day] ~
            dirichlet(hidden_voting_intention[day-1] * transmissionStrength);

    // -- observed data model
    for (poll in 1:n_polls)
        y[poll] ~ multinomial(hidden_voting_intention[poll_day[poll]] +
            tHouseAdjustment'[, house[poll]]);
}

generated quantities {
    // aggregated TPP estimates based on past preference flows
    vector[n_days] tpp2010;
    vector[n_days] tpp2013;
    vector[n_days] tpp2016;
    for (d in 1:n_days) {
        tpp2010[d] = sum(hidden_voting_intention[d] .* preference_flows_2010);
        tpp2013[d] = sum(hidden_voting_intention[d] .* preference_flows_2013);
        tpp2016[d] = sum(hidden_voting_intention[d] .* preference_flows_2016);
    }
}
```

The supporting Python program follows.

```python
# PYTHON: analyse primary poll data (Dirichlet-multinomial model)
import pandas as pd
import numpy as np
import pystan
import pickle
import sys
sys.path.append('../bin')
from stan_cache import stan_cache

# --- check version information
print('Python version: {}'.format(sys.version))
print('pystan version: {}'.format(pystan.__version__))

# --- curate the data for the model
intermediate_data_dir = './Intermediate/'  # analysis saved here

# preference flows
parties = ['L/NP', 'ALP', 'GRN', 'OTH']
preference_flows_2010 = [0.9975, 0.0, 0.2116, 0.5826]
preference_flows_2013 = [0.9975, 0.0, 0.1697, 0.5330]
preference_flows_2016 = [0.9975, 0.0, 0.1806, 0.5075]

# polling data
workbook = pd.ExcelFile('./Data/poll-data.xlsx')
df = workbook.parse('Data')

# drop pre-2016 election data
df['MidDate'] = [pd.Period(d, freq='D') for d in df['MidDate']]
df = df[df['MidDate'] > pd.Period('2016-07-04', freq='D')]

# push One Nation into Other
df['ONP'] = df['ONP'].fillna(0)
df['OTH'] = df['OTH'] + df['ONP']

# set start date
start = df['MidDate'].min() - 1  # the first date is day 1
df['Day'] = df['MidDate'] - start  # day number for each poll
n_days = df['Day'].max()  # maximum days
n_polls = len(df)

# set discontinuity date - chosen as Turnbull's last day in office
discontinuity = pd.Period('2018-08-23', freq='D') - start  # UPDATE
stability = pd.Period('2018-10-01', freq='D') - start  # UPDATE

# convert poll results to multinomials
pseudoSampleSize = 1000
y = df[parties]
y = y.div(y.sum(axis=1), axis=0)  # normalise rows
y = y.mul(pseudoSampleSize).round(0).astype(int)
edit = pseudoSampleSize - y.sum(axis=1)
y[y.columns[-1]] += edit  # ensure row-sum integrity
# down-weight the noisier houses by shrinking their pseudo sample size
y = y.div(10.0).astype(int).where(
    df['Firm'].str.contains('Ipsos|YouGov|Roy Morgan'), other=y)
n_parties = len(y.columns)

# add polling house data to the mix;
# make sure the sum-to-zero exclusions are last in the list
# (excluded from sum-to-zero on the basis of being outliers to the other houses)
houses = df['Firm'].unique().tolist()
exclusions = ['Roy Morgan', 'YouGov']
for e in exclusions:
    assert e in houses
    houses.remove(e)
houses = houses + exclusions
house_map = dict(zip(houses, range(1, len(houses) + 1)))
df['House'] = df['Firm'].map(house_map)
n_houses = len(df['House'].unique())

# starting points
startingPoint = np.array([0.4, 0.4, 0.1, 0.1])
startingPointCertainty = 10

# batch up into a dictionary
data = {
    'n_days': n_days,
    'n_polls': n_polls,
    'n_houses': n_houses,
    'n_parties': n_parties,
    'pseudoSampleSize': pseudoSampleSize,
    'startingPoint': startingPoint,
    'startingPointCertainty': startingPointCertainty,
    'transmissionStrength': 80_000,
    'transmissionStrengthPostDiscontinuity': 40_000,
    'y': y.astype(int).values,
    'poll_day': df['Day'].astype(int).values,
    'house': df['House'].astype(int).values,
    'preference_flows_2010': preference_flows_2010,
    'preference_flows_2013': preference_flows_2013,
    'preference_flows_2016': preference_flows_2016,
    'discontinuity': discontinuity,
    'stability': stability
}

# --- compile model
# get the Stan model
with open('./Models/primary dirichlet model.stan', 'r') as f:
    model = f.read()

# encode the Stan model in C++
sm = stan_cache(model_code=model, model_name='dirichlet')

# --- fit the model to the data
ha_init = np.zeros([n_houses, n_parties])
hvi_init = np.full([n_days, n_parties], 0.25)
def initfun():
    return dict(houseAdjustment=ha_init,
                hidden_voting_intention=hvi_init)

chains = 5
iterations = 2000
fit = sm.sampling(data=data, iter=iterations, chains=chains, init=initfun,
                  control={'max_treedepth': 13})
results = fit.extract()

# --- save the analysis
with open(intermediate_data_dir + 'output-primary-dirichlet.pkl', 'wb') as f:
    pickle.dump([results, df, data, parties], f)
```