$$TPP_p - Coalition_p = \beta_1 Greens_p + \beta_2 OneNation_p + \beta_3 Other_p + \epsilon_p$$
Which, in matrix notation, we will simplify as:
$$ y = X \beta + \epsilon $$
In this simplification, \(y\) is a column vector of (TPP minus Coalition primary) vote estimates from a pollster. \(X\) is the regression design matrix, with \(k\) columns (one for each minor party's primary vote) and \(N\) rows (one for each reported poll result). \(\beta\) is a column vector of party coefficients we are seeking to estimate through the regression. And \(\epsilon\) is a column vector of error terms, which we assume are independent and identically distributed (iid) with a mean of \(0\). Through the magic of matrix algebra, we can write the sum of the squared errors as:
$$ \sum_{i=1}^n\epsilon_i^2 = \epsilon'\epsilon = (y-X\beta)'(y-X\beta) $$
$$= y'y - \beta'X'y - y'X\beta + \beta'X'X\beta $$
Since \(y'X\beta\) is a scalar, it equals its own transpose \(\beta'X'y\), so the two middle terms combine:
$$= y'y - 2\beta'X'y + \beta'X'X\beta $$
From this last equation, we can use calculus to find the \(\beta\) that minimizes the sum of the squared errors:
$$\frac{\partial \epsilon'\epsilon}{\partial\beta} = -2X'y+2X'X\beta = 0$$
Which can be re-arranged to the famous "ex prime ex inverse ex prime why":
$$ \beta = (X'X)^{-1}X'y $$
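As a quick illustration, here is a minimal numpy sketch of the normal equations, using made-up numbers rather than the poll data (in practice a least-squares routine, or statsmodels as used below, is numerically safer than forming the inverse explicitly):

# Minimal sketch of beta = (X'X)^-1 X'y with made-up data (not the poll data)
import numpy as np

np.random.seed(0)
N, k = 50, 3                                # 50 hypothetical polls, 3 minor parties
X = np.random.uniform(0.05, 0.15, (N, k))   # hypothetical primary vote proportions
true_beta = np.array([0.2, 0.5, 0.5])       # hypothetical preference flows
y = X @ true_beta + np.random.normal(0, 0.005, N)

# the normal equations
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)                             # recovers something close to true_beta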
Before I get to the results, there are a few caveats to go through. First, not all of the primary vote poll data sums to 100 per cent. In the data I looked at, the following polls were affected.
--- Does not add to 100% ---
      L/NP   ALP   GRN   ONP   OTH    Sum       Firm            Date
12    35.0  38.0  10.0   7.0   9.0   99.0  Essential     12 Dec 2017
18    31.0  34.0  11.0  11.0  14.0  101.0     YouGov     14 Nov 2017
19    36.0  38.0   9.0   8.0  10.0  101.0  Essential     14 Nov 2017
21    36.0  37.0  10.0   7.0   9.0   99.0  Essential     30 Oct 2017
25    36.0  38.0  10.0   7.0  10.0  101.0  Essential      4 Oct 2017
32    35.0  34.0  14.0   1.0  15.0   99.0      Ipsos    6-9 Sep 2017
63    36.0  37.0  10.0   8.0  10.0  101.0  Essential  13-16 Apr 2017
67    35.0  37.0  10.0   8.0  11.0  101.0  Essential  24-27 Mar 2017
69    34.0  37.0   9.0  10.0   9.0   99.0  Essential  17-20 Mar 2017
75    36.0  35.0   9.0  10.0   9.0   99.0  Essential   9-12 Feb 2017
77    35.0  37.0  10.0   9.0   8.0   99.0  Essential  20-23 Jan 2017
80    37.0  37.0   9.0   7.0   9.0   99.0  Essential   9-12 Dec 2016
83    36.0  30.0  16.0   7.0   9.0   98.0      Ipsos  24-26 Nov 2016
88    37.0  37.0  11.0   5.0   9.0   99.0  Essential  14-17 Oct 2016
92    38.0  37.0  10.0   5.0  11.0  101.0  Essential   9-12 Sep 2016
109   34.0  35.0  11.0   8.0  13.0  101.0     YouGov   7-10 Dec 2017
110   32.0  32.0  10.0  11.0  16.0  101.0     YouGov  23-27 Nov 2017
----------------------------
To manage this, I normalised each poll's primary vote results so that they summed to 100 per cent; for the regression itself these shares were then expressed as proportions (between 0 and 1).
The second thing I did was limit the analysis to those pollsters with more than 10 polls since the last election. In practice, this meant Newspoll and Essential.
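In pandas, these two curation steps look roughly like the following sketch (a simplified version of the full program at the end of this post; column names are those of my poll-data spreadsheet):

# Sketch of the data curation: flag polls that don't sum to 100 per cent,
# normalise the primary votes, and keep pollsters with more than 10 polls.
import pandas as pd

df = pd.ExcelFile('./Data/poll-data.xlsx').parse('Data')
cols = ['L/NP', 'ALP', 'GRN', 'ONP', 'OTH']

# flag polls whose primary votes do not sum to (roughly) 100 per cent
sums = df[cols].sum(axis=1)
print(df[(sums < 99.5) | (sums > 100.5)][cols + ['Firm', 'Date']])

# normalise each poll's primary votes so they sum to 100 per cent
df[cols] = df[cols].div(sums, axis=0) * 100.0

# keep only pollsters with more than 10 polls in the data
counts = df['Firm'].value_counts()
df = df[df['Firm'].isin(counts[counts > 10].index)]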
Let's look at the multiple regression results. The key result is the block in the middle, with the three parties in the left-hand column: GRN, ONP and OTH (Greens, One Nation and Others). The coef column gives the best linear unbiased estimate of the preference flow to the Coalition from each of these parties. The 95 per cent confidence intervals are in the two far-right columns.
---- Essential ----
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                 1.173e+04
Date:                Sat, 17 Mar 2018   Prob (F-statistic):           2.09e-61
Time:                        13:12:35   Log-Likelihood:                 190.20
No. Observations:                  45   AIC:                            -374.4
Df Residuals:                      42   BIC:                            -369.0
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
GRN            0.2236      0.054      4.162      0.000       0.115       0.332
ONP            0.4396      0.037     11.775      0.000       0.364       0.515
OTH            0.5091      0.050     10.190      0.000       0.408       0.610
==============================================================================
Omnibus:                        3.681   Durbin-Watson:                   1.938
Prob(Omnibus):                  0.159   Jarque-Bera (JB):                1.691
Skew:                          -0.020   Prob(JB):                        0.429
Kurtosis:                       2.051   Cond. No.                         20.0
==============================================================================
---- Newspoll ----
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                     7414.
Date:                Sat, 17 Mar 2018   Prob (F-statistic):           1.81e-34
Time:                        13:12:38   Log-Likelihood:                 110.81
No. Observations:                  26   AIC:                            -215.6
Df Residuals:                      23   BIC:                            -211.9
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
GRN            0.2581      0.090      2.881      0.008       0.073       0.443
ONP            0.4976      0.035     14.202      0.000       0.425       0.570
OTH            0.4424      0.084      5.237      0.000       0.268       0.617
==============================================================================
Omnibus:                        2.886   Durbin-Watson:                   0.753
Prob(Omnibus):                  0.236   Jarque-Bera (JB):                1.437
Skew:                          -0.212   Prob(JB):                        0.488
Kurtosis:                       1.929   Cond. No.                         26.8
==============================================================================
From the Ordinary Least Squares (OLS) multiple regression analysis, our best estimate is that Essential flows 22 per cent of the Green vote to the Coalition, 44 per cent of the One Nation vote, and 51 per cent of the Other vote. In comparison, Newspoll flows 26 per cent of the Green vote to the Coalition, 50 per cent of the One Nation vote, and 44 per cent of the Other vote.
While we get an estimate of preference flows to the Coalition for each of the three parties from both pollsters, it is worth noting that small sample sizes have resulted in quite wide confidence intervals for these estimates.
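To make the coefficients concrete, here is a small worked example that applies Essential's estimated flows to a hypothetical poll (the primary vote numbers are made up):

# Hypothetical example: use Essential's OLS flow estimates (from the table above)
# to turn a made-up primary vote into a two-party-preferred estimate for the Coalition.
primary = {'L/NP': 36.0, 'ALP': 36.0, 'GRN': 10.0, 'ONP': 8.0, 'OTH': 10.0}  # made up
flows = {'GRN': 0.22, 'ONP': 0.44, 'OTH': 0.51}                              # OLS estimates

tpp_coalition = primary['L/NP'] + sum(flows[p] * primary[p] for p in flows)
print(round(tpp_coalition, 1))   # 36 + 2.2 + 3.52 + 5.1 = 46.8 per cent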
We can also do a Bayesian multiple linear regression. The Stan model I used for this follows.
// STAN: multiple regression - no intercept - positive coefficients
data {
    // data size
    int<lower=1> k;                    // number of coefficients (one per party)
    int<lower=k+1> N;                  // number of polls
    vector<lower=0,upper=1>[N] y;      // response vector
    matrix<lower=0,upper=1>[N, k] X;   // design matrix
}
parameters {
    vector<lower=0>[k] beta;           // positive regression coefficients
    real<lower=0> sigma;               // standard deviation on iid error term
}
model {
    beta ~ normal(0, 0.5);             // half normal prior
    sigma ~ cauchy(0, 0.01);           // half cauchy prior
    y ~ normal(X * beta, sigma);       // regression model
}
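For reference, this is roughly how the model is compiled and sampled through pystan's 2.x interface (a sketch with made-up data; the full program below builds y and X from the polls and uses a small stan_cache helper to avoid recompiling the model):

# Sketch: compile and sample the Stan model above with pystan (2.x interface)
import numpy as np
import pystan

with open('./Models/preference flows.stan') as f:
    model_code = f.read()

# made-up proportions standing in for the real y and X built in the full program
np.random.seed(1)
X = np.random.uniform(0.05, 0.15, (40, 3))    # GRN, ONP, OTH primary proportions
y = X @ np.array([0.25, 0.45, 0.5]) + np.random.normal(0, 0.005, 40)

data = {'y': y, 'X': X, 'N': len(y), 'k': X.shape[1]}

model = pystan.StanModel(model_code=model_code)
fit = model.sampling(data=data, iter=10000, chains=5)
beta_samples = fit.extract()['beta']          # posterior draws, one column per party
print(np.percentile(beta_samples, [2.5, 50, 97.5], axis=0))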
The results I got were very similar to the standard OLS results. In this case I have reported the 95 per cent credible intervals, the Bayesian equivalent of the confidence interval. Again, the One Nation and Other estimates differ between the two pollsters, with their ordering swapped: Essential sends a larger share of the Other vote than the One Nation vote to the Coalition, while Newspoll does the reverse.
From Stan: for Essential
         2.5%    median     97.5%
GRN  0.121745  0.228054  0.335852
ONP  0.362631  0.438261  0.513884
OTH  0.404663  0.506283  0.607355
From Stan: for Newspoll
         2.5%    median     97.5%
GRN  0.089876  0.267659  0.450531
ONP  0.420720  0.494935  0.566663
OTH  0.259106  0.433838  0.601585
From the Bayesian analysis, our best guess is that Essential flows 23 per cent of the Green vote to the Coalition, 44 per cent of the One Nation vote, and 51 per cent of the Other vote. In comparison, Newspoll flows 27 per cent of the Green vote to the Coalition, 49 per cent of the One Nation vote, and 43 per cent of the Other vote.
These Bayesian results can be charted as probability densities as follows. In these charts the median sample for each distribution is highlighted with a thin vertical line.
For completeness, the supporting Python program that generated this analysis follows.
# PYTHON - estimates of preference flows from polling data
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

import sys
sys.path.append('../bin')
from stan_cache import stan_cache

# --- chart results
graph_dir = './Graphs/'
walk_leader = 'STAN-PREFERENCE-FLOWS-'
plt.style.use('../bin/markgraph.mplstyle')

# --- curate data for analysis
workbook = pd.ExcelFile('./Data/poll-data.xlsx')
df = workbook.parse('Data')

# drop polls without one nation
df = df[df['ONP'].notnull()]

# drop pre-2016 election data
df['MidDate'] = [pd.Period(d, freq='D') for d in df['MidDate']]
df = df[df['MidDate'] > pd.Period('2016-07-04', freq='D')]

# normalise the data - still in 0 to 100 range
parties = ['GRN', 'ONP', 'OTH']
all = ['L/NP', 'ALP'] + parties
df['Sum'] = df[all].sum(axis=1)
bad = df[all + ['Sum', 'Firm', 'Date']]
bad = bad[(bad['Sum'] < 99.5) | (bad['Sum'] > 100.5)]
print('--- Does not add to 100% ---')
print(bad)
print('----------------------------')
df[all] = df[all].div(df[all].sum(axis=1), axis=0) * 100.0

# --- Analyse the curated data
firms = df['Firm'].unique()
for firm in firms:
    cases = df[df['Firm'] == firm]
    if len(cases) <= 10:
        continue  # not enough to analyse

    # --- classic OLS multiple regression
    # get response vector and design matrix in 0 to 1 range
    y = (cases['TPP L/NP'] - cases['L/NP']) / 100.0
    X = cases[parties] / 100.0

    # regression estimation
    model = sm.OLS(y, X).fit()
    print('\n\n---- {} ----'.format(firm))

    # Print out the statistics
    print(model.summary())

    # --- let's do the same thing with Stan
    # input data
    data = {
        'y': y,
        'X': X,
        'N': len(y),
        'k': len(X.columns)
    }

    # helpers
    quants = [2.5, 50, 97.5]
    labels = ['2.5%', 'median', '97.5%']

    with open("./Models/preference flows.stan", "r") as file:
        model_code = file.read()

    # model
    stan = stan_cache(model_code=model_code, model_name='preference flows')
    fit = stan.sampling(data=data, iter=10000, chains=5)
    results = fit.extract()

    # capture the coefficients
    coefficients = results['beta']
    print('--- Coefficient Shape: {} ---'.format(coefficients.shape))
    estimates = pd.DataFrame()
    for i, party in zip(range(len(parties)), parties):
        q = np.percentile(coefficients[:, i], quants)
        row = pd.DataFrame(q, index=labels, columns=[party]).T
        estimates = estimates.append(row)

    # capture sigma
    sigma = results['sigma'].T
    q = np.percentile(sigma, quants)
    row = pd.DataFrame(q, index=labels, columns=['sigma']).T
    estimates = estimates.append(row)

    # print results from Stan
    print('From Stan: for {}'.format(firm))
    print(estimates)

    # plot results from Stan
    coefficients = pd.DataFrame(coefficients, columns=parties)
    ax = coefficients.plot.kde(color=['darkgreen', 'goldenrod', 'orchid'])
    ax.set_title('Kernel Density for Preference Flows by {}'.format(firm))
    ax.set_ylabel('KDE')
    ax.set_xlabel('Estimated Flow (Proportion)')
    for i in coefficients.columns:
        ax.axvline(x=coefficients[i].median(), color='gray', linewidth=0.5)
    fig = ax.figure
    fig.set_size_inches(8, 4)
    fig.tight_layout(pad=1)
    fig.text(0.99, 0.01, 'marktheballot.blogspot.com.au',
             ha='right', va='bottom', fontsize='x-small',
             fontstyle='italic', color='#999999')
    fig.savefig(graph_dir + walk_leader + firm + '.png', dpi=125)
    plt.close()