Saturday, May 18, 2019

Last poll update

We now have all of the opinion poll results prior to today's election. And what a remarkable set of numbers they are: the 16 polls taken since the election was called on April 11 all have two-party-preferred estimates in the range 51-52 for Labor and 48-49 for the Coalition.

| Date | Firm | Primary L/NP | Primary ALP | Primary GRN | Primary ONP | Primary OTH | TPP L/NP | TPP ALP |
|------|------|--------------|-------------|-------------|-------------|-------------|----------|---------|
| 15-16 May 2019 | Newspoll | 38 | 37 | 9 | 3 | 13 | 48.5 | 51.5 |
| 13-15 May 2019 | Galaxy | 39 | 37 | 9 | 3 | 12 | 49 | 51 |
| 12-15 May 2019 | Ipsos | 39 | 33 | 13 | 4 | 11 | 49 | 51 |
| 10-14 May 2019 | Essential | 38.5 | 36.2 | 9.1 | 6.6 | 9.6 | 48.5 | 51.5 |
| 10-12 May 2019 | Roy Morgan | 38.5 | 35.5 | 10 | 4 | 12 | 48 | 52 |
| 9-11 May 2019 | Newspoll | 39 | 37 | 9 | 4 | 11 | 49 | 51 |
| 2-6 May 2019 | Essential | 38 | 34 | 12 | 7 | 9 | 48 | 52 |
| 4-5 May 2019 | Roy Morgan | 38.5 | 34 | 11 | 4 | 12.5 | 49 | 51 |
| 2-5 May 2019 | Newspoll | 38 | 36 | 9 | 5 | 12 | 49 | 51 |
| 1-4 May 2019 | Ipsos | 36 | 33 | 14 | 5 | 12 | 48 | 52 |
| 25-29 Apr 2019 | Essential | 39 | 37 | 9 | 6 | 9 | 49 | 51 |
| 27-28 Apr 2019 | Roy Morgan | 39.5 | 36 | 9.5 | 2.5 | 12.5 | 49 | 51 |
| 26-28 Apr 2019 | Newspoll | 38 | 37 | 9 | 4 | 12 | 49 | 51 |
| 23-25 Apr 2019 | Galaxy | 37 | 37 | 9 | 4 | 13 | 48 | 52 |
| 20-21 Apr 2019 | Roy Morgan | 39 | 35.5 | 9.5 | 4.5 | 11.5 | 49 | 51 |
| 11-14 Apr 2019 | Newspoll | 39 | 39 | 9 | 4 | 9 | 48 | 52 |


If we assume that every one of these polls sampled 2000 electors, and that population voting intention sat at 48.5/51.5 for the entire period, then the chance of all 16 polls landing in this band (taken as an underlying Coalition two-party-preferred estimate between 47.5 and 49.5) is about 1 in 1661.

In statistics, we talk about rejecting the null hypothesis when p < 0.05. Here p < 0.001, so let's reject the null hypothesis: these numbers are not the raw output from 16 independent, randomly sampled surveys.

While it could be that the samples are not independent (for example, if the pollsters used panels), or that the samples are not sufficiently random and representative, I suspect the numbers have been manipulated in some way. I would like to think this manipulation is some valid and robust numerical process. But without transparency from the pollsters, how can I be sure?

For those interested, the Python code for the above calculation follows.

import scipy.stats as ss
import numpy as np

# assumed population two-party-preferred vote (per cent) for the Coalition
p = 48.5
q = 100 - p
sample_size = 2000

# standard error of a single poll estimate, in percentage points
sd = np.sqrt((p * q) / sample_size)
print(sd)

# probability that one poll of 2000 lands between 47.5 and 49.5
p_1 = ss.norm(p, sd).cdf(49.5) - ss.norm(p, sd).cdf(47.5)
print('probability for one poll: {}'.format(p_1))

# probability that all 16 independent polls land in that band
p_16 = pow(p_1, 16)
print('probability for sixteen polls in a row: {}'.format(p_16))
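
As a cross-check on that closed-form calculation, a quick Monte Carlo simulation (again assuming simple random samples of 2000 voters, and using the binomial directly rather than the normal approximation) should give a probability of the same small order; small differences arise from the discreteness of the binomial.

# Monte Carlo cross-check: simulate many sets of 16 polls of 2000 voters drawn
# from a population at 48.5 per cent for the Coalition, and count how often
# every poll in a set lands in the 47.5-49.5 band.
import numpy as np

rng = np.random.default_rng(2019)
n_trials, n_polls, n, p_true = 200_000, 16, 2000, 0.485

samples = rng.binomial(n, p_true, size=(n_trials, n_polls)) / n * 100
all_in_band = ((samples > 47.5) & (samples < 49.5)).all(axis=1)
print('simulated probability:', all_in_band.mean())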

My next problem is aggregation. My Bayesian aggregation methodology depends on the polls being normally distributed around the population mean. In practice, it is the outliers (and the inliers towards the edges of the normal range) that do the work of moving the aggregate, and there are no such data points in this series.
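
To see why, here is a stripped-down sketch of the kind of Gaussian updating involved. This is not the model I actually run; the function name, priors and drift parameter are purely illustrative. The point is that the shift applied at each step is proportional to the gap between the new poll and the current estimate, so a run of polls sitting on top of the estimate barely moves it.

# A stripped-down sketch of Gaussian (Kalman-style) poll aggregation.
# Not the model behind the numbers on this blog; all parameters illustrative.
import numpy as np

def aggregate(polls, days, poll_sds, walk_sd=0.15, prior_mean=50.0, prior_sd=5.0):
    """Track a latent TPP that drifts day to day, updating it with each poll."""
    mean, var = prior_mean, prior_sd ** 2
    last_day = 0
    for day, result, sd in sorted(zip(days, polls, poll_sds)):
        var += (walk_sd ** 2) * (day - last_day)   # drift between polls
        last_day = day
        gain = var / (var + sd ** 2)               # weight given to the new poll
        mean = mean + gain * (result - mean)       # a poll at the current mean moves nothing
        var = (1 - gain) * var
    return mean, np.sqrt(var)

# hypothetical usage: three polls of n = 2000 (sampling sd of about 1.1 points)
print(aggregate(polls=[51.5, 51.0, 52.0], days=[0, 10, 20], poll_sds=[1.1, 1.1, 1.1]))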

Setting this aside, when I run the aggregation on all of the polling data since the last election, I get a final aggregated estimate of 48.4 for the Coalition to 51.6 for Labor. On those results, I would expect Labor to win around 80 seats and form government.
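
The 80-seat figure comes from seat-level modelling that I will not reproduce here, but the flavour of the step from a two-party-preferred estimate to seats can be seen with simple uniform-swing arithmetic. The margins in this snippet are made-up placeholders, not real electorates.

# Uniform-swing arithmetic only, for illustration; not the seat model behind
# the 80-seat estimate above. The margins below are invented.

def seats_flipped(coalition_margins, swing_to_labor):
    """Count Coalition-held seats whose margin is smaller than the uniform swing."""
    return sum(1 for margin in coalition_margins if margin < swing_to_labor)

# hypothetical margins (percentage points) for a handful of Coalition marginals
margins = [0.6, 1.1, 1.4, 2.0, 2.2, 3.1]
print(seats_flipped(margins, swing_to_labor=2.1))   # 4 of these invented seats flip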



The ensemble of moving averages is broadly consistent.
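
For completeness, an ensemble of this kind is just the same poll series smoothed with a range of window lengths. A rough sketch follows; it is not my actual charting code, and it assumes a pandas Series of ALP TPP results indexed by fieldwork date.

# A rough sketch of an ensemble of moving averages, assuming a pandas Series
# of ALP TPP results with a DatetimeIndex. Not the charting code used here.
import pandas as pd

def moving_average_ensemble(series, windows=(15, 30, 60, 90, 180)):
    """Centred rolling means of a daily-resampled poll series, one column per window."""
    daily = series.resample('D').mean()   # place irregular poll dates on a daily grid
    return pd.DataFrame({
        f'{w}-day': daily.rolling(window=w, min_periods=1, center=True).mean()
        for w in windows
    })

# usage: moving_average_ensemble(alp_tpp_series).plot()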


Turning to the primary votes, we can see ...