## Monday, January 20, 2020

### Polls, the bias-variance trade-off and the 2019 Election

Data scientists typically want to minimise the error associated with their predictions. Certainly, pollsters want their opinion poll estimates immediately prior to an election to reflect the actual election outcome as closely as possible.

However, this is no easy task. With many predictive models, there is often a trade-off between the error associated with bias and the error associated with variance. In data science, and particularly in the domain of machine learning, this phenomenon is known as the bias-variance trade-off or the bias-variance dilemma.

According to the bias-variance trade-off model, less than optimally tuned predictive models often either deliver high bias and low variance predictions, or the opposite: low bias and high variance predictions. At this stage, a chart might help with your intuition, and I will take a little time to define what I mean by bias and variance.

### Two types of prediction error: bias and variance

Bias tells us how closely our predictions are typically, or on average, aligned with the true value for the population or the actual outcome. In terms of the above bullseye diagram, how closely do our predictions on average land in the very centre of the bullseye?

Variance tells us about the variability in our predictions. It tells us how tightly our individual predictions cluster. In other words, variance indicates how far and wide our predictions are spread. Do they span a small range or a large range?

In our 2x2 bullseye grid above, a predictive model (or in the case of the 2019 opinion polls a system of predictive models) can deliver predictions with high or low variance and high or low bias.
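
To make the two definitions concrete, here is a minimal Python sketch. The poll numbers and the outcome are invented for illustration (they are not real poll results), and the function name `bias_and_variance` is my own.

```python
from statistics import mean, pvariance

def bias_and_variance(predictions, true_value):
    """Return (bias, variance) for a set of predictions of one true value."""
    bias = mean(predictions) - true_value   # how far off we are on average
    variance = pvariance(predictions)       # how widely the predictions spread
    return bias, variance

# Hypothetical numbers only: five polls that cluster tightly but miss the outcome
polls = [48.5, 48.0, 49.0, 48.5, 48.5]
outcome = 51.5
bias, variance = bias_and_variance(polls, outcome)
print(f"bias = {bias:+.2f}, variance = {variance:.2f}")
```

A large absolute bias with a small variance is the "tight cluster, off-centre" quadrant of the bullseye grid.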

### In 2019, the opinion polls exhibited high bias and low variance

My contention is that collectively, the opinion polling in the six-week lead-up to the 2019 Australian Federal election exhibited high bias and low variance. It was in the top left-hand quadrant of the bullseye diagram above.

As evidence for this contention, I would note: Kevin Bonham has argued that by Australian standards, the opinion polls leading up to the 2019 election had their biggest miss since the mid-1980s. Also, I have previously commented on the extraordinarily low variance among the opinion polls prior to the 2019 election.

### Changes in statistical practice and social realities

Before we come to the implications of the high bias/low variance polling outcome in 2019, it is worth reflecting how polling practice has been driven to change in recent times. In particular, I want to highlight the increased importance of data analytics in the production of opinion poll estimates.

Newspoll was famous for introducing an era of reliable and accurate polling in the Australian context. With the near-universal penetration of landline telephones to Australian households by the 1980s, pollsters were able to develop sample frames that came close to the Holy Grail of giving every member of the Australian voting public an equal chance of being included in any survey the pollster ran.

While pollsters came close, their sampling frames were not perfect, and weightings and adjustments were still needed for under-represented groups. Nonetheless, the extent to which weightings and analytics were needed to make up for shortfalls in the sample frame was small compared with today.

So what has changed since the mid-1980s? Quite a few things. Here are but a few:

• the general population use of landlines has declined. In some cases, the plain old telephone system (POTS) has been replaced by a voice over internet protocol (VOIP) service. Many households have terminated their landline service altogether, in favour of the mobile phone. Younger people in share houses are very unlikely to have a shared telephone landline or even a VOIP service. Further, mobile and VOIP numbers are not listed in the White Pages by default. As a consequence, constructing a telephone-based sample frame is much more challenging than it once was.
• the rise of digital technologies has seen a growth in robocalls, which many people find annoying. Caller identification on phones has given rise to the practice of call screening. If a polling call is taken, busy people and those not interested in politics often just hang up. As a result of these and other factors, participation rates in telephone polls have plummeted. In the United States, Pew Research reported a decline in telephone survey responses from 36 per cent in 1997 to just 6 per cent in 2018. Newspoll's decision to abandon telephone polling suggests something similar may have happened in Australia.
• the newspapers that purchase much of the publicly available opinion polling have had their budgets squeezed, and they are wanting to spend less on polling. Ironically, lower telephone survey response rates (as noted above) are driving up the costs of telephone surveys. As a consequence, pollsters have had to turn to lower-cost (and more challenging) sampling techniques, such as internet surveys.
• while more people may be online these days (as Newspoll argues), online participation remains less widespread than landline access was in the mid-1980s.

Because of the above, pollsters need to work much harder to address the shortcomings in their sampling frames. Pollsters need to make far more data-driven adjustments to today's raw poll results than happened in the mid-1980s. This work is far more complex than simply weighting for sub-cohort population shares. Which brings us back to the bias-variance trade-off.

### What the bias-variance trade-off outcome suggests

The error for a predictive model can be thought of as the average of the squared differences between the true value and each of the predicted values from the model:

$$Error = \frac{1}{n}\sum(predicted_y - true_y)^2$$

The critical insight from the bias-variance trade-off is that this error can be decomposed into three components:

$$Error = Bias^2 + Variance + Irreducible Error$$

We have already discussed bias and variance. The irreducible error is that error that cannot be removed by developing a better predictive model. All predictive models will have some noise-related error.
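
The decomposition can be checked numerically when the error is measured as an average squared difference. The sketch below is a toy set-up of my own, not one of the models discussed in this post: a deliberately crude "model" predicts half the mean of a small noisy sample, and every parameter value is an arbitrary assumption.

```python
import random
from statistics import mean, pvariance

def simulate_decomposition(true_value=2.0, noise_sd=1.0, n_train=5,
                           shrink=0.5, n_trials=20_000, seed=1):
    """Monte Carlo check of Error = Bias^2 + Variance + Irreducible Error.

    The model predicts `shrink` times the mean of a small noisy training
    sample, so it is biased whenever shrink != 1.
    """
    rng = random.Random(seed)
    predictions, squared_errors = [], []
    for _ in range(n_trials):
        train = [true_value + rng.gauss(0, noise_sd) for _ in range(n_train)]
        prediction = shrink * mean(train)
        new_obs = true_value + rng.gauss(0, noise_sd)  # the value we try to predict
        predictions.append(prediction)
        squared_errors.append((prediction - new_obs) ** 2)
    return {
        "error": mean(squared_errors),
        "bias_sq": (mean(predictions) - true_value) ** 2,
        "variance": pvariance(predictions),
        "irreducible": noise_sd ** 2,   # noise no model can remove
    }

parts = simulate_decomposition()
total = parts["bias_sq"] + parts["variance"] + parts["irreducible"]
print(f"simulated error      : {parts['error']:.3f}")
print(f"bias^2 + var + noise : {total:.3f}")
```

The two printed figures should agree up to simulation noise, which is all the decomposition claims.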

Simple models tend to underfit the data. They do not capture enough of the information or "signal" available in the data. As a result, they tend to have low variance, but high bias.

Complex models tend to overfit the data. Because they typically have many parameters, they tend to treat noise in the data as the signal. This tends to result in predictive models that have low bias but high variance (driven by the noise in the data).

Again, to help develop our intuition (and to avoid the complex mathematics), let's turn to a visual aid to get a sense of what I mean by under-fitting and over-fitting a predictive model to the data.

The data below (the red dots) comes from a randomly perturbed sine wave between 0 and $$\pi$$. I have three models that look at those red dots (the perturbed data) and try to predict the form or shape of the underlying curve:
• a simple linear regression - which predicts a straight line
• a quadratic regression - which predicts a parabola
• a high-order polynomial regression - which can predict quite a complicated line

The linear regression (the straight blue line, top left) provides an under-fitted model for the curve evident in the data. The quadratic regression (the inverted parabola, bottom left) provides a reasonably good model. It is the closest of the models to a sine wave between the values of 0 and $$\pi$$. And the polynomial regression of degree 25 (the squiggly blue line, bottom right) over-fits the data and reflects some of the random noise introduced by the perturbation process.

The next chart compares the three blue models above, with the actual sine wave before the perturbation. The closer the blue line is to the red line, the less error the model has.

The model with the least error between the predicted line and the sine curve is the parabola. The straight line and the squiggly line both have differences with the actual sine curve for most of their length.
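
For readers who want to reproduce the flavour of this comparison, here is a hedged sketch using NumPy. The number of points, the noise level and the seed are my assumptions, and I fit in a Chebyshev basis only for numerical stability (the fitted curve is still a polynomial of the stated degree). Errors are measured at the sampled x positions.

```python
import numpy as np

def perturbed_sine(rng, n_points=50, noise_sd=0.25):
    """Noisy samples of sin(x) on [0, pi]; the sizes are assumptions."""
    x = np.linspace(0.0, np.pi, n_points)
    return x, np.sin(x) + rng.normal(0.0, noise_sd, n_points)

def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit of the given degree (Chebyshev basis)."""
    return np.polynomial.Chebyshev.fit(x, y, degree)

rng = np.random.default_rng(2019)
x, y = perturbed_sine(rng)

results = {}
for name, degree in [("linear", 1), ("quadratic", 2), ("degree 25", 25)]:
    model = fit_polynomial(x, y, degree)
    results[name] = {
        # how well the curve fits the perturbed points (the red dots)
        "train_mse": float(np.mean((model(x) - y) ** 2)),
        # how well the curve recovers the unperturbed sine wave
        "true_mse": float(np.mean((model(x) - np.sin(x)) ** 2)),
    }
    print(f"{name:>10}: train {results[name]['train_mse']:.4f}, "
          f"true {results[name]['true_mse']:.4f}")
```

The point to look for is that the most complex model always fits the perturbed points at least as well as the simpler ones, yet that tighter fit to the noise does not translate into a better fit to the underlying sine wave.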

When we run these three predictive models many times we can see something about the type of error for these three models. In the next chart, we have generated a randomly perturbed set of points 1000 times (just like we did above) and asked each of the models to predict the underlying line 1000 times. Each of the predictions from the 1000 model runs has then been plotted on the next chart. For reference, the sine curve the models are trying to predict is over-plotted in light-grey.

Looking at the multiple runs, the bias and variance for the distribution of predictions from the three models become clear:

• If we look at the blue linear model in the top left, we see high bias (the straight-line predictions from this model have significant points of deviation from the sine curve). But the 1000 model predictions are tightly clustered showing low variance across the predictions. The distribution of predictions from the under-fitted model shows high bias and low variance.
• The green quadratic model in the bottom left exhibits low bias. The 1000 model predictions closely follow the sine curve. They are also tightly clustered showing low variance. The distribution of predictions from the optimally-fitted model shows low bias and low variance. These predictions have the least total error.
• The red high-order polynomial model in the bottom right exhibits low bias: on average, the 1000 predictions are close to the sine curve. However, the multiple runs exhibit quite a spread of predictions compared with the other two models. The distribution of predictions from the over-fitted model shows low bias, but high variance.
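
The three bullet points above can be put into numbers with a simulation along the following lines. This is a sketch under my own assumptions (50 points per data set, noise standard deviation of 0.25, a Chebyshev basis for numerical stability), not the code behind the charts.

```python
import numpy as np

def bias_variance_by_degree(degrees, n_runs=1000, n_points=50,
                            noise_sd=0.25, seed=0):
    """Estimate squared bias and variance of polynomial fits to a noisy sine.

    Each run draws a fresh perturbed data set, fits a polynomial, and records
    the predicted curve at the sampled x positions.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, np.pi, n_points)
    truth = np.sin(x)
    summary = {}
    for degree in degrees:
        curves = np.empty((n_runs, n_points))
        for run in range(n_runs):
            y = truth + rng.normal(0.0, noise_sd, n_points)
            curves[run] = np.polynomial.Chebyshev.fit(x, y, degree)(x)
        mean_curve = curves.mean(axis=0)
        summary[degree] = {
            # gap between the average prediction and the sine curve
            "bias_sq": float(np.mean((mean_curve - truth) ** 2)),
            # spread of the individual predictions around their own average
            "variance": float(np.mean(curves.var(axis=0))),
        }
    return summary

summary = bias_variance_by_degree([1, 2, 25])
for degree, parts in summary.items():
    print(f"degree {degree:>2}: bias^2 = {parts['bias_sq']:.4f}, "
          f"variance = {parts['variance']:.4f}")
```

In this set-up the straight line should carry most of its error as bias, and the degree 25 polynomial most of its error as variance.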

Coming back to the three-component cumulative error equation above, we can see the least total error occurs at an optimal point between a simple and complex model. Again, let's use a stylised chart to aid intuition.

As model complexity increases, the prediction error associated with bias decreases, but the prediction error from variance increases. This minimal total error sweet-spot (between simplicity and complexity) is the key insight of the bias-variance trade-off model. This is the best trade-off between the error from prediction bias and the error from prediction variance.
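
The sweet-spot can also be located by brute force. The sketch below sweeps the polynomial degree for the perturbed sine example and reports where squared bias plus variance is smallest. The irreducible error is left out because it is the same for every degree. Sample sizes, noise level and the range of degrees are my assumptions.

```python
import numpy as np

def total_error_by_degree(max_degree=10, n_runs=300, n_points=50,
                          noise_sd=0.25, seed=1):
    """Return {degree: (bias_sq, variance, bias_sq + variance)}."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, np.pi, n_points)
    truth = np.sin(x)
    # the same simulated data sets are reused for every degree
    data = truth + rng.normal(0.0, noise_sd, (n_runs, n_points))
    table = {}
    for degree in range(1, max_degree + 1):
        fits = np.array([np.polynomial.Chebyshev.fit(x, y, degree)(x)
                         for y in data])
        bias_sq = float(np.mean((fits.mean(axis=0) - truth) ** 2))
        variance = float(np.mean(fits.var(axis=0)))
        table[degree] = (bias_sq, variance, bias_sq + variance)
    return table

table = total_error_by_degree()
best_degree = min(table, key=lambda degree: table[degree][2])
for degree, (bias_sq, variance, total) in table.items():
    marker = "  <- least total error" if degree == best_degree else ""
    print(f"degree {degree:>2}: bias^2 {bias_sq:.4f}  "
          f"variance {variance:.4f}  total {total:.4f}{marker}")
```

Reading down the printed table, the bias column falls away quickly while the variance column keeps climbing, so the minimum should sit at a low degree.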

### So what does this mean for the polls before the 2019 election?

The high bias and low variance in the 2019 poll results suggest that collectively the pollsters had not done enough to ensure their polling models produced optimal election outcome predictions. In short, their predictive models were not complex enough to adequately overcome the sampling frame problems (noted above) that they were wrestling with.

Without transparency from the pollsters, it is hard to know specifically where the pollsters' approach lacked the necessary complexity.

Nonetheless, I have wondered to what extent the pollsters use each other (either purposefully or through confirmation bias) to benchmark their predictions and adjust for the (perceived) bias that arises from issues with their sampling frames. In effect, mutual cross-validation, should it be occurring, would effectively simplify their collective models.

I have also wondered whether the pollsters look at non-linearity effects between their sample frame and changes in voting intention in the Australian public. If the samples that the pollsters are working with are more educated and more rusted on to a political party on average, their response to changing community sentiment may be muted compared with the broader community. For example, a predictive model that has been validated when community sentiment is at (say) 47 or 48 per cent, may not be reliable when community sentiment moves to 51 per cent.

I have wondered whether the pollsters built rolling averages or some other smoothing technique into their predictive models. Again, if so, this could be a source of simplicity.
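
To see why smoothing would act as a simplification, consider this stylised sketch. It is entirely hypothetical: I do not know whether any pollster smooths its estimates, and the step change, noise level and window length are invented.

```python
import random
from statistics import mean, pvariance

def rolling_mean(values, window):
    """Trailing moving average; the first window - 1 points are dropped."""
    return [mean(values[i - window + 1:i + 1])
            for i in range(window - 1, len(values))]

rng = random.Random(7)
# Hypothetical: true support sits at 48, then steps up to 51 halfway through
truth = [48.0] * 100 + [51.0] * 100
polls = [level + rng.gauss(0.0, 1.5) for level in truth]

window = 10
smoothed = rolling_mean(polls, window)  # smoothed[j] lines up with polls[j + window - 1]

# spread during the stable first half, over the same polls for both series
raw_spread = pvariance(polls[window - 1:100])
smooth_spread = pvariance(smoothed[:100 - window + 1])

# error of the smoothed series at the first poll after the step change
first_after = 100
lag_error = smoothed[first_after - window + 1] - truth[first_after]

print(f"variance of raw polls      : {raw_spread:.2f}")
print(f"variance of smoothed polls : {smooth_spread:.2f}")
print(f"smoothed error after step  : {lag_error:+.2f}")
```

Smoothing buys a much narrower spread, but the smoothed series is slow to register a genuine shift: lower variance at the price of bias.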

I have wondered whether the pollsters are cycling through the same survey respondents over and over again. Again, if so, this could be a source of simplicity.

I have many questions as to what might have gone wrong in 2019. The bias-variance trade-off model provides some suggestion as to what the nature of the problem might have been.

## Friday, December 6, 2019

### Aggregated attitudinal polling

At this point in the election cycle, only Newspoll is publishing primary vote share and two-party preferred population estimates. So there is nothing to aggregate across polling houses when it comes to voting intention. However, both Essential and Newspoll are publishing attitudinal polling. So I decided to build a Dirichlet-multinomial process model to see what trends there are in the attitudinal polling since the 2019 election.

First, however, we will look at the output from the model, before looking at the model itself.

Let's begin with the preferred prime minister polling. We see a small dip in the proportion of the population preferring the Prime Minister over the period (from 45.4 to 44.9 per cent). The Opposition Leader has improved a little over the period (from 26 to 29 per cent), but he is much less preferred than the Prime Minister. The "undecideds" have declined a little (from 29 to 26 per cent).

The median lines from the above charts can be combined on a chart as follows.

The model allows us to compare house effects in preferred Prime Minister polling. Those polled by Essential are more likely to express a preference on their preferred prime minister compared with the other two houses.

The next set of charts are about satisfaction with the Prime Minister's performance. Satisfaction with the Prime Minister has declined from 48 to 45 per cent. Dissatisfaction has increased from 37 to 44 per cent.

Satisfaction with the Opposition Leader has improved from 37 to 38 per cent. Dissatisfaction has increased from 30 to 36 per cent. Undecideds have decreased from 32 to 25 per cent.

In summary, both leaders have seen a decline in net satisfaction. On this metric, the Prime Minister has fallen further than the Opposition Leader. The Opposition Leader ends the year with a higher net satisfaction rating compared with the Prime Minister.
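
The net satisfaction arithmetic behind that summary, using only the figures quoted above (net satisfaction is satisfied minus dissatisfied):

```python
# (satisfied, dissatisfied) in per cent, at the start and end of the period
ratings = {
    "Prime Minister": {"start": (48, 37), "end": (45, 44)},
    "Opposition Leader": {"start": (37, 30), "end": (38, 36)},
}

def net_satisfaction(satisfied, dissatisfied):
    """Net satisfaction in percentage points."""
    return satisfied - dissatisfied

summary = {}
for leader, rating in ratings.items():
    start = net_satisfaction(*rating["start"])
    end = net_satisfaction(*rating["end"])
    summary[leader] = (start, end, end - start)
    print(f"{leader}: {start:+d} -> {end:+d} (change {end - start:+d})")
```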

The model that produced the above charts is as follows.

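```stan
// STAN: Simplex Time Series Model
//  using a Dirichlet-multinomial process
//
// NOTE: lines marked [reconstructed] were missing from this listing. They
// have been filled in to match the description in the text (house effects
// that sum to zero across houses and across categories). The bounds and the
// prior width on those lines are assumptions, not the original values.

data {
    // data size
    int<lower=1> n_polls;
    int<lower=1> n_days;
    int<lower=1> n_houses;
    int<lower=1> n_categories;

    // key variables
    int<lower=1> pseudoSampleSize; // maximum sample size for y
    real<lower=1> transmissionStrength;

    // give a rough idea of a starting point ...
    simplex[n_categories] startingPoint; // rough guess at series starting point
    int<lower=1> startingPointCertainty; // strength of guess - small number is vague

    // poll data
    int<lower=0,upper=pseudoSampleSize> y[n_polls, n_categories]; // a multinomial
    int<lower=1,upper=n_houses> house[n_polls]; // polling house
    int<lower=1,upper=n_days> poll_day[n_polls]; // day polling occurred
}

parameters {
    simplex[n_categories] hidden_voting_intention[n_days];
    matrix<lower=-0.06,upper=0.06>[n_houses, n_categories] houseAdjustment; // [reconstructed]
}

transformed parameters {
    matrix[n_houses, n_categories] aHouseAdjustment; // [reconstructed]
    matrix[n_houses, n_categories] tHouseAdjustment; // [reconstructed]
    for(p in 1:n_categories) // within each category, house effects sum to zero
        aHouseAdjustment[,p] = houseAdjustment[,p] - mean(houseAdjustment[,p]); // [reconstructed]
    for(h in 1:n_houses) // within each house, category effects sum to zero
        tHouseAdjustment[h,] = aHouseAdjustment[h,] - mean(aHouseAdjustment[h,]); // [reconstructed]
}

model{
    // -- house effects model
    for(h in 1:n_houses)
        houseAdjustment[h] ~ normal(0, 0.01); // [reconstructed] assumed prior

    // -- temporal model
    hidden_voting_intention[1] ~ dirichlet(startingPoint * startingPointCertainty);
    for (day in 2:n_days)
        hidden_voting_intention[day] ~
            dirichlet(hidden_voting_intention[day-1] * transmissionStrength);

    // -- observed data model
    for(poll in 1:n_polls)
        y[poll] ~ multinomial(hidden_voting_intention[poll_day[poll]] +
            tHouseAdjustment[house[poll]]'); // [reconstructed] second term
}
```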

The model assumes that house effects sum to zero (both across polling houses and across the simplex categories). I set the startingPointCertainty variable to 10. The prior on the startingPoints is 0.333 for each series. The day-to-day transmissionStrength is set to 50,000 (attitudes yesterday are much the same as today). The pseudoSampleSize is set to 1000.
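
Two of these settings can be unpacked with a little arithmetic. For a Dirichlet with parameters p × s, each share has variance p(1 − p)/(s + 1), which tells us how much day-to-day movement a given transmissionStrength allows. The sketch below also shows one way published percentages might be turned into counts that sum to the pseudoSampleSize. The largest-remainder rounding is my assumption; the text above does not say how the counts in `y` were prepared.

```python
import math

def dirichlet_step_sd(share, transmission_strength):
    """Std dev of one category's share after a single Dirichlet step.

    For x ~ Dirichlet(p * s), Var(x_i) = p_i * (1 - p_i) / (s + 1).
    """
    return math.sqrt(share * (1.0 - share) / (transmission_strength + 1.0))

def to_pseudo_counts(percentages, pseudo_sample_size=1000):
    """Convert percentages to integer counts summing to pseudo_sample_size.

    Uses largest-remainder rounding (an assumption, see the lead-in).
    """
    total = sum(percentages)
    raw = [p / total * pseudo_sample_size for p in percentages]
    counts = [math.floor(r) for r in raw]
    shortfall = pseudo_sample_size - sum(counts)
    by_remainder = sorted(range(len(raw)),
                          key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_remainder[:shortfall]:
        counts[i] += 1
    return counts

daily_sd = dirichlet_step_sd(0.45, 50_000)
print(f"daily sd for a 45 per cent share: {100 * daily_sd:.2f} percentage points")
print(to_pseudo_counts([45, 26, 29]))
```

With a transmissionStrength of 50,000, a share near 45 per cent moves by roughly 0.2 percentage points a day, which is what "much the same as today" amounts to.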

As usual, the data for this analysis has been sourced from Wikipedia.

## Saturday, June 29, 2019

### Three anchored models

I have three anchored models for the period 2 July 2016 to 18 May 2019. The first is anchored to the 2016 election result (left anchored). The second model is anchored to the 2019 election result (right anchored). The third model is anchored to both election results (left and right anchored).  Let's look at these models.

The first thing to note is that the median lines in the left-anchored and right-anchored models are very similar. It is pretty much the same line moved up or down by 1.4 percentage points. As we have discussed previously, this difference of 1.4 percentage points is effectively a drift in the collective polling house effects over the period from 2016 to 2019. The polls opened after the 2016 election with a collective 1.7 percentage point pro-Labor bias. This bias grew by a further 1.4 percentage points to reach 3.1 percentage points at the time of the 2019 election (the difference between the yellow line and the blue/green lines on the right hand side of the last chart above).

The third model, the left-and-right anchored model, forces this drift to be reconciled within the model (but without any guidance from the model). The left-and-right anchored model explicitly assumes there is no such drift (i.e. house effects are constant and unchanging). In handling this unspecified drift, the left-and-right anchored model has seen much of the adjustment occur close to the two anchor points at the left and right extremes of the chart. The shape of the middle of the chart is not dissimilar to the singly anchored charts.

While this is the output for the left-and-right anchored model, I would advise caution in assuming that the drift in polling house effects actually occurred in the period immediately after the 2016 election and immediately prior to the 2019 election. It is just that this is the best mathematical fit for a model that assumes there has been no drift. The actual drift could have happened slowly over the entire period, or quickly at the beginning, somewhere in the middle, or towards the end of the three year period.

My results for the left-and-right anchored model are not dissimilar to Jackman and Mansillo. The differences between our charts are largely a result of how I treat the day-to-day variance in voting intention (particularly following the polling discontinuity associated with the leadership transition from Turnbull to Morrison). I chose to specify this variance, rather than model it as a hyper-prior. I specified this parameter because: (a) we can observe higher volatility immediately following discontinuity events, and (b) the sparse polling results in Australia, especially in the 2016-19 period, produce an under-estimate for this variance in this model.

All three models have a very similar result for the discontinuity event itself: an impact just under three percentage points. Note: these charts are not in percentage points, but vote shares.

And just to complete the analysis, let's look at the house effects. With all of these house effects, I would urge caution. These house effects are an artefact of the best fit in models that do not allow for the 1.4 percentage point drift in collective house effects that occurred between 2016 and 2019.

The three models are almost identical.

```stan
// STAN: Two-Party Preferred (TPP) Vote Intention Model
//     - Fixed starting-point

data {
    // data size
    int n_polls;
    int n_days;
    int n_houses;

    // assumed standard deviation for all polls
    real pseudoSampleSigma;

    // poll data
    vector[n_polls] y; // TPP vote share
    int house[n_polls];
    int day[n_polls];

    // period of discontinuity event
    int discontinuity;
    int stability;

    // election outcome anchor point
    real start_anchor;
}

transformed data {
    // fixed day-to-day standard deviation
    real sigma = 0.0015;
    real sigma_volatile = 0.0045;

    // house effect range
    real lowerHE = -0.07;
    real upperHE = 0.07;

    // tightness of anchor points
    real tight_fit = 0.0001;
}

parameters {
    vector[n_days] hidden_vote_share;
    vector[n_houses] pHouseEffects;
    real disruption;
}

model {
    // -- temporal model [this is the hidden state-space model]
    disruption ~ normal(0.0, 0.15); // PRIOR
    hidden_vote_share[1] ~ normal(start_anchor, tight_fit); // ANCHOR

    hidden_vote_share[2:(discontinuity-1)] ~
        normal(hidden_vote_share[1:(discontinuity-2)], sigma);

    hidden_vote_share[discontinuity] ~
        normal(hidden_vote_share[discontinuity-1]+disruption, sigma);

    hidden_vote_share[(discontinuity+1):stability] ~
        normal(hidden_vote_share[discontinuity:(stability-1)], sigma_volatile);

    hidden_vote_share[(stability+1):n_days] ~
        normal(hidden_vote_share[stability:(n_days-1)], sigma);

    // -- house effects model - uniform distributions
    pHouseEffects ~ uniform(lowerHE, upperHE); // PRIOR

    // -- observed data / measurement model
    y ~ normal(pHouseEffects[house] + hidden_vote_share[day],
        pseudoSampleSigma);
}
```


```stan
// STAN: Two-Party Preferred (TPP) Vote Intention Model
//     - Fixed end-point only

data {
    // data size
    int n_polls;
    int n_days;
    int n_houses;

    // assumed standard deviation for all polls
    real pseudoSampleSigma;

    // poll data
    vector[n_polls] y; // TPP vote share
    int house[n_polls];
    int day[n_polls];

    // period of discontinuity event
    int discontinuity;
    int stability;

    // election outcome anchor point
    real end_anchor;
}

transformed data {
    // fixed day-to-day standard deviation
    real sigma = 0.0015;
    real sigma_volatile = 0.0045;

    // house effect range
    real lowerHE = -0.07;
    real upperHE = 0.07;

    // tightness of anchor points
    real tight_fit = 0.0001;
}

parameters {
    vector[n_days] hidden_vote_share;
    vector[n_houses] pHouseEffects;
    real disruption;
}

model {
    // -- temporal model [this is the hidden state-space model]
    disruption ~ normal(0.0, 0.15); // PRIOR
    hidden_vote_share[1] ~ normal(0.5, 0.15); // PRIOR

    hidden_vote_share[2:(discontinuity-1)] ~
        normal(hidden_vote_share[1:(discontinuity-2)], sigma);

    hidden_vote_share[discontinuity] ~
        normal(hidden_vote_share[discontinuity-1]+disruption, sigma);

    hidden_vote_share[(discontinuity+1):stability] ~
        normal(hidden_vote_share[discontinuity:(stability-1)], sigma_volatile);

    hidden_vote_share[(stability+1):n_days] ~
        normal(hidden_vote_share[stability:(n_days-1)], sigma);

    // -- house effects model - uniform distributions
    pHouseEffects ~ uniform(lowerHE, upperHE); // PRIOR

    // -- observed data / measurement model
    y ~ normal(pHouseEffects[house] + hidden_vote_share[day],
        pseudoSampleSigma);
    end_anchor ~ normal(hidden_vote_share[n_days], tight_fit); // ANCHOR
}
```


```stan
// STAN: Two-Party Preferred (TPP) Vote Intention Model
//     - Fixed starting-point and end-point

data {
    // data size
    int n_polls;
    int n_days;
    int n_houses;

    // assumed standard deviation for all polls
    real pseudoSampleSigma;

    // poll data
    vector[n_polls] y; // TPP vote share
    int house[n_polls];
    int day[n_polls];

    // period of discontinuity event
    int discontinuity;
    int stability;

    // election outcome anchor point
    real start_anchor;
    real end_anchor;
}

transformed data {
    // fixed day-to-day standard deviation
    real sigma = 0.0015;
    real sigma_volatile = 0.0045;

    // house effect range
    real lowerHE = -0.07;
    real upperHE = 0.07;

    // tightness of anchor points
    real tight_fit = 0.0001;
}

parameters {
    vector[n_days] hidden_vote_share;
    vector[n_houses] pHouseEffects;
    real disruption;
}

model {
    // -- temporal model [this is the hidden state-space model]
    disruption ~ normal(0.0, 0.15); // PRIOR
    hidden_vote_share[1] ~ normal(start_anchor, tight_fit); // ANCHOR

    hidden_vote_share[2:(discontinuity-1)] ~
        normal(hidden_vote_share[1:(discontinuity-2)], sigma);

    hidden_vote_share[discontinuity] ~
        normal(hidden_vote_share[discontinuity-1]+disruption, sigma);

    hidden_vote_share[(discontinuity+1):stability] ~
        normal(hidden_vote_share[discontinuity:(stability-1)], sigma_volatile);

    hidden_vote_share[(stability+1):n_days] ~
        normal(hidden_vote_share[stability:(n_days-1)], sigma);

    // -- house effects model - uniform distributions
    pHouseEffects ~ uniform(lowerHE, upperHE); // PRIOR

    // -- observed data / measurement model
    y ~ normal(pHouseEffects[house] + hidden_vote_share[day],
        pseudoSampleSigma);
    end_anchor ~ normal(hidden_vote_share[n_days], tight_fit); // ANCHOR
}
```
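
To get a feel for what the temporal part of these models assumes, the following Python sketch draws one path from the same random walk: ordinary days move with `sigma`, the discontinuity day adds `disruption`, and the days up to `stability` move with `sigma_volatile`. The function name, the day numbers and the size of the disruption are hypothetical; only the two sigma values and the 2016 starting point are taken from this blog.

```python
import random

def simulate_hidden_vote_share(n_days, start_anchor, discontinuity, stability,
                               disruption, sigma=0.0015, sigma_volatile=0.0045,
                               seed=0):
    """Draw one path from the Gaussian random walk in the Stan temporal model.

    Days are 1-based as in Stan; the returned list is 0-based, so day d is
    path[d - 1].
    """
    rng = random.Random(seed)
    path = [start_anchor]
    for day in range(2, n_days + 1):
        previous = path[-1]
        if day == discontinuity:
            # one-off step change on the day of the discontinuity event
            path.append(rng.gauss(previous + disruption, sigma))
        elif discontinuity < day <= stability:
            # more volatile period immediately after the event
            path.append(rng.gauss(previous, sigma_volatile))
        else:
            path.append(rng.gauss(previous, sigma))
    return path

# Hypothetical timing and disruption size, for illustration only
n_days, discontinuity, stability, disruption = 1000, 780, 840, -0.03
path = simulate_hidden_vote_share(n_days, 0.5036, discontinuity, stability,
                                  disruption)
jump = path[discontinuity - 1] - path[discontinuity - 2]
print(f"start {path[0]:.4f}, jump on discontinuity day {jump:+.4f}, "
      f"end {path[-1]:.4f}")
```

Running it with different seeds gives a sense of how far a walk with these fixed standard deviations can wander between anchor points.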


Update: Kevin Bonham is also exploring what public voting intention might have looked like during the 2016-19 period.

## Tuesday, June 18, 2019

### Further polling reflections

I have been pondering on whether the polls have been out of whack for some time, or whether it was a recent failure (over the previous 3, 6 or (say) 12 months). In previous posts, I looked at YouGov in 2017, and at monthly polling averages prior to the 2019 election.

Today I want to look at the initial polls following the 2016 election. First, however, let's recap the model I used for the 2019 election. In this model, I excluded YouGov and Roy Morgan from the sum-to-zero constraint on house effects. I have added a starting point reference to these charts, and increased the rounding on the labels from one decimal place to two. (However, I would caution against reading these models to two decimal places; the models are not that precise.)

What is worth noting is that this series opens on 6 July 2016 some 1.7 percentage points down from the election result of 50.36 per cent of the two-party preferred (TPP) vote for the Coalition on 2 July 2016. The series closes some 3.1 percentage points down from the 18 May 2019 election result. It appears that the core-set of Australian pollsters started some 1.7 percentage points off the mark, and collectively gained a further 1.4 percentage points of error over the period from July 2016 to May 2019.

These initial polls are all from Essential, and they are under-dispersed. (We discussed the under-dispersion problem here, here, here, and here. I will come back to this problem in a future post). The first two Newspolls were closer to the election result, but they then aligned with Essential from then on. The Newspolls from this period are also under-dispersed.

We can see how closely Newspoll and Essential tracked each other on average from the following chart of average house effects. I have Newspoll twice in this chart: once based on the original method for allocating preferences, and again (as Newspoll2) for the revised allocation of One Nation preferences from late in 2017.

If I had aggregated the polls prior to the 2019 election by anchoring the line to the previous election, I would have achieved a better estimate of the Coalition's performance than I did. Effectively I would have predicted a tie or a very narrow Coalition victory if I had aggregated the polls for this election with an anchor to the previous election.
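
A back-of-the-envelope version of that claim, in Python. It assumes that anchoring simply lifts the whole line by the opening gap (a simplification of what the model does), and it takes the 2019 result as roughly 51.5 per cent, the figure used elsewhere on this blog.

```python
# All figures are Coalition TPP in percentage points, as quoted on this blog
result_2019 = 51.5      # approximate 2019 election result
opening_gap = 1.7       # polls below the 2016 result when the series opened
closing_gap = 3.1       # polls below the 2019 result when the series closed

unanchored_final = result_2019 - closing_gap      # the sum-to-zero aggregate
anchored_final = unanchored_final + opening_gap   # lift the line by the opening gap
remaining_miss = result_2019 - anchored_final     # drift the anchor cannot remove

print(f"unanchored final estimate: {unanchored_final:.1f}")
print(f"left-anchored estimate   : {anchored_final:.1f}")
print(f"remaining miss           : {remaining_miss:.1f}")
```

An estimate a shade above 50 per cent is the "tie or very narrow Coalition victory" described above, and the remaining miss is the drift accumulated since July 2016.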

A good question to ask at this point is why did I not anchor the model to the previous election? The short answer is that I have watched a number of aggregators in past election cycles use an anchored model and end up with worse predictions than those who assumed the house effects across the pollsters cancel each other out on average. I have also assumed that pollsters use elections to recalibrate their polling methodologies, and this recalibration represents a series break. A left-hand side anchored series assumes there have been no series breaks.

In summary, at least 1.7 percentage points of polling error were baked in from the very first polls following the 2016 election. Over the period since July 2016, this error has increased to 3.1 percentage points.

Wonky note: For the anchored model, I changed the priors on house effects from weakly informative normals centred on zero, to uniform priors in the range -6% to +6%. I did this because the weakly informative priors were dragging the aggregation towards the centre of the data points.
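
The pull that a zero-centred normal prior exerts can be illustrated with the textbook normal-normal update. This is a stylised sketch, not a reproduction of the Stan model: it treats a single house effect in isolation, assumes the poll standard deviation is known, and uses invented numbers throughout.

```python
def posterior_mean_house_effect(mean_residual, n_polls, poll_sd, prior_sd):
    """Posterior mean of a house effect under a normal(0, prior_sd) prior.

    Standard conjugate update with known poll_sd: the average residual is
    shrunk towards zero by the weight the prior carries.
    """
    data_precision = n_polls / poll_sd ** 2
    prior_precision = 1.0 / prior_sd ** 2
    weight_on_data = data_precision / (data_precision + prior_precision)
    return weight_on_data * mean_residual

# Hypothetical: 20 polls from one house sit 3 points below the anchored line
raw = -0.030
values = {}
for prior_sd in (0.005, 0.02, 1.0):
    values[prior_sd] = posterior_mean_house_effect(raw, n_polls=20,
                                                   poll_sd=0.016,
                                                   prior_sd=prior_sd)
    print(f"prior sd {prior_sd:>5}: estimated house effect {values[prior_sd]:+.4f}")
```

The tighter the prior, the more of the gap is pushed out of the house effect and back onto the hidden vote share. A prior that is flat over the plausible range, such as the uniform used here, leaves the estimate at the raw gap.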

The anchored STAN model code follows.

```stan
// STAN: Two-Party Preferred (TPP) Vote Intention Model
//     - Updated for fixed starting point

data {
    // data size
    int n_polls;
    int n_days;
    int n_houses;

    // assumed standard deviation for all polls
    real pseudoSampleSigma;

    // poll data
    vector[n_polls] y; // TPP vote share
    int house[n_polls];
    int day[n_polls];

    // period of discontinuity event
    int discontinuity;
    int stability;

    // previous election outcome anchor point
    real election_outcome;
}

transformed data {
    // fixed day-to-day standard deviation
    real sigma = 0.0015;
    real sigma_volatile = 0.0045;

    // house effect range
    real lowerHE = -0.06;
    real upperHE = 0.06;
}

parameters {
    vector[n_days] hidden_vote_share;
    vector[n_houses] pHouseEffects;
    real disruption;
}

model {
    // -- temporal model [this is the hidden state-space model]
    disruption ~ normal(0.0, 0.15); // PRIOR
    hidden_vote_share[1] ~ normal(election_outcome, 0.00001); // ANCHOR

    hidden_vote_share[2:(discontinuity-1)] ~
        normal(hidden_vote_share[1:(discontinuity-2)], sigma);

    hidden_vote_share[discontinuity] ~
        normal(hidden_vote_share[discontinuity-1]+disruption, sigma);

    hidden_vote_share[(discontinuity+1):stability] ~
        normal(hidden_vote_share[discontinuity:(stability-1)], sigma_volatile);

    hidden_vote_share[(stability+1):n_days] ~
        normal(hidden_vote_share[stability:(n_days-1)], sigma);

    // -- house effects model
    pHouseEffects ~ uniform(lowerHE, upperHE); // PRIOR

    // -- observed data / measurement model
    y ~ normal(pHouseEffects[house] + hidden_vote_share[day],
        pseudoSampleSigma);
}
```


## Saturday, June 8, 2019

### Was YouGov the winner of the 2016-19 polling season?

I have been wondering whether the pollsters have been off the mark for some years, or whether this is something that emerged recently (say, since Morrison's appointment as Prime Minister or since Christmas 2018). Today's exploration suggests the former: The pollsters have been off the mark for a number of years this electoral cycle.

Back in June 2017, international pollster YouGov appeared on the Australian polling scene with what looked like a fairly implausible set of poll results. The series was noisy, and well to the right of the other polling houses at the time. Back then, most pundits dismissed YouGov as a quaint curiosity.

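| Date | Firm | L/NP primary % | ALP primary % | GRN primary % | ONP primary % | OTH primary % | L/NP TPP % | ALP TPP % |
|---|---|---|---|---|---|---|---|---|
| 7-10 Dec 2017 | YouGov | 34 | 35 | 11 | 8 | 13 | 50 | 50 |
| 23-27 Nov 2017 | YouGov | 32 | 32 | 10 | 11 | 16 | 47 | 53 |
| 14 Nov 2017 | YouGov | 31 | 34 | 11 | 11 | 14 | 48 | 52 |
| 14-18 Sep 2017 | YouGov | 34 | 35 | 11 | 9 | 11 | 50 | 50 |
| 31 Aug - 4 Sep 2017 | YouGov | 34 | 32 | 12 | 9 | 13 | 50 | 50 |
| 17-21 Aug 2017 | YouGov | 34 | 33 | 10 | 10 | 13 | 51 | 49 |
| 20-24 Jul 2017 | YouGov | 36 | 33 | 10 | 8 | 13 | 50 | 50 |
| 6-11 Jul 2017 | YouGov | 36 | 33 | 12 | 7 | 12 | 52 | 48 |
| 22-27 Jun 2017 | YouGov | 33 | 34 | 12 | 7 | 14 | 49 | 51 |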
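
As a quick check on where these nine polls sat, the simple (unweighted) average of the two-party preferred columns can be computed directly from the figures above:

```python
from statistics import mean

# Two-party preferred figures from the nine YouGov polls listed above,
# in the order shown (newest first)
coalition_tpp = [50, 47, 48, 50, 50, 51, 50, 52, 49]
labor_tpp = [50, 53, 52, 50, 50, 49, 50, 48, 51]

coalition_mean = mean(coalition_tpp)
labor_mean = mean(labor_tpp)
print(f"Coalition TPP average: {coalition_mean:.1f}")
print(f"Labor TPP average    : {labor_mean:.1f}")
```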

The 2017 YouGov series was short-lived. In December 2017, YouGov acquired Galaxy, which had acquired Newspoll in May 2015. YouGov ceased publishing poll results under its own brand. Newspoll continued without noticeable change. By the time of the 2019 election, these nine YouGov polls from 2017 had been long forgotten.

Today's thought experiment: What if those nine YouGov polls were correct (on average)? I can answer this question by changing the Bayesian aggregation model so that it is centred on the YouGov polls, rather than assuming the house effects across a core-set of pollsters sum to zero. Making this change yields a final poll aggregate of 51.3 per cent for the Coalition. This would have been remarkably prescient of the final 2019 election outcome (51.5 per cent).
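
Why does the choice of centring matter so much? A poll is modelled as the hidden vote share plus a house effect, so any constant can be moved between the two without changing the fit. The sketch below shows that bookkeeping with invented house effects (the house names "House A" to "House C" and all the numbers are hypothetical, chosen only to echo the size of the shift described above).

```python
def recentre_on(house_effects, reference):
    """Re-express house effects relative to one reference house.

    Returns the new effects and the shift to add to the hidden vote share so
    that every fitted poll value (hidden + house effect) is unchanged.
    """
    shift = house_effects[reference]
    recentred = {house: effect - shift
                 for house, effect in house_effects.items()}
    return recentred, shift

# Hypothetical effects in vote-share units: the three core houses sum to zero
# and YouGov sits outside that constraint
effects = {"House A": -0.004, "House B": 0.001, "House C": 0.003,
           "YouGov": 0.029}
hidden = 0.484  # hypothetical final aggregate under the sum-to-zero constraint

recentred, shift = recentre_on(effects, "YouGov")
new_hidden = hidden + shift

for house in effects:
    before = hidden + effects[house]
    after = new_hidden + recentred[house]
    print(f"{house}: fitted poll {before:.3f} before, {after:.3f} after")
print(f"aggregate moves from {hidden:.3f} to {new_hidden:.3f}")
```

The polls cannot tell these two descriptions apart. Only an outside reference point, such as an election result, can say which centring was closer to the truth.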

The house effects in this model are as follows.

And if we adjust the poll results for the median house effects identified in the previous chart, we get a series like this.

YouGov is a reliable international polling house. It gets a B-grade from FiveThirtyEight. When it entered the Australian market in 2017, YouGov produced poll results that were on average up to 3 percentage points to the right of the other pollsters. The election in 2019 also produced a result that was around 3 percentage points to the right of the pollsters. That a respected international pollster can enter the Australian market and produce this result in 2017, suggests our regular Australian pollsters may have been missing the mark for quite some time.

Note: as usual, the above poll results are sourced from Wikipedia.

## Sunday, June 2, 2019

### More random reflections on the 2019 polls

Over the next few months, I will post some random reflections on the polls prior to the 2019 Election, and what went wrong. Today's post is a look at the two-party preferred (TPP) poll results over the past 12 months (well from 1 May 2018 to be precise). I am interested in the underlying patterns: both the periods of polling stability and when the polls changed.

With blue lines in the chart below, I have highlighted four periods when the polls look relatively stable. The first period is the last few months of the Turnbull premiership. The second period is Morrison's new premiership for the remainder of 2018. The third period is the first three months (and ten days in April) of 2019, prior to the election being called. The fourth and final period is from the dissolution of Parliament to the election. What intrigues me is the relative polling stability during each of these periods, and the marked jumps in voting intention (often over a couple of weeks) between these periods of stability.

To provide another perspective, I have plotted in red the calendar month polling averages. For the most part, these monthly averages stay close to the four period-averages I identified.

The only step change that I can clearly explain is the change from Turnbull to Morrison (immediately preceded by the Dutton challenge to Turnbull's leadership). This step change is emblematic of one of the famous aphorisms of Australian politics: disunity is death.

It is ironic to note that the highest monthly average for the year was 48.8 per cent in July 2018 under Turnbull. It is intriguing to wonder whether the polls were as out of whack in July 2018 as they were in May 2019 (when they collectively failed to foreshadow a Coalition TPP vote share at the 2019 election in excess of 51.5 per cent). Was Turnbull toppled for electability issues when he actually had 52 per cent of the TPP vote share?

The next step change that might be partially explainable is the last one: chronologically, it is associated with the April 2 Budget followed by the calling of the election on 11 April 2019. The Budget was a classic pre-election Budget (largess without nasties), and calling the election focuses the mind of the electorate on the outcome. However, I really do not find this explanation satisfying. Budgets are very technical documents, and people usually only understand the costs and benefits when they actually experience them. Nothing in the Budget was implemented prior to the election being called.

I am at a loss to explain the step change over the Christmas/New-Year period at the end of 2018 and the start of 2019. It was clearly a summer of increasing contentment with the government.

I am also intrigued by the question of whether the polls have been consistently wrong over this one-year period, or whether the polls have increasingly deviated from the population voting intention as they failed to fully comprehend Morrison's improved polling position over recent months.

Note: as usual I am relying on Wikipedia for the Australian opinion polling data.

## Thursday, May 23, 2019

### Further analysis of poll variance

The stunning feature of the opinion polls leading up to the 2019 Federal Election is that they did not look like the statistics you would expect from independent, random-sample polls of the voting population. All sixteen polls were within one percentage point of each other. As I have indicated previously, this is much more tightly clustered than is mathematically plausible. This post explores that mathematics further.

### My initial approach to testing for under-dispersion

One of the foundations of statistics is the notion that if I draw many independent and random samples from a population, the means of those many random samples will be approximately normally distributed around the population mean (represented by the Greek letter mu $$\mu$$). This is known as the Central Limit Theorem or the Sampling Distribution of the Sample Mean. As a rule of thumb, the normal approximation works well for samples of size 30 or more.

The span or spread of the distribution of the many sample means around the population mean will depend on the size of those samples, which is usually denoted with a lower-case $$n$$. Statisticians measure this spread through the standard deviation (which is usually denoted by the Greek letter sigma $$\sigma$$). With the two-party preferred voting data, the standard deviation for the sample proportions is given by the following formula:

$$\sigma = \sqrt{\frac{proportion_{CoalitionTPP} * proportion_{LaborTPP}}{n}}$$
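As a rough worked instance of this formula, take an illustrative poll of 1000 voters with a Coalition TPP share of 0.48625 and a Labor TPP share of 0.51375 (these inputs are chosen for illustration only):

$$\sigma = \sqrt{\frac{0.48625 \times 0.51375}{1000}} \approx 0.0158$$

That is a standard deviation of about 1.58 percentage points. Multiplying by 1.96 gives the familiar 95 per cent "margin of error" of roughly 3.1 percentage points for a poll of this size.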

While I have the sample sizes for most of the sixteen polls prior to the 2019 Election, I do not have the sample size for the final YouGov/Galaxy poll. Nor do I have the sample size for the Essential poll on 25–29 Apr 2019. For analytical purposes, I have assumed both surveys were of 1000 people. The sample sizes for the sixteen polls ranged from 707 to 3008. The mean sample size was 1403.

If we take the smallest poll, with a sample of 707 voters, we can use the standard deviation to see how likely it was to have a poll result in the range 48 to 49 for the Coalition. We will need to make an adjustment, as most pollsters round their results to the nearest whole percentage point before publication.

So the question is: if we assume the population voting intention for the Coalition was 48.625 per cent (the arithmetic mean of the sixteen polls), what is the probability that a sample of 707 voters falls in the range 47.5 to 49.5, which would round to 48 or 49 per cent?

For samples of 707 voters, and assuming the population mean was 48.625, we would only expect to see a poll result of 48 or 49 around 40 per cent of the time. This is the area under the curve from 47.5 to 49.5 on the x-axis when the entire area under the curve sums to 1 (or 100 per cent).

We can compare this with the expected distribution for the largest sample of 3008 voters. Our adjustment here is slightly different, as the pollster, in this case, rounded to the nearest half a percentage point. So we are interested in the area under the curve from 47.75 to 49.25 per cent.

Because the sample size ($$n$$) is larger, the spread of this distribution is narrower (compare the scale on the x-axis for both charts). We would expect almost 60 per cent of the samples to produce a result in the range 48 to 49 if the population mean ($$\mu$$) was 48.625 per cent.
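The two calculations above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library (`statistics.NormalDist`) rather than the scipy-based code further down; the helper name `prob_published_in_range` is hypothetical, and the population mean of 48.625 is the assumption stated in the text.

```python
from math import sqrt
from statistics import NormalDist


def prob_published_in_range(n, mu, low, high):
    """Probability that a poll of n voters lands between low and high.

    mu, low and high are in percentage points; the sampling distribution
    is approximated by a normal distribution, as in the text.
    """
    sigma = sqrt(mu * (100 - mu) / n)  # standard deviation in percentage points
    dist = NormalDist(mu, sigma)
    return dist.cdf(high) - dist.cdf(low)


mu = 48.625  # assumed population Coalition TPP (mean of the sixteen polls)

# Smallest poll: results rounded to the nearest whole percentage point
p_small = prob_published_in_range(707, mu, 47.5, 49.5)

# Largest poll: results rounded to the nearest half percentage point
p_large = prob_published_in_range(3008, mu, 47.75, 49.25)

print(f"n=707:  {p_small:.1%}")
print(f"n=3008: {p_large:.1%}")
```

The two printed values should land close to the "around 40 per cent" and "almost 60 per cent" figures quoted above.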

We can extend this technique to all sixteen polls. For each poll, we can find the proportion of all possible samples we would expect to generate a published poll result of 48 or 49. We can then multiply these probabilities together to get the probability that all sixteen polls would be in the range. Using this method, I estimate that there is a one in 49,706 chance that all sixteen polls would be in the range 48 to 49 for the Coalition (if the polls were independent random samples of the population, and the population mean was 48.625 per cent).
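Written out as a formula, the combined probability is a product over the sixteen polls. The symbols $$l_i$$ and $$h_i$$ are notation introduced here for the lower and upper rounding bounds of poll $$i$$ (47.5 and 49.5 for whole-point rounding, 47.75 and 49.25 for half-point rounding), and $$\Phi$$ is the standard normal cumulative distribution function:

$$P(\text{all sixteen in range}) = \prod_{i=1}^{16} \biggl[ \Phi\biggl( \frac{h_i - \mu}{\sigma_i} \biggr) - \Phi\biggl( \frac{l_i - \mu}{\sigma_i} \biggr) \biggr]$$

Multiplying the individual probabilities is only valid under the independence assumption stated above.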

### Chi-squared goodness of fit

Another approach is to apply a Chi-squared ($$\chi^2$$) test for goodness of fit to the sixteen polls. We can use this approach because the Central Limit Theorem tells us that the poll results should be normally distributed around the population mean. The Chi-squared test will tell us whether the spread of the poll results is consistent with that normal distribution. In this case, the formula for the Chi-squared statistic is:

$$\chi^2 = \sum_{i=1}^k {\biggl( \frac{x_i - \mu}{\sigma_i} \biggr)}^2$$

Let's step through this equation. It is nowhere near as scary as it looks. To calculate the Chi-squared statistic, we do the following calculation for each poll:
• First, we calculate the mean deviation for the poll by taking the published poll result ($$x_i$$) and subtracting the population mean ($$\mu$$), which we estimated using the arithmetic mean for all of the polls.
• We then divide the mean deviation by the standard deviation for the poll ($$\sigma_i$$).
• We then square the result (multiply it by itself) - this ensures we get a positive statistic in respect of every poll.
Finally, we sum these ($$k=16$$) squared results from each of the polls.
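The steps above can be sketched as a plain Python loop. This is a minimal illustration, not the full analysis: the sample sizes and published Labor TPP figures are copied from the code snippet later in this post, and the variable names are chosen here for readability. Working with the Labor figures gives the same statistic as working with the Coalition figures, because the deviations are mirror images and the standard deviations are identical.

```python
# Data copied from the code snippet at the end of this post
sample_sizes = [3008, 1000, 1842, 1201, 1265, 1644, 1079, 826,
                2003, 1207, 1000, 826, 2136, 1012, 707, 1697]
labor_tpp = [51.5, 51, 51, 51.5, 52, 51, 52, 51,
             51, 52, 51, 51, 51, 52, 51, 52]

# Estimate the population mean from the polls themselves
mu = sum(labor_tpp) / len(labor_tpp)

chi_squared = 0.0
for x_i, n_i in zip(labor_tpp, sample_sizes):
    sigma_i = (x_i * (100 - x_i) / n_i) ** 0.5  # standard deviation for this poll
    deviation = x_i - mu                         # step 1: subtract the mean
    standardised = deviation / sigma_i           # step 2: divide by sigma
    chi_squared += standardised ** 2             # step 3: square, then sum

print(round(chi_squared, 2))  # 1.68
```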

If the polls are normally distributed, the difference between each poll result and the population mean (the mean deviation) should typically be around one standard deviation, so each squared term in the sum should average about one. For sixteen polls that were normally distributed around the population mean, we would therefore expect a Chi-squared statistic of roughly sixteen.

If the Chi-squared statistic is much less than 16, the poll results could be under-dispersed. If the Chi-squared statistic is much more than 16, then the poll results could be over-dispersed. For sixteen polls (which have 15 degrees of freedom, because our estimate for the population mean ($$\mu$$) is constrained by and comes from the 16 poll results), we would expect 99 per cent of the Chi-squared statistics to be between 4.6 and 32.8.

The Chi-squared statistic I calculate for the sixteen polls is 1.68, which is much less than the expected 16 on average. I can convert this 1.68 Chi-squared statistic to a probability for 15 degrees of freedom. When I do this, I find that if the polls were truly independent and random samples (and therefore normally distributed), there would be a one in 108,282 chance of generating a distribution of poll results as narrow as (or narrower than) the one we saw prior to the 2019 Federal Election. We can confidently say the published polls were under-dispersed.
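For readers without scipy to hand, the conversion from a Chi-squared statistic to a lower-tail probability can be approximated with the standard library alone. The sketch below swaps in a textbook series expansion for the regularised lower incomplete gamma function in place of `scipy.stats.chi2.cdf`; the function name `chi2_cdf` and the fixed number of series terms are choices made here for illustration.

```python
from math import exp, lgamma, log


def chi2_cdf(x, dof, terms=200):
    """Lower-tail probability of a Chi-squared distribution.

    Uses the series P(a, z) = exp(-z) * z**a / Gamma(a) * sum_k z**k / (a (a+1) ... (a+k))
    with a = dof / 2 and z = x / 2.
    """
    a, z = dof / 2.0, x / 2.0
    if z <= 0:
        return 0.0
    term = 1.0 / a          # k = 0 term of the series
    total = term
    for k in range(1, terms):
        term *= z / (a + k)  # each term is the previous one times z / (a + k)
        total += term
    return total * exp(-z + a * log(z) - lgamma(a))


# The rounded statistic from the text, with 15 degrees of freedom
p = chi2_cdf(1.68, 15)
print(f"about one in {1 / p:,.0f}")
```

Because this uses the rounded statistic of 1.68 rather than the unrounded value, the result will differ slightly from the one in 108,282 quoted above, but it should be of the same order.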

Note: If I were to use the language of statistics, I would say our null hypothesis ($$H_0$$) has the sixteen poll results normally distributed around the population mean. Now if the null hypothesis is correct, I would expect the Chi-squared statistic to be in the range 4.6 to 32.8 (99 per cent of the time). However, as our Chi-squared statistic is outside this range, we reject the null hypothesis for the alternative hypothesis ($$H_a$$) that collectively, the poll results are not normally distributed.

### Why the difference?

It is interesting to speculate on why there is a difference between these two approaches. While both approaches suggest the poll results were statistically unlikely, the Chi-squared test says they are twice as unlikely as the first approach. I suspect the answer comes from the rounding the pollsters apply to their raw results. This impacts on the normality of the distribution of poll results. In the Chi-squared test, I did not look at rounding.

### So what went wrong?

There are really two questions here:
• Why were the polls under-dispersed; and
• On the day, why did the election result differ from the sixteen prior poll estimates?

To be honest, it is too early to tell with any certainty, for both questions. But we are starting to see statements from the pollsters that suggest where some of the problems may lie.

A first issue seems to be the increased use of online polls. There are a few issues here:
• Finding a random sample where all Australians have an equal chance of being polled - there have been suggestions that too many educated and politically active people are in the online samples.
• Resampling the same individuals from time to time - meaning the samples are not independent. (This may explain the lack of noise we see in polls in recent years.) If your sample is not representative, and it is used often, then all of your poll results would be skewed.
• An over-reliance on clever analytics and weights to try and make a pool of online respondents look like the broader population. These weights are challenging to keep accurate and reliable over time.
More generally, regardless of the polling methodology:
• the use of weighting, where some groups are under-represented in the raw sample frame, can mean that sample errors get magnified.
• not having quotas and weights for all the factors that align somewhat with cohort political differences can mean polls accidentally do not sample important constituencies.
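One common way to see how weighting can magnify sampling error is Kish's approximate effective sample size, which shrinks as the weights become more unequal. The sketch below uses an entirely hypothetical sample (the group sizes and population shares are invented for illustration, not taken from any pollster) in which a group making up 30 per cent of the population supplies only 10 per cent of respondents.

```python
def effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Hypothetical raw sample of 1000 respondents:
#   900 from a group that is 70% of the population (over-represented)
#   100 from a group that is 30% of the population (under-represented)
# Each weight is population share divided by sample share.
weights = [0.7 / 0.9] * 900 + [0.3 / 0.1] * 100

n_eff = effective_sample_size(weights)
print(round(n_eff))  # 692
```

In this invented case, a nominal sample of 1000 behaves more like a simple random sample of roughly 690, so its sampling error is larger than the headline sample size suggests.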

Like Kevin Bonham, I am not a fan of the following theories:
• Shy Tory voters - too embarrassed to tell pollsters of their secret intention to vote for the Coalition.
• A late swing after the last poll.

### Code snippet

To be transparent about how I approached this task, the Python code snippet follows.
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

import sys
sys.path.append( '../bin' )
plt.style.use('../bin/markgraph.mplstyle')

# --- Raw data
sample_sizes = (
    pd.Series([3008, 1000, 1842, 1201, 1265, 1644, 1079, 826,
               2003, 1207, 1000, 826, 2136, 1012, 707, 1697]))
measurements = ( # for Labor:
    pd.Series([51.5, 51,   51,   51.5, 52,   51,   52,   51,
               51,   52,   51,   51,  51,   52,   51,  52]))
roundings =   (
    pd.Series([0.25, 0.5,  0.5,  0.25, 0.5,  0.5,  0.5,  0.5,
               0.5,  0.5,  0.5,  0.5, 0.5,  0.5,  0.5, 0.5]))

# some pre-processing
Mean_Labor = measurements.mean()
Mean_Coalition = 100 - Mean_Labor
variances = (measurements * (100-measurements)) / sample_sizes
standard_deviations = pd.Series(np.sqrt(variances)) # sigma

print('Mean measurement: ', Mean_Labor)
print('Measurement counts:\n', measurements.value_counts())
print('Sample size range from/to: ', sample_sizes.min(),
      sample_sizes.max())
print('Mean sample size: ', sample_sizes.mean())

# --- Using normal distributions
print('-----------------------------------------------------------')
individual_probs = []
for sd, r in zip(standard_deviations.tolist(), roundings):
    # probability that this poll's published (rounded) result is 48 or 49
    individual_probs.append(stats.norm(Mean_Coalition, sd).cdf(49.0 + r) -
                            stats.norm(Mean_Coalition, sd).cdf(48.0 - r))

# print individual probabilities for each poll
print('Individual probabilities: ', individual_probs)

# product of all probabilities to calculate combined probability
probability = pd.Series(individual_probs).product()
print('Overall probability: ', probability)
print('1/Probability: ', 1/probability)

# --- Chi Squared - check normally distributed - two tailed test
print('-----------------------------------------------------------')
dof = len(measurements) - 1 ### degrees of freedom
print('Degrees of freedom: ', dof)
X = pow((measurements - Mean_Labor)/standard_deviations, 2).sum()
X_min = stats.distributions.chi2.ppf(0.005, df=dof)
X_max = stats.distributions.chi2.ppf(0.995, df=dof)
print('Expected X^2 between: ', round(X_min, 2), ' and ', round(X_max, 2))
print('X^2 statistic: ', X)
X_probability = stats.chi2.cdf(X , dof)
print('Probability: ', X_probability)
print('1/Probability: ', 1 / X_probability)

# --- Chi-squared plot
print('-----------------------------------------------------------')
x = np.linspace(0, X_min + X_max, 250)
y = pd.Series(stats.chi2(dof).pdf(x), index=x)

ax = y.plot()
ax.set_title(r'$\chi^2$ Distribution: degrees of freedom='+str(dof))
ax.axvline(X_min, color='royalblue')
ax.axvline(X_max, color='royalblue')
ax.axvline(X, color='orange')
ax.text(x=(X_min+X_max)/2, y=0.00, s='99% between '+str(round(X_min, 2))+
        ' and '+str(round(X_max, 2)), ha='center', va='bottom')
ax.text(x=X, y=0.01, s=r'$\chi^2 = '+str(round(X, 2))+'$',
        ha='right', va='bottom', rotation=90)

ax.set_xlabel(r'$\chi^2$')
ax.set_ylabel('Probability')

fig = ax.figure
fig.set_size_inches(8, 4)
fig.text(0.99, 0.0025, 'marktheballot.blogspot.com.au',
         ha='right', va='bottom', fontsize='x-small',
         fontstyle='italic', color='#999999')
fig.savefig('./Graphs/Chi-squared.png', dpi=125)
plt.close()

# --- some normal plots
print('-----------------------------------------------------------')
mu = Mean_Coalition

n = 707
low = 47.5
high = 49.5

sigma = np.sqrt((Mean_Labor * Mean_Coalition) / n)
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 200)
y = pd.Series(stats.norm.pdf(x, mu, sigma), index=x)

ax = y.plot()
ax.set_title('Distribution of samples: n='+str(n)+', μ='+
             str(mu)+', σ='+str(round(sigma,2)))
ax.axvline(low, color='royalblue')
ax.axvline(high, color='royalblue')
ax.text(x=low-0.5, y=0.05, s=str(round(stats.norm.cdf(low,
        loc=mu, scale=sigma)*100.0,1))+'%', ha='right', va='center')
ax.text(x=high+0.5, y=0.05, s=str(round((1-stats.norm.cdf(high,
        loc=mu, scale=sigma))*100.0,1))+'%', ha='left', va='center')
mid = str( round(( stats.norm.cdf(high, loc=mu, scale=sigma) -
                   stats.norm.cdf(low, loc=mu, scale=sigma) )*100.0, 1) )+'%'
ax.text(x=48.5, y=0.05, s=mid, ha='center', va='center')

ax.set_xlabel('Per cent')
ax.set_ylabel('Probability')

fig = ax.figure
fig.set_size_inches(8, 4)
fig.text(0.99, 0.0025, 'marktheballot.blogspot.com.au',
         ha='right', va='bottom', fontsize='x-small',
         fontstyle='italic', color='#999999')
fig.savefig('./Graphs/'+str(n)+'.png', dpi=125)
plt.close()

# ---
n = 3008
low = 47.75
high = 49.25

sigma = np.sqrt((Mean_Labor * Mean_Coalition) / n)
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 200)
y = pd.Series(stats.norm.pdf(x, mu, sigma), index=x)

ax = y.plot()
ax.set_title('Distribution of samples: n='+str(n)+', μ='+
             str(mu)+', σ='+str(round(sigma,2)))
ax.axvline(low, color='royalblue')
ax.axvline(high, color='royalblue')
ax.text(x=low-0.25, y=0.3, s=str(round(stats.norm.cdf(low,
        loc=mu, scale=sigma)*100.0,1))+'%', ha='right', va='center')
ax.text(x=high+0.25, y=0.3, s=str(round((1-stats.norm.cdf(high,
        loc=mu, scale=sigma))*100.0,1))+'%', ha='left', va='center')
mid = str( round(( stats.norm.cdf(high, loc=mu, scale=sigma) -
                   stats.norm.cdf(low, loc=mu, scale=sigma) )*100.0, 1) )+'%'
ax.text(x=48.5, y=0.3, s=mid, ha='center', va='center')

ax.set_xlabel('Per cent')
ax.set_ylabel('Probability')

fig = ax.figure
fig.set_size_inches(8, 4)