## Saturday, November 14, 2020

### Reflections on polling aggregation

After the 2019 Australian election, the Poll Bludger posted that "aggregators gonna aggregate." At the time I thought it was a fair call.

More recently, I have been pondering a paper by Xiao-Li Meng: Statistical Paradises and Paradoxes in Big Data: Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election. In a roundabout way, this paper is answering the usual question: why did the polls call the 2016 US election wrong?

A key issue explored in this paper is the extent to which the growing non-response rates in public opinion polling are degrading probabilistic sampling. Theoretically, very small probability samples, randomly selected, can be used to make reliable statements about a trait in the total population.

Non-response rates can adversely affect the degree to which a sample is a probability sample. This is of particular concern when there is a correlation between non-response and a variable under analysis (such as voting intention). Even a very small, non-zero correlation (say 0.005) can dramatically reduce the effective sample size. The problem is exacerbated by population size (something we can ignore when the sample is a true probability sample).

Making matters worse, this problem is not fixed by weighting sub-populations within the sample. Weighting only works when the non-response is not correlated with a variable of analytical interest.

Meng argues that the combined polls prior to the 2016 election account for one per cent of the US's eligible voting population (i.e. the combined sample is around 2.3 million people). However, given the non-response rate, the bias in that non-response rate, and the size of the US population, this becomes an effective sample size in the order of 400 eligible voters. An effective sample of 400 people has a 95% confidence interval of plus or minus 4.9 percentage points (not the 0.06 percentage points you would expect from a 2.3 million combined voter sample).
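As a rough check on these figures, Meng's paper gives an approximation for the effective sample size of a biased sample: $n_{eff} \approx \frac{f}{1-f} \times \frac{1}{\rho^2}$, where $f$ is the fraction of the population sampled and $\rho$ is the correlation between responding and the variable of interest (Meng calls this the data defect correlation). The sketch below plugs in $f = 0.01$ and the illustrative correlation of 0.005 mentioned above. The function names are illustrative, and the 50-50 split used for the margin of error is an assumption.

```python
import math

def effective_sample_size(sample_fraction, defect_correlation):
    """Approximate effective sample size of a biased sample,
    using n_eff ~ f / (1 - f) * 1 / rho^2 (after Meng 2018)."""
    f, rho = sample_fraction, defect_correlation
    return (f / (1 - f)) / rho ** 2

def margin_of_error(n, p=0.5, z=1.96):
    """95 per cent margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

# Assumed inputs: one per cent of the population sampled, correlation of 0.005.
n_eff = effective_sample_size(sample_fraction=0.01, defect_correlation=0.005)

print(f"Effective sample size: {n_eff:.0f}")
print(f"Margin of error at n_eff: {100 * margin_of_error(n_eff):.1f} points")
print(f"Margin of error at 2.3 million: {100 * margin_of_error(2_300_000):.2f} points")
```

Under these assumptions the effective sample size comes out at roughly 400, which is consistent with the figures quoted above.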

Isakov and Kuriwaki used Meng's theory and a Bayesian model written in Stan to aggregate battleground state polls prior to the 2020 US presidential election to account for pollster errors in 2016. Of necessity, this work made a number of assumptions (at least one of which turned out to be mistaken). Nonetheless, it suggested that the polls in key battleground states were one percentage point too favourable to Biden, and more importantly, the margin of error was about twice as wide as reported. Isakov and Kuriwaki's model was closer to the final outcome than a naive aggregation of the publicly available polling in battleground states.

According to Isakov and Kuriwaki, Meng's work suggests a real challenge for poll aggregators. Systemic polling errors are often correlated across pollsters. If this is not accounted for by the aggregator, then over-confident aggregations can mislead community expectations in anticipation of an election result, more so than the individual polls themselves.

#### References

Isakov, M., & Kuriwaki, S. 2020. Towards Principled Unskewing: Viewing 2020 Election Polls Through a Corrective Lens from 2016. Harvard Data Science Review.

Meng, Xiao-Li. 2018. Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. Annals of Applied Statistics 12 (2): 685–726.

## Thursday, November 12, 2020

### Report into 2019 polling failure

Yesterday, the Association of Market and Social Research Organisations, the national peak industry body for research, data and insights organisations in Australia, released its report into the polling failure for the 2019 Australian Federal Election.

Unfortunately, the Australian pollsters did not share their raw data with the AMSRO inquiry. So the inquiry was constrained. Nonetheless, it found:

The performance of the national polls in 2019 met the independent criteria of a ‘polling failure’ not just a ‘polling miss’. The polls: (1) significantly — statistically — erred in their estimate of the vote; (2) erred in the same direction and at a similar level; and (3) the source of error was in the polls themselves rather than a result of a last-minute shift among voters.
The Inquiry Panel could not rule out the possibility that the uncommon convergence of the polls in 2019 was due to herding.
Our conclusion is that the most likely reason why the polls underestimated the first preference vote for the LNP and overestimated it for Labor was because the samples were unrepresentative and inadequately adjusted. The polls were likely to have been skewed towards the more politically engaged and better educated voters with this bias not corrected. As a result, the polls over-represented Labor voters.

While the report was hampered by the limited cooperation from polling companies, it is well worth reading.

## Sunday, March 1, 2020

### Is non-probability sampling a problem?

I have posted several musings on the failure of the opinion polls to accurately forecast the 2019 Federal Election.

In this post, I will look at the seismic shift that has occurred in polling practice over the past decade or so. Where the pollsters once used probability sampling methods, they no longer do. Slowly, one by one, and in part then fully, they have embraced the wonderful world of non-probability sampling. This is actually a big deal.

But before we talk about the change and what it means, it is worth looking at why probability sampling is the gold standard in statistics. I will start with the foundation of inferential statistics: the Central Limit Theorem.

### The Central Limit Theorem

Please accept my apologies in advance: I will get a little (but hopefully not too) mathematical here. While the Central Limit Theorem (CLT) is called a theorem, it is actually a (series of) mathematical proof(s). I will spare you those proofs, but it is important to understand what the CLT promises, and the conditions where the CLT applies.

According to the CLT, a random sample drawn from a population will have a sample mean ($\bar{x}$) that approximates the mean of the parent population from which the sample has been drawn ($\mu$).
$$\bar{x}\approx\mu$$

The critical point here is that the CLT is axiomatic. Provided the pre-conditions of the mathematical proofs are met (we will come to this), this relationship between the sample mean and the population mean holds.

The CLT applies to numerical variables in a population (such as height or age). It applies to binomial population characteristics expressed as population proportions (such as the success or failure of a medical procedure). It also applies to multinomial population characteristics expressed as a set of population proportions (such as which political party a person would vote for if an election was held now: Labor, Coalition, Greens, One Nation, or another party).

The beauty of the CLT is that if multiple random samples are drawn from a population the means of those multiple random samples will be normally distributed (like a bell-curve) around the population mean, even if that characteristic in the population is not normally distributed. It is this property of a normal distribution of sample means, regardless of the population distribution, that allows us to infer the population mean with a known degree of confidence. It is because of the central limit theorem that opinion polling actually works.
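A small simulation can make this concrete. The sketch below draws many simple random samples from a population where the characteristic is just yes/no (which is about as far from a bell-curve as you can get) and looks at how the sample means are distributed. The population share of 52 per cent, the sample size and the number of simulated polls are all made-up values for illustration.

```python
import random
import statistics

rng = random.Random(42)      # fixed seed so the sketch is repeatable
TRUE_SHARE = 0.52            # hypothetical population proportion
SAMPLE_SIZE = 500            # respondents per simulated poll
NUM_POLLS = 2000             # number of simulated polls

def simulate_poll():
    """Return the sample proportion from one simple random sample."""
    hits = sum(rng.random() < TRUE_SHARE for _ in range(SAMPLE_SIZE))
    return hits / SAMPLE_SIZE

sample_means = [simulate_poll() for _ in range(NUM_POLLS)]

# Theory: the sample means should centre on the population value, with a
# standard deviation of sqrt(p * (1 - p) / n).
theory_sd = (TRUE_SHARE * (1 - TRUE_SHARE) / SAMPLE_SIZE) ** 0.5
within = sum(abs(m - TRUE_SHARE) <= 1.96 * theory_sd for m in sample_means)

print(f"mean of sample means: {statistics.mean(sample_means):.4f}")
print(f"sd of sample means:   {statistics.stdev(sample_means):.4f} (theory {theory_sd:.4f})")
print(f"share within 1.96 sd: {within / NUM_POLLS:.3f}")
```

If the sampling really is random and independent, roughly 95 per cent of the simulated sample means should land within 1.96 standard deviations of the population value.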

But there are a couple of key caveats:
• the sample must be randomly selected, and each random selection in the sample must be independent of any other selection in the sample. Every case in the population must have the same probability of being included in the sample. In practice, it is achieving the ideal of this first caveat where most opinion polling runs into trouble.
• the CLT works for all sample sizes if the population characteristic is normally distributed. But it only works with sufficiently large samples (typically $n\gt30$) for almost all non-normal distributions. There are a very small number of theoretical population distributions where the CLT does not work: such as the Cauchy distribution, which does not have a population mean. In practice, for social statistics, this last caveat is not a problem.

I mentioned above that the CLT allows me to make a statement about the population mean, with a known degree of confidence, based on the sample mean. This known degree of confidence is called the sampling error or the margin of error.

### Sampling error

We are all familiar with pollsters reporting their errors. For example, the most recent Newspoll includes the statement:

This survey was conducted between 19-22 February and is based on 1,512 interviews among voters conducted online. The maximum sampling error is plus or minus 2.5 percentage points.

Let's spend a little time unpacking what this statement means. For a probability sample, how closely a sample mean approximates the population mean depends on the sample size ($n$). The larger the sample size, the more closely a sample is likely to approximate the population mean. For multiple samples from the same population, the larger the sample size, the tighter those sample means will be distributed around the population mean.

For proportions, to calculate the sampling error, we first calculate the standard error of the proportion. We then apply a standard-score or z-score to indicate the size of the margin of error we are comfortable with. In social statistics, the typical practice is a 95 per cent confidence level, which corresponds to plus-or-minus 1.96 standard errors either side of the sample estimate:
$$Sampling\ Error=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}*(\pm{1.96})$$

We can apply this calculation to the sample size Newspoll reported above and confirm their published maximum sampling error. The maximum sampling error occurs when both proportions are equal: 50-50 if you will:
\begin{align}Sampling\ Error&=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}*(\pm{1.96}) \\&=\sqrt{\frac{0.5*0.5}{1512}}*(\pm{1.96}) \\&=\pm0.0252 \\&=\pm2.5\ percentage\ points\end{align}

A 2.5 percentage point margin of error means that, if the sample was truly randomly selected, the population mean is likely to be within a confidence interval ranging between plus 2.5 percentage points and minus 2.5 percentage points of the sample mean 95 per cent of the time. Given their headline report of 51 to 49 per cent of the two-party preferred vote share in Labor's favour, Newspoll is actually saying there are 19 chances in 20 that the population vote share for Labor is between 53.5 and 48.5 per cent. It is giving the same chances that the population vote share for the Coalition is between 51.5 and 46.5 per cent.
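The arithmetic above is easy to reproduce. The following is a minimal sketch: the function names are illustrative, and the interval uses the maximum (50-50) sampling error, as in the published statement, rather than the error at the reported proportion.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Sampling error for a proportion p from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

def confidence_interval(estimate_pct, n, z=1.96):
    """Interval in percentage points, using the maximum (p = 0.5) sampling error."""
    moe = 100 * margin_of_error(n, z=z)
    return estimate_pct - moe, estimate_pct + moe

moe_pct = 100 * margin_of_error(1512)
labor_low, labor_high = confidence_interval(51, 1512)
coalition_low, coalition_high = confidence_interval(49, 1512)

print(f"Maximum sampling error: {moe_pct:.1f} points")
print(f"Labor: {labor_low:.1f} to {labor_high:.1f}")
print(f"Coalition: {coalition_low:.1f} to {coalition_high:.1f}")
```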

There is some irony here. For this poll, Newspoll did not use a probability sample, where every member of the population of voting-aged Australians had an equal non-zero probability of being included in the sample. My probability of being polled by Newspoll was zero. I am not a member of the panel of respondents Newspoll uses. My probability of being polled by Essential is also zero, for much the same reason. And, if it was not a probability sample, it is impossible to find the sampling error (which is an artefact of probability sampling) or even to talk about confidence intervals.

But I get ahead of myself. I will come back to the current approach used by Newspoll and Essential in a minute. First, it is important to understand that when pollsters were using probability sampling, it was not perfect, and these imperfections necessitated adjustments.

### The challenge of constructing a robust probability sample

Before online polling, the two main forms of polling were face-to-face polling and telephone polling.

National face-to-face polling is expensive. To manage travel costs, pollsters used cluster sampling techniques: they randomly selected several polling locations, and then randomly visited several households in each location. This meant the ideal of independent random selection was breached. The second person polled was not selected totally at random. While the second person was partly selected at random, in part, they were selected because they lived near the first person polled. It is also worth noting that today the national building stock includes many more high-rise apartments in security buildings than it once did and polling people in these properties is challenging in practice.

When landline telephones became direct-dial and almost ubiquitous (sometime in the 1970s/1980s), pollsters used the pool of all possible telephone numbers or the telephone directory as a proxy for the Australian voting population. This practice was cheaper than face-to-face polling, and for a few decades at least, provided reasonably robust poll results. However, it was not perfect. Not every household had a phone. The probability of selection differed between multi-person and single-person households. People with active social lives or working shift-work were less likely to be polled than homebodies.

Pollsters addressed these problems by dividing their sample into mutually exclusive sub-cohorts (or strata) using criteria that are known to correlate with different propensities to vote in one way or another (such as gender, age, educational attainment, ethnicity, location, and/or income). The pollsters would then weight the various sub-cohorts (or strata) within the sample by their population proportion (using ABS census data). The pollsters argued that these weights would correct for any bias from having (say) too many old people in the response set.
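To show how this weighting works, here is a minimal sketch with two age strata. All of the numbers are made up for illustration (they are not from any actual poll), and real pollsters typically weight on several characteristics at once.

```python
# Hypothetical figures for illustration only: two age strata, with older
# people over-represented among respondents.
population_share = {"under 50": 0.5, "50 and over": 0.5}   # e.g. from census data
respondents = {"under 50": 200, "50 and over": 800}        # completed interviews
support = {"under 50": 0.6, "50 and over": 0.4}            # share backing party X

total = sum(respondents.values())

# Unweighted estimate: every respondent counts equally.
raw_estimate = sum(respondents[s] * support[s] for s in respondents) / total

# Each stratum's weight scales its sample share up or down to its population share.
weights = {s: population_share[s] / (respondents[s] / total) for s in respondents}
weighted_estimate = sum(
    weights[s] * respondents[s] * support[s] for s in respondents
) / sum(weights[s] * respondents[s] for s in respondents)

print(f"Unweighted: {raw_estimate:.2f}, weighted: {weighted_estimate:.2f}")
```

In this toy example the unweighted estimate is 0.44 and the weighted estimate is 0.50. Note that each under-50 respondent counts for two-and-a-half people, which is why heavy weighting comes at a cost in precision.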

Both polling methods also had issues of non-response. Where a person refused to participate in a survey, the issue would be addressed by sub-cohort weightings. Another form of non-response is when a respondent says they "don't know" who they would vote for. Pollsters assumed these "don't know" respondents would likely vote in the same proportions as those respondents that did have a voting preference. To the extent that the assumption was true, this would reduce the effective sample size and increase the margin of error, which in turn increases the variance associated with making an accurate population estimate. Where the assumption was false, it would bias the population estimate. Either way, "don't know" respondents increase the error associated with the published population estimate.

While such adjustments mostly worked back then, it is important to know that the more adjustments that are made, the less confidence we can have in the resulting population estimates.

In 2020, national face-to-face polling remains expensive. Telephone polling is much less reliable than it was only 15 years ago. There are several reasons for this (discussed in more detail in my previous post), but in summary:
• with the rise of mobile phones and voice-over-internet-protocols (VOIP), far fewer households have traditional landlines
• mobile phones are hard to poll randomly as the government does not publish a directory of mobile and VOIP numbers (and mobile and VOIP numbers are not structured to geography in the way landlines once were)
• with the rise of annoying robocalls - and the constant requests for phone, SMS and email feedback [please rate the service you received this morning] - the cultural willingness of people to answer surveys has declined substantially
• as response rates have declined, the cost of phone surveys has increased

With its inherently higher costs and modern building practice making face-to-face polling challenging, and the declining response rates and rising costs of telephone polling, the pollsters looked for another cost-effective approach. They have turned to non-probability quota sampling and online panel polls.

### Non-probability quota sampling

Quota sampling depends on the pollster putting together a sample by design. Typically, such a sample would have the same characteristics as the population. If just over half the population is women, then just over half the quota sample would be women. And so on for a series of other demographic characteristics such as age groups, educational attainment, urban/regional/rural location, etc. Typically pollsters use the latest Australian Bureau of Statistics (ABS) census data to construct the control quotas used in building the sample.

However, sample members are not selected from the population through a random process. Typically they are selected from a larger panel which was constituted by the pollster through (for example) web-based advertising, snowballing (where existing panel members are asked to identify people they know who might want to be on a panel), and other pollster-managed processes. The art (not science) here is for the pollster to construct a sample that looks like it was selected at random from the larger population. Note: even if a quota is constructed at random from a larger panel, it is still not a probability sample of the population.

Quota samples can be contrasted favourably with less robust non-probability sampling techniques such as:
• self-selecting samples, where the respondents choose whether to participate or not in a poll, and
• convenience samples, where the pollster/interviewer approaches respondents (typically close at hand), without any particular quotas to be met.

But quota sampling, even when done well, is less robust than good quality probability sampling. The sample frame for probability sampling is the entire population. The sample frame for quota sampling is a much smaller panel of respondents maintained by the pollster. Not all of the population has a chance of being polled with quota sampling. Does this make a difference? In my mind, there are two conflicting answers to this question.
• Yes, there is an important distinction. By seeking to apply the ideal of a probability sample, pollsters rely on the axioms of mathematics to make a population estimate. With a quota sample, there is no axiomatic relationship between the sample mean and the population mean. Strictly speaking, it is not justifiable to make a population estimate from a survey of a quota sample. While quota sampling may work somewhat in practice, the process is one step removed from being mathematically grounded, and it is much more susceptible to sample selection biases (which we will discuss further below).
• But this might be less important in practice. If (as happened with telephone polling) response rates become very small, and there are no cost-effective methods for implementing a robust probability sample, quota sampling might be a pragmatic alternative to probability-sample polling with a very low response rate. But this comes with a price: higher errors in poll estimates.

Nonetheless, quota sampling for internet polling is not without its challenges.

### Issues with quota sampling and internet polling

The first thing to note is that, according to Granwal (2019), 88 per cent of the Australian population is an active user of the internet. While this proportion might appear high, it is not 100 per cent. Furthermore, Thomas et al. (2018) have identified particular cohorts that are under-represented in their internet use:
• households with income under \$35,000/year (quintile 5)
• people whose only access to the internet is with mobile phones
• people aged 65 years or more
• people who did not complete secondary education
• people with a disability
• households with income between \$35,000 and \$60,000/year (quintile 4)
• people not in the labour force
• Indigenous Australians
• people aged between 50 and 64 years

Of necessity, this suggests there will be selection bias issues with any sampling approach that depends on data collection via the internet.

Furthermore, research by Heen, Lieberman and Miethe (2014) suggests that some polling organisations do poorly at constructing samples that reflect sub-cohort proportions in the population against standard demographic characteristics. It is unclear to what extent this might be a problem with election polling in Australia.

Frequent survey responders are another problem with quota sampling from a limited panel. There are a couple of dimensions to this problem. The first is that frequent serial responders may respond differently to other responders (for example, their views may become frozen if they frequently answer the same question, or they may devote less time to question answering over time). The second is that parallel frequent responders may exist because they lie and cheat to be listed multiple times under different aliases on multiple pollsters' panels (Whitsett 2013). Parallel frequent responders may do so for political reasons (to deliberately skew poll reports) or they may do this for the financial rewards associated with survey completion.

In the context of Australian election polling, the existence of frequent responders has been suggested as an explanation for the underdispersion that has been seen at times from pollsters who have adopted quota sampling approaches.

While pollsters have used financial incentives as a method to increase participation rates, this may skew participation towards low-income cohorts. It could also lead to a cluster of "professional respondents" in the data - frequent serial respondents who complete surveys for the money. Handorf et al. (2018) note that the offering of incentives increases repeat-responders, which in turn reduces the quality of survey data. De Leeuw and Matthisse (2016) note that professional respondents "are more often female, slightly less educated and less often gainfully employed".

Drawing a number of the above themes together, Lee-Young (2016) suggests that pollsters using online quota sampling need to weed out a range of "dishonest" respondents:
• professional competition entrants - the frequent responders noted above
• survey speeders - who complete the survey in record time (and with insufficient consideration)
• straight-liners - who give the same (Likert-scale) response to every question
• block-voters - who give the same answer as each other, to advocate through polling

Another list from Gut Check It (2016), suggests pollsters need to watch out for: professionals, rule-breakers, speeders, straight-liners, cheaters (survey bots and multiple alias responders), and posers (caught in social desirability bias).
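Some of these checks are straightforward to automate. The sketch below flags speeders and straight-liners in a toy set of responses. The data, the field names and the threshold (faster than 30 per cent of the median completion time) are all hypothetical choices for illustration, not a description of how any pollster actually screens its panel.

```python
from statistics import median

# Hypothetical responses: completion time in seconds and answers to five
# Likert-scale items (1-5). Field names and thresholds are illustrative.
responses = [
    {"id": "r1", "seconds": 310, "likert": [4, 2, 5, 3, 4]},
    {"id": "r2", "seconds": 45,  "likert": [3, 4, 2, 5, 1]},   # very fast
    {"id": "r3", "seconds": 280, "likert": [3, 3, 3, 3, 3]},   # same answer throughout
    {"id": "r4", "seconds": 350, "likert": [5, 4, 4, 2, 3]},
    {"id": "r5", "seconds": 295, "likert": [2, 1, 4, 4, 5]},
]

def is_speeder(response, typical_seconds, fraction=0.3):
    """Flag completions faster than a chosen fraction of the median time."""
    return response["seconds"] < fraction * typical_seconds

def is_straight_liner(response):
    """Flag respondents who gave one identical answer to every item."""
    return len(set(response["likert"])) == 1

typical = median(r["seconds"] for r in responses)
flagged = {
    r["id"]: [name for name, hit in (
        ("speeder", is_speeder(r, typical)),
        ("straight-liner", is_straight_liner(r)),
    ) if hit]
    for r in responses
}
kept = [r["id"] for r in responses if not flagged[r["id"]]]

print(flagged)
print("kept:", kept)
```

Detecting professionals, block-voters and multiple-alias responders is a much harder problem, as it generally needs information from outside a single survey.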

Because a quota sample is constructed by the pollster rather than randomly selected from the population, these problems are not problems typically faced in probability sampling. Without careful attention to how the quota sample is constructed, pollsters run a significant risk of increased bias.

### Conclusion

The question this post explores is whether the shift to non-probability sampling over the past fifteen years or so was a contributing factor to the polling failure before the 2019 Australian Federal Election. The short answer is: quite probably.

But we cannot blame pollsters for this and simply go back to the approaches that worked in the eighties, nineties, and noughties. Back then, it was relatively easy to construct telephone polls with robust probability sampling. However, the rise of mobile phones and VOIP technology, and changing community attitudes have disrupted the pollsters. As a result, the pollsters can no longer construct cheap and workable phone polls and probability frames with high response rates. And by embracing quota sampling, the pollsters have told us they believe that the high-quality polling accuracy we had enjoyed in Australia since the mid-1980s is no longer possible (or perhaps, no longer affordable).

While pollsters are caught between Scylla and Charybdis, we the poll-consuming public need to come to terms with the increased likelihood of more poll misses in the future. Today's polls have more inherent error than we are used to.

• Australian Bureau of Statistics, Sample Design, ABS Website, 2016.

## Saturday, February 22, 2020

### Aggregated poll update

It has been a couple of months since I last did an aggregated poll update. So it is time for a quick update. [Edit: charts updated on 24 February to include the latest Newspoll].

If we start with the two-party preferred (TPP) polls: six months after the previous election we are in the remarkable position of only having poll estimates from Newspoll. No other polling firm has published a national TPP estimate since the polling failure of the 2019 Federal Election. So there is nothing to aggregate.

We can, however, apply the model assumption that population voting intentions are slow to change over time. This will smooth the noise from the Newspoll TPP series. The result is a slow decline in the estimated TPP vote share for the Coalition since the last election, which accelerated over December/January.
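The idea that slow-moving voting intention can be separated from poll-to-poll noise can be sketched with a simple local-level Kalman filter. To be clear, this is a swapped-in, simpler technique and not the model behind the charts in this post. The poll series, the drift variance and the use of a 2.5 point margin of error to set the poll variance are all assumptions for illustration.

```python
def local_level_filter(polls, poll_variance, drift_variance):
    """Filter a poll series with a local-level (random walk plus noise) model."""
    level, level_variance = polls[0], poll_variance
    filtered = [level]
    for poll in polls[1:]:
        # Predict: voting intention drifts slowly, so uncertainty grows a little.
        level_variance += drift_variance
        # Update: move towards the new poll in proportion to the Kalman gain.
        gain = level_variance / (level_variance + poll_variance)
        level += gain * (poll - level)
        level_variance *= 1 - gain
        filtered.append(level)
    return filtered

# Hypothetical TPP series (per cent), not actual Newspoll results.
polls = [51.0, 50.0, 51.0, 49.0, 50.0, 48.0, 49.0, 47.0]
poll_variance = (2.5 / 1.96) ** 2    # from an assumed 2.5 point margin of error
drift_variance = 0.1                 # assumed: intentions change slowly

smoothed = local_level_filter(polls, poll_variance, drift_variance)
for raw, est in zip(polls, smoothed):
    print(f"poll {raw:.1f} -> estimate {est:.2f}")
```

The smaller the drift variance relative to the poll variance, the more heavily each new poll is discounted as noise, and the smoother the resulting line.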

On the primary vote side, the Newspoll series has been augmented by a single ANUpoll, from the Centre for Social Research and Methods at the Australian National University. Reading the ANUpoll report closely, it looks like there was an earlier poll in October 2019 as well, but I have not seen the details of that poll.

For the purposes of aggregation, I have assumed that the two polls are unbiased on average. That is to say, their house effects sum to zero. This is a big assumption. Furthermore, while we can technically aggregate the individual polls from ANUpoll and Newspoll, with only one poll from ANUpoll, I would urge caution. It is better to focus on the trends (up, down or stationary), rather than the actual numbers. In that vein, the Coalition's primary vote estimate is down. All of the other parties/groups are up since the 2019 election.

The house effect charts show the difference between Newspoll and ANUpoll, although it needs to be repeated that this result is based on just a single ANUpoll (which may or may not be an outlier). The ANUpoll was less favourable to the Coalition and Other parties, and more favourable to Labor and the Greens. Note: a house effect of -1.5 and 1.5 from the two firms for the Greens means these polling firms are on average three percentage points apart.

Finally, we turn to the attitudinal polling. Both Newspoll and Essential have published regular attitudinal estimates. In respect of the preferred Prime Minister, the key trend has been Scott Morrison's decline.

On the satisfaction side, we have also seen a decline in the satisfaction with the Prime Minister. The Prime Minister now has a net-negative satisfaction score.

I have used the polling data from Wikipedia to prepare these charts.

## Monday, January 20, 2020

### Polls, the bias-variance trade-off and the 2019 Election

Data scientists typically want to minimise the error associated with their predictions. Certainly, pollsters want their opinion poll estimates immediately prior to an election to reflect the actual election outcome as closely as possible.

However, this is no easy task. With many predictive models, there is often a trade-off between the error associated with bias and the error associated with variance. In data science, and particularly in the domain of machine learning, this phenomenon is known as the bias-variance trade-off or the bias-variance dilemma.

According to the bias-variance trade-off model, less than optimally tuned predictive models often either deliver high bias and low variance predictions, or the opposite: low bias and high variance predictions. At this stage, a chart might help with your intuition, and I will take a little time to define what I mean by bias and variance.

### Two types of prediction error: bias and variance

Bias tells us how closely our predictions are typically, or on average, aligned with the true value for the population or the actual outcome. In terms of the above bullseye diagram, how closely do our predictions on average land in the very centre of the bullseye?

Variance tells us about the variability in our predictions. It tells us how tightly our individual predictions cluster. In other words, variance indicates how far and wide our predictions are spread. Do they span a small range or a large range?

In our 2x2 bullseye grid above, a predictive model (or in the case of the 2019 opinion polls a system of predictive models) can deliver predictions with high or low variance and high or low bias.
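In numbers, the two measures can be calculated as follows. This is a minimal sketch with two sets of made-up predictions around a made-up true value of 50: one set that is tightly clustered but off-target, and one that is centred on the target but widely scattered.

```python
from statistics import mean, pvariance

TRUE_VALUE = 50.0   # hypothetical actual outcome

# Hypothetical predictions, for illustration only.
tight_but_off = [52.9, 53.1, 53.0, 52.8, 53.2]      # high bias, low variance
scattered_on_target = [47.0, 53.0, 50.5, 48.5, 51.0]  # low bias, high variance

def bias(predictions, true_value):
    """Average prediction minus the true value."""
    return mean(predictions) - true_value

def variance(predictions):
    """Spread of the predictions around their own average."""
    return pvariance(predictions)

for name, preds in [("tight but off", tight_but_off),
                    ("scattered on target", scattered_on_target)]:
    print(f"{name}: bias {bias(preds, TRUE_VALUE):+.2f}, variance {variance(preds):.2f}")
```

Note that the variance is calculated around the average of the predictions, not around the true value. That is why a set of predictions can be tightly clustered and still badly wrong.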

### In 2019, the opinion polls exhibited high bias and low variance

My contention is that collectively, the opinion polling in the six-week lead-up to the 2019 Australian Federal election exhibited high bias and low variance. It was in the top left-hand quadrant of the bullseye diagram above.

As evidence for this contention, I would note: Kevin Bonham has argued that by Australian standards, the opinion polls leading up to the 2019 election had their biggest miss since the mid-1980s. Also, I have previously commented on the extraordinarily low variance among the opinion polls prior to the 2019 election.

### Changes in statistical practice and social realities

Before we come to the implications of the high bias/low variance polling outcome in 2019, it is worth reflecting how polling practice has been driven to change in recent times. In particular, I want to highlight the increased importance of data analytics in the production of opinion poll estimates.

Newspoll was famous for introducing an era of reliable and accurate polling in the Australian context. With the near-universal penetration of landline telephones to Australian households by the 1980s, pollsters were able to develop sample frames that came close to the Holy Grail of giving every member of the Australian voting public an equal chance of being included in any survey the pollster ran.

While pollsters came close, their sampling frames were not perfect, and weightings and adjustments were still needed for under-represented groups. Nonetheless, the extent to which weightings and analytics were needed to make up for shortfalls in the sample frame was small compared with today.

So what has changed since the mid-1980s? Quite a few things. Here are but a few:

• the general population use of landlines has declined. In some cases, the plain old telephone system (POTS) has been replaced by a voice over internet protocol (VOIP) service. Many households have terminated their landline service altogether, in favour of the mobile phone. Younger people in share houses are very unlikely to have a shared telephone landline or even a VOIP service. Further, mobile and VOIP numbers are not listed in the White Pages by default. As a consequence, constructing a telephone-based sample frame is much more challenging than it once was.
• the rise of digital technologies has seen a growth in robocalls, which many people find annoying. Caller identification on phones has given rise to the practice of call screening. If a polling call is taken, busy people and those not interested in politics often just hang up. As a result of these and other factors, participation rates in telephone polls have plummeted. In the United States, Pew Research reported a decline in telephone survey responses from 36 per cent in 1997 to just 6 per cent in 2018. Newspoll's decision to abandon telephone polling suggests something similar may have happened in Australia.
• the newspapers that purchase much of the publicly available opinion polling have had their budgets squeezed, and they are wanting to spend less on polling. Ironically, lower telephone survey response rates (as noted above) are driving up the costs of telephone surveys. As a consequence, pollsters have had to turn to lower-cost (and more challenging) sampling techniques, such as internet surveys.
• while more people may be online these days (as Newspoll argues), online participation remains less widespread than landlines in the mid-1980s.

Because of the above, pollsters need to work much harder to address the short-comings in their sampling frame. Pollsters need to make far more data-driven adjustments to today's raw poll results than happened in the mid-1980s. This work is far more complex than simply weighting for sub-cohort population shares. Which brings us back to the bias-variance trade-off.

### What the bias-variance trade-off outcome suggests

The error for a predictive model can be thought of as the sum of the squared differences between the true values and the values predicted by the model:

$$\text{Error} = \sum (y_{\text{predicted}} - y_{\text{true}})^2$$

The critical insight from the bias-variance trade-off is that this error can be decomposed into three components:

$$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

We have already discussed bias and variance. The irreducible error is that error that cannot be removed by developing a better predictive model. All predictive models will have some noise-related error.
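For readers who want the algebra, the decomposition can be written a little more formally. This is a sketch under the usual textbook assumptions, and the notation is introduced here purely for illustration: the data are generated as $y = f(x) + \varepsilon$, where the noise $\varepsilon$ has mean zero and variance $\sigma^2$, and $\hat{f}(x)$ is the model's prediction, which changes with the particular sample used to fit the model. Averaging over repeated samples, the expected squared error at a point $x$ is:

$$E\left[(y - \hat{f}(x))^2\right] = \left(E[\hat{f}(x)] - f(x)\right)^2 + E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right] + \sigma^2$$

The first term is the squared bias (how far the average prediction sits from the truth), the second is the variance (how much individual predictions scatter around their own average), and the third is the irreducible error.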

Simple models tend to underfit the data. They do not capture enough of the information or "signal" available in the data. As a result, they tend to have low variance, but high bias.

Complex models tend to overfit the data. Because they typically have many parameters, they tend to treat noise in the data as the signal. This tends to result in predictive models that have low bias but high variance (driven by the noise in the data).

Again, to help develop our intuition (and to avoid the complex mathematics), let's turn to a visual aid to get a sense of what I mean by under-fitting and over-fitting a predictive model to the data.

The data below (the red dots) comes from a randomly perturbed sine wave between 0 and $\pi$. I have three models that look at those red dots (the perturbed data) and try to predict the form or shape of the underlying curve:
• a simple linear regression - which predicts a straight line
• a quadratic regression - which predicts a parabola
• a high-order polynomial regression - which can predict quite a complicated line

The linear regression (the straight blue line, top left) provides an under-fitted model for the curve evident in the data. The quadratic regression (the inverted parabola, bottom left) provides a reasonably good model. It is the closest of the models to a sine wave between the values of 0 and $\pi$. And the polynomial regression of degree 25 (the squiggly blue line, bottom right) over-fits the data and reflects some of the random noise introduced by the perturbation process.
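The three fits described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the charts: the number of points (50), the Gaussian noise with a standard deviation of 0.2, and the random seed are all assumptions, as the original perturbation settings are not stated.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

# Assumed set-up: 50 evenly spaced points, Gaussian noise with sd 0.2
N_POINTS, NOISE_SD = 50, 0.2
x = np.linspace(0, np.pi, N_POINTS)
y_true = np.sin(x)                                   # the underlying curve
y_obs = y_true + rng.normal(0, NOISE_SD, N_POINTS)   # the perturbed "red dots"

degrees = {"linear": 1, "quadratic": 2, "polynomial (degree 25)": 25}
fit_error = {}
for name, degree in degrees.items():
    model = Polynomial.fit(x, y_obs, degree)   # least-squares polynomial fit
    y_hat = model(x)
    train_sse = np.sum((y_hat - y_obs) ** 2)   # error against the noisy data
    true_sse = np.sum((y_hat - y_true) ** 2)   # error against the sine curve
    fit_error[name] = (train_sse, true_sse)
    print(f"{name:>24}: SSE vs data={train_sse:.3f}, SSE vs sine={true_sse:.3f}")
```

The error against the noisy data can only fall as the polynomial degree rises, because each model contains the simpler ones as special cases. The error against the underlying sine curve is the one that matters for prediction, and that is where over-fitting shows up.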

The next chart compares the three blue models above, with the actual sine wave before the perturbation. The closer the blue line is to the red line, the less error the model has.

The model with the least error between the predicted line and the sine curve is the parabola. The straight line and the squiggly line both deviate from the actual sine curve for most of their length.

When we run these three predictive models many times we can see something about the type of error for these three models. In the next chart, we have generated a randomly perturbed set of points 1000 times (just like we did above) and asked each of the models to predict the underlying line 1000 times. Each of the predictions from the 1000 model runs has then been plotted on the next chart. For reference, the sine curve the models are trying to predict is over-plotted in light-grey.

Looking at the multiple runs, the bias and variance for the distribution of predictions from the three models become clear:

• If we look at the blue linear model in the top left, we see high bias (the straight-line predictions from this model have significant points of deviation from the sine curve). But the 1000 model predictions are tightly clustered showing low variance across the predictions. The distribution of predictions from the under-fitted model shows high bias and low variance.
• The green quadratic model in the bottom left exhibits low bias. The 1000 model predictions closely follow the sine curve. They are also tightly clustered, showing low variance. The distribution of predictions from the optimally-fitted model shows low bias and low variance. These predictions have the least total error.
• The red high-order polynomial model in the bottom right exhibits low bias: on average, the 1000 predictions are close to the sine curve. However, the multiple runs exhibit quite a spread of predictions compared with the other two models. The distribution of predictions from the over-fitted model shows low bias, but high variance.
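The repeated-runs experiment can be approximated as follows. Again, this is a sketch under assumed settings (50 points per run, noise with a standard deviation of 0.2, and an arbitrary seed), not the code that produced the charts. For each model it estimates the squared bias (how far the average prediction is from the sine curve) and the variance (how much the predictions scatter around their own average).

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2019)
N_RUNS, N_POINTS, NOISE_SD = 1000, 50, 0.2   # N_POINTS and NOISE_SD are assumptions
x = np.linspace(0, np.pi, N_POINTS)
y_true = np.sin(x)

def bias_variance(degree):
    """Return (bias squared, variance), each averaged over the x grid."""
    preds = np.empty((N_RUNS, N_POINTS))
    for run in range(N_RUNS):
        y_obs = y_true + rng.normal(0, NOISE_SD, N_POINTS)   # fresh perturbed data
        preds[run] = Polynomial.fit(x, y_obs, degree)(x)      # this run's prediction
    mean_pred = preds.mean(axis=0)                   # average prediction at each x
    bias_sq = np.mean((mean_pred - y_true) ** 2)     # distance of the average from truth
    variance = np.mean(preds.var(axis=0))            # spread of runs around the average
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in (1, 2, 25)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree {degree:>2}: bias^2={bias_sq:.4f}, variance={variance:.4f}, "
          f"total={bias_sq + variance:.4f}")
```

The totals printed here leave out the irreducible error, which is the same for all three models and so does not change their ranking.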

Coming back to the three-component cumulative error equation above, we can see the least total error occurs at an optimal point between a simple and complex model. Again, let's use a stylised chart to aid intuition.

As model complexity increases, the prediction error associated with bias decreases, but the prediction error from variance increases. This minimal total error sweet-spot (between simplicity and complexity) is the key insight of the bias-variance trade-off model. This is the best trade-off between the error from prediction bias and the error from prediction variance.
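For polynomial regression on a fixed set of x values, the two error components can be calculated exactly rather than simulated, which makes it easy to trace the curves in the stylised chart. The sketch below is an illustration under assumed settings (50 evenly spaced points, noise standard deviation of 0.2). It relies on two standard least-squares results: the average fitted curve is the projection of the true curve onto the polynomial space, and the average prediction variance is the noise variance multiplied by the number of fitted coefficients and divided by the number of points.

```python
import numpy as np
from numpy.polynomial import chebyshev

N_POINTS, NOISE_SD = 50, 0.2            # assumptions, not taken from the charts
x = np.linspace(0, np.pi, N_POINTS)
f = np.sin(x)
u = 2 * x / np.pi - 1                   # rescale x to [-1, 1] for numerical stability

def expected_error(degree):
    """Exact bias^2 and variance (averaged over x) for a least-squares polynomial fit."""
    basis = chebyshev.chebvander(u, degree)   # columns span polynomials up to `degree`
    q, _ = np.linalg.qr(basis)                # orthonormal basis for the same space
    mean_fit = q @ (q.T @ f)                  # the fit averaged over all noise draws
    bias_sq = np.mean((mean_fit - f) ** 2)
    variance = NOISE_SD ** 2 * (degree + 1) / N_POINTS   # sigma^2 * (coefficients / points)
    return bias_sq, variance

degrees = range(1, 11)
errors = {d: expected_error(d) for d in degrees}
best = min(degrees, key=lambda d: sum(errors[d]))
for d in degrees:
    b, v = errors[d]
    print(f"degree {d:>2}: bias^2={b:.5f}  variance={v:.5f}  total={b + v:.5f}")
print("least total error at degree", best)
```

With these assumed settings the squared bias drops sharply between the straight line and the parabola and is negligible thereafter, while the variance climbs steadily with every extra coefficient, so the minimum total error falls at the quadratic. Different noise levels or sample sizes would move the sweet spot.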

### So what does this mean for the polls before the 2019 election?

The high bias and low variance in the 2019 poll results suggest that collectively the pollsters had not done enough to ensure their polling models produced optimal election outcome predictions. In short, their predictive models were not complex enough to adequately overcome the sampling frame problems (noted above) that they were wrestling with.

Without transparency from the pollsters, it is hard to know specifically where the pollsters' approach lacked the necessary complexity.

Nonetheless, I have wondered to what extent the pollsters use each other (either purposefully or through confirmation bias) to benchmark their predictions and adjust for the (perceived) bias that arises from issues with their sampling frames. In effect, mutual cross-validation, should it be occurring, would effectively simplify their collective models.

I have also wondered whether the pollsters look at non-linearity effects between their sample frame and changes in voting intention in the Australian public. If the samples that the pollsters are working with are more educated and more rusted on to a political party on average, their response to changing community sentiment may be muted compared with the broader community. For example, a predictive model that has been validated when community sentiment is at (say) 47 or 48 per cent, may not be reliable when community sentiment moves to 51 per cent.

I have wondered whether the pollsters built rolling averages or some other smoothing technique into their predictive models. Again, if so, this could be a source of simplicity.

I have wondered whether the pollsters are cycling through the same survey respondents over and over again. Again, if so, this could be a source of simplicity.

I have many questions as to what might have gone wrong in 2019. The bias-variance trade-off model provides some suggestion as to what the nature of the problem might have been.

## Friday, December 6, 2019

### Aggregated attitudinal polling

At this point in the election cycle, only Newspoll is publishing primary vote share and two-party preferred population estimates. So there is nothing to aggregate across polling houses when it comes to voting intention. However, both Essential and Newspoll are publishing attitudinal polling. So I decided to build a Dirichlet-multinomial process model to see what trends there are in the attitudinal polling since the 2019 election.

First, however, we will look at the output from the model, before looking at the model itself.

Let's begin with the preferred prime minister polling. We see a small dip in the proportion of the population preferring the Prime Minister over the period (from 45.4 to 44.9 per cent). The Opposition Leader has improved a little over the period (from 26 to 29 per cent), but he is much less preferred than the Prime Minister. The "undecideds" have declined a little (from 29 to 26 per cent).

The median lines from the above charts can be combined on a chart as follows.

The model allows us to compare house effects in preferred Prime Minister polling. Those polled by Essential are more likely to nominate a preferred prime minister compared with the other two houses.

The next set of charts are about satisfaction with the Prime Minister's performance. Satisfaction with the Prime Minister has declined from 48 to 45 per cent. Dissatisfaction has increased from 37 to 44 per cent.

Satisfaction with the Opposition Leader has improved from 37 to 38 per cent. Dissatisfaction has increased from 30 to 36 per cent. Undecideds have decreased from 32 to 25 per cent.

In summary, both leaders have seen a decline in net satisfaction. On this metric, the Prime Minister has fallen further than the Opposition Leader. The Opposition Leader ends the year with a higher net satisfaction rating compared with the Prime Minister.
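The net satisfaction arithmetic behind that summary is simple enough to check by hand. The short sketch below uses only the satisfaction and dissatisfaction figures quoted above, and takes net satisfaction to mean the satisfied share minus the dissatisfied share.

```python
# (satisfied, dissatisfied) percentages quoted above, at the start and end of the period
ratings = {
    "Prime Minister": {"start": (48, 37), "end": (45, 44)},
    "Opposition Leader": {"start": (37, 30), "end": (38, 36)},
}

net = {}
for leader, r in ratings.items():
    start = r["start"][0] - r["start"][1]   # net satisfaction at the start
    end = r["end"][0] - r["end"][1]         # net satisfaction at the end
    net[leader] = (start, end, end - start)
    print(f"{leader}: net {start:+d} -> {end:+d} (change {end - start:+d})")
```

The Prime Minister moves from +11 to +1, and the Opposition Leader from +7 to +2.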

The model that produced the above charts is as follows.

// STAN: Simplex Time Series Model
//  using a Dirichlet-multinomial process

data {
// data size
int<lower=1> n_polls;
int<lower=1> n_days;
int<lower=1> n_houses;
int<lower=1> n_categories;

// key variables
int<lower=1> pseudoSampleSize; // maximum sample size for y
real<lower=1> transmissionStrength;

// give a rough idea of a starting point ...
simplex[n_categories] startingPoint; // rough guess at series starting point
int<lower=1> startingPointCertainty; // strength of guess - small number is vague

// poll data
int<lower=0,upper=pseudoSampleSize> y[n_polls, n_categories]; // a multinomial
int<lower=1,upper=n_houses> house[n_polls]; // polling house
int<lower=1,upper=n_days> poll_day[n_polls]; // day polling occurred
}

parameters {
simplex[n_categories] hidden_voting_intention[n_days];
}

transformed parameters {
for(p in 1:n_categories) // included parties sum to zero
for(h in 1:n_houses) // included parties sum to zero
}

model{
// -- house effects model
for(h in 1:n_houses)

// -- temporal model
hidden_voting_intention[1] ~ dirichlet(startingPoint * startingPointCertainty);
for (day in 2:n_days)
hidden_voting_intention[day] ~
dirichlet(hidden_voting_intention[day-1] * transmissionStrength);

// -- observed data model
for(poll in 1:n_polls)
y[poll] ~ multinomial(hidden_voting_intention[poll_day[poll]] +
}

The model assumes that house effects sum to zero (both across polling houses and across the simplex categories). I set the startingPointCertainty variable to 10. The prior on the startingPoint is 0.333 for each series. The day-to-day transmissionStrength is set to 50,000 (attitudes today are much the same as they were yesterday). The pseudoSampleSize is set to 1000.
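To get a feel for what a transmissionStrength of 50,000 implies, it helps to look at how far a single Dirichlet step can move. The sketch below is an illustration only: the three shares in `p` are round numbers chosen to resemble the preferred prime minister series, not model output. It compares the textbook formula for the standard deviation of a Dirichlet component with a quick simulation.

```python
import numpy as np

transmission_strength = 50_000        # value quoted above
p = np.array([0.45, 0.29, 0.26])      # illustrative shares only; they sum to one

# For x ~ Dirichlet(p * strength): E[x_i] = p_i and
# Var[x_i] = p_i * (1 - p_i) / (strength + 1)
analytic_sd = np.sqrt(p * (1 - p) / (transmission_strength + 1))

rng = np.random.default_rng(0)
draws = rng.dirichlet(p * transmission_strength, size=20_000)   # one-day steps from p
simulated_sd = draws.std(axis=0)

print("analytic sd (percentage points): ", np.round(100 * analytic_sd, 3))
print("simulated sd (percentage points):", np.round(100 * simulated_sd, 3))
```

For a share of 45 per cent, the day-to-day standard deviation works out at roughly 0.22 percentage points. A smaller transmissionStrength would let the hidden series move more freely from one day to the next.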

As usual, the data for this analysis has been sourced from Wikipedia.

## Saturday, June 29, 2019

### Three anchored models

I have three anchored models for the period 2 July 2016 to 18 May 2019. The first is anchored to the 2016 election result (left anchored). The second model is anchored to the 2019 election result (right anchored). The third model is anchored to both election results (left and right anchored).  Let's look at these models.

The first thing to note is that the median lines in the left-anchored and right-anchored models are very similar. It is pretty much the same line moved up or down by 1.4 percentage points. As we have discussed previously, this difference of 1.4 percentage points is effectively a drift in the collective polling house effects over the period from 2016 to 2019. The polls opened after the 2016 election with a collective 1.7 percentage point pro-Labor bias. This bias grew by a further 1.4 percentage points to reach 3.1 percentage points at the time of the 2019 election (the difference between the yellow line and the blue/green lines on the right-hand side of the last chart above).

The third model, the left-and-right anchored model, forces this drift to be reconciled within the model (but without any guidance from the model). The left-and-right anchored model explicitly assumes there is no such drift (ie. house effects are constant and unchanging). In handling this unspecified drift, the left-and-right anchored model has seen much of the adjustment occur close to the two anchor points at the left and right extremes of the chart. The shape of the middle of the chart is not dissimilar to the singularly anchored charts.

While this is the output for the left-and-right anchored model, I would advise caution in assuming that the drift in polling house effects actually occurred in the period immediately after the 2016 election and immediately prior to the 2019 election. It is just that this is the best mathematical fit for a model that assumes there has been no drift. The actual drift could have happened slowly over the entire period, or quickly at the beginning, somewhere in the middle, or towards the end of the three year period.

My results for the left-and-right anchored model are not dissimilar to those of Jackman and Mansillo. The differences between our charts are largely a result of how I treat the day-to-day variance in voting intention (particularly following the polling discontinuity associated with the leadership transition from Turnbull to Morrison). I chose to specify this variance, rather than model it as a hyper-prior. I specified this parameter because: (a) we can observe higher volatility immediately following discontinuity events, and (b) the sparse polling results in Australia, especially in the 2016-19 period, produce an under-estimate for this variance in this model.
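To illustrate what specifying the day-to-day variance means in practice, the sketch below simulates the temporal part of the models forward in Python. The two standard deviations (0.0015 normally, 0.0045 in the volatile window after the discontinuity) are taken from the Stan code later in this post. Everything else is a placeholder chosen for illustration: the number of days, the positions of the discontinuity and stability points, the starting value and the size of the disruption are not the real inputs.

```python
import numpy as np

rng = np.random.default_rng(1)

sigma, sigma_volatile = 0.0015, 0.0045   # day-to-day sds from the Stan code
# Placeholder inputs (hypothetical, for illustration only)
n_days, discontinuity, stability = 1050, 785, 815
start_anchor, disruption = 0.50, -0.03
n_paths = 2000

# Step s (0-based) moves the series from day s+1 to day s+2 in Stan's 1-based days
step_sd = np.full(n_days - 1, sigma)
step_sd[discontinuity - 1:stability - 1] = sigma_volatile   # volatile window after the event
step_mean = np.zeros(n_days - 1)
step_mean[discontinuity - 2] = disruption                   # one-off jump on discontinuity day

steps = step_mean + rng.normal(0, step_sd, size=(n_paths, n_days - 1))
paths = start_anchor + np.cumsum(steps, axis=1)             # vote share on days 2..n_days
paths = np.column_stack([np.full(n_paths, start_anchor), paths])

# How far the random walk alone can wander by the final day
analytic_end_sd = np.sqrt(np.sum(step_sd ** 2))
print(f"analytic sd on final day:  {analytic_end_sd:.4f}")
print(f"simulated sd on final day: {paths[:, -1].std():.4f}")
```

Because a random walk's spread grows with the square root of the number of steps, even a small daily standard deviation leaves the series free to move several percentage points over three years. The polls and the anchors are what pin the hidden vote share down.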

All three models have a very similar result for the discontinuity event itself: an impact just under three percentage points. Note: these charts are not in percentage points, but vote shares.

And just to complete the analysis, let's look at the house effects. With all of these house effects, I would urge caution. These house effects are an artefact of the best fit in models that do not allow for the 1.4 percentage point drift in collective house effects that occurred between 2016 and 2019.

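The Stan code for the three models is almost identical. The listings differ only in whether the first day, the last day, or both are anchored to an election result.
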
// STAN: Two-Party Preferred (TPP) Vote Intention Model
//     - Fixed starting-point

data {
// data size
int n_polls;
int n_days;
int n_houses;

// assumed standard deviation for all polls
real pseudoSampleSigma;

// poll data
vector[n_polls] y; // TPP vote share
int house[n_polls];
int day[n_polls];

// period of discontinuity event
int discontinuity;
int stability;

// election outcome anchor point
real start_anchor;
}

transformed data {
// fixed day-to-day standard deviation
real sigma = 0.0015;
real sigma_volatile = 0.0045;

// house effect range
real lowerHE = -0.07;
real upperHE = 0.07;

// tightness of anchor points
real tight_fit = 0.0001;
}

parameters {
vector[n_days] hidden_vote_share;
vector[n_houses] pHouseEffects;
real disruption;
}

model {
// -- temporal model [this is the hidden state-space model]
disruption ~ normal(0.0, 0.15); // PRIOR
hidden_vote_share[1] ~ normal(start_anchor, tight_fit); // ANCHOR

hidden_vote_share[2:(discontinuity-1)] ~
normal(hidden_vote_share[1:(discontinuity-2)], sigma);

hidden_vote_share[discontinuity] ~
normal(hidden_vote_share[discontinuity-1]+disruption, sigma);

hidden_vote_share[(discontinuity+1):stability] ~
normal(hidden_vote_share[discontinuity:(stability-1)], sigma_volatile);

hidden_vote_share[(stability+1):n_days] ~
normal(hidden_vote_share[stability:(n_days-1)], sigma);

// -- house effects model - uniform distributions
pHouseEffects ~ uniform(lowerHE, upperHE); // PRIOR

// -- observed data / measurement model
y ~ normal(pHouseEffects[house] + hidden_vote_share[day],
pseudoSampleSigma);
}

// STAN: Two-Party Preferred (TPP) Vote Intention Model
//     - Fixed end-point only

data {
// data size
int n_polls;
int n_days;
int n_houses;

// assumed standard deviation for all polls
real pseudoSampleSigma;

// poll data
vector[n_polls] y; // TPP vote share
int house[n_polls];
int day[n_polls];

// period of discontinuity event
int discontinuity;
int stability;

// election outcome anchor point
real end_anchor;
}

transformed data {
// fixed day-to-day standard deviation
real sigma = 0.0015;
real sigma_volatile = 0.0045;

// house effect range
real lowerHE = -0.07;
real upperHE = 0.07;

// tightness of anchor points
real tight_fit = 0.0001;
}

parameters {
vector[n_days] hidden_vote_share;
vector[n_houses] pHouseEffects;
real disruption;
}

model {
// -- temporal model [this is the hidden state-space model]
disruption ~ normal(0.0, 0.15); // PRIOR
hidden_vote_share[1] ~ normal(0.5, 0.15); // PRIOR

hidden_vote_share[2:(discontinuity-1)] ~
normal(hidden_vote_share[1:(discontinuity-2)], sigma);

hidden_vote_share[discontinuity] ~
normal(hidden_vote_share[discontinuity-1]+disruption, sigma);

hidden_vote_share[(discontinuity+1):stability] ~
normal(hidden_vote_share[discontinuity:(stability-1)], sigma_volatile);

hidden_vote_share[(stability+1):n_days] ~
normal(hidden_vote_share[stability:(n_days-1)], sigma);

// -- house effects model - uniform distributions
pHouseEffects ~ uniform(lowerHE, upperHE); // PRIOR

// -- observed data / measurement model
y ~ normal(pHouseEffects[house] + hidden_vote_share[day],
pseudoSampleSigma);
end_anchor ~ normal(hidden_vote_share[n_days], tight_fit); //ANCHOR
}

// STAN: Two-Party Preferred (TPP) Vote Intention Model
//     - Fixed starting-point and end-point

data {
// data size
int n_polls;
int n_days;
int n_houses;

// assumed standard deviation for all polls
real pseudoSampleSigma;

// poll data
vector[n_polls] y; // TPP vote share
int house[n_polls];
int day[n_polls];

// period of discontinuity event
int discontinuity;
int stability;

// election outcome anchor point
real start_anchor;
real end_anchor;
}

transformed data {
// fixed day-to-day standard deviation
real sigma = 0.0015;
real sigma_volatile = 0.0045;

// house effect range
real lowerHE = -0.07;
real upperHE = 0.07;

// tightness of anchor points
real tight_fit = 0.0001;
}

parameters {
vector[n_days] hidden_vote_share;
vector[n_houses] pHouseEffects;
real disruption;
}

model {
// -- temporal model [this is the hidden state-space model]
disruption ~ normal(0.0, 0.15); // PRIOR
hidden_vote_share[1] ~ normal(start_anchor, tight_fit); // ANCHOR

hidden_vote_share[2:(discontinuity-1)] ~
normal(hidden_vote_share[1:(discontinuity-2)], sigma);

hidden_vote_share[discontinuity] ~
normal(hidden_vote_share[discontinuity-1]+disruption, sigma);

hidden_vote_share[(discontinuity+1):stability] ~
normal(hidden_vote_share[discontinuity:(stability-1)], sigma_volatile);

hidden_vote_share[(stability+1):n_days] ~
normal(hidden_vote_share[stability:(n_days-1)], sigma);

// -- house effects model - uniform distributions
pHouseEffects ~ uniform(lowerHE, upperHE); // PRIOR

// -- observed data / measurement model
y ~ normal(pHouseEffects[house] + hidden_vote_share[day],
pseudoSampleSigma);
end_anchor ~ normal(hidden_vote_share[n_days], tight_fit); //ANCHOR
}

Update: Kevin Bonham is also exploring what public voting intention might have looked like during the 2016-19 period.