Mark the Ballot: Is non-probability sampling a problem?

I have posted several musings on the failure of the opinion polls to accurately forecast the 2019 Federal Election.

In this post, I will look at the seismic shift that has occurred in polling practice over the past decade or so. Where the pollsters once used probability sampling methods, they no longer do. Slowly, one by one, first in part and then fully, they have embraced the wonderful world of non-probability sampling. This is actually a big deal.

But before we talk about the change and what it means, it is worth looking at why probability sampling is the gold standard in statistics. I will start with the foundation of inferential statistics: the Central Limit Theorem.

The Central Limit Theorem

Please accept my apologies in advance: I will get a little (but hopefully not too) mathematical here. While the Central Limit Theorem (CLT) is called a theorem, it is actually a (series of) mathematical proof(s). I will spare you those proofs, but it is important to understand what the CLT promises, and the conditions under which the CLT applies.

According to the CLT, a random sample drawn from a population will have a sample mean ($\bar{x}$) that approximates the mean of the parent population from which the sample has been drawn ($\mu$).

$$\bar{x}\approx\mu$$

The critical point here is that the CLT is axiomatic. Provided the pre-conditions of the mathematical proofs are met (we will come to this), this relationship between the sample mean and the population mean holds.

The CLT applies to numerical variables in a population (such as height or age). It applies to binomial population characteristics expressed as population proportions (such as the success or failure of a medical procedure). It also applies to multinomial population characteristics expressed as a set of population proportions (such as which political party a person would vote for if an election was held now: Labor, Coalition, Greens, One Nation, or another party).

The beauty of the CLT is that if multiple random samples are drawn from a population the means of those multiple random samples will be normally distributed (like a bell-curve) around the population mean, even if that characteristic in the population is not normally distributed. It is this property of a normal distribution of sample means, regardless of the population distribution, that allows us to infer the population mean with a known degree of confidence. It is because of the central limit theorem that opinion polling actually works.
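As a rough illustration of this property, here is a small Python simulation. It is a sketch under stated assumptions rather than anything a pollster would run: the population is assumed to be exponentially distributed (strongly skewed, with a mean and standard deviation of 1), and the sample size, number of repeats and random seed are all arbitrary choices.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

POP_MEAN = 1.0   # mean of the assumed exponential population
POP_SD = 1.0     # standard deviation of that population
N = 100          # size of each random sample
REPEATS = 5000   # number of independent samples drawn


def one_sample_mean():
    """Mean of N independent draws from the skewed population."""
    return statistics.fmean(rng.expovariate(1.0 / POP_MEAN) for _ in range(N))


means = [one_sample_mean() for _ in range(REPEATS)]

centre = statistics.fmean(means)       # should sit close to POP_MEAN
spread = statistics.stdev(means)       # should sit close to POP_SD / sqrt(N)
standard_error = POP_SD / N ** 0.5

# Share of sample means within 1.96 standard errors of the population mean
within = sum(abs(m - POP_MEAN) <= 1.96 * standard_error for m in means) / REPEATS

print(f"centre of sample means: {centre:.3f}")
print(f"spread of sample means: {spread:.3f} (theory: {standard_error:.3f})")
print(f"share within 1.96 standard errors: {within:.3f}")
```

Even though the individual draws are far from bell-shaped, roughly 95 per cent of the sample means should land within 1.96 standard errors of the population mean.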

But there are a couple of key caveats:

The sample must be randomly selected, and each random selection in the sample must be independent of any other selection in the sample. Every case in the population must have the same probability of being included in the sample. In practice, it is in achieving the ideal of this first caveat that most opinion polling runs into trouble.

The CLT works for all sample sizes if the population characteristic is normally distributed. But it only works with sufficiently large samples (typically $n\gt30$) for almost all non-normal distributions. There are a very small number of theoretical population distributions where the CLT does not work, such as the Cauchy distribution, which does not have a population mean. In practice, for social statistics, this last caveat is not a problem.

I mentioned above that the CLT allows me to make a statement about the population mean, with a known degree of confidence, based on the sample mean. This known degree of confidence is called the sampling error or the margin of error.

Sampling error

We are all familiar with pollsters reporting their errors. For example, the most recent Newspoll includes the statement:

This survey was conducted between 19-22 February and is based on 1,512 interviews among voters conducted online. The maximum sampling error is plus or minus 2.5 percentage points.

Let's spend a little time unpacking what this statement means. For a probability sample, how closely a sample mean approximates the population mean depends on the sample size ($n$). The larger the sample size, the more closely a sample mean is likely to approximate the population mean. For multiple samples from the same population, the larger the sample size, the tighter those sample means will be distributed around the population mean.

For proportions, to calculate the sampling error, we first calculate the standard error of the proportion. We then apply a standard-score or z-score to reflect the level of confidence we are comfortable with. In social statistics, the typical practice is a confidence level of 95 per cent, which corresponds to plus-or-minus 1.96 standard errors around the sample proportion:

$$SamplingError=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}*(\pm{1.96})$$

We can apply this calculation to the sample size Newspoll reported above and confirm their published maximum sampling error. The maximum sampling error occurs when both proportions are equal: 50-50 if you will:

$$\begin{align}Sampling\ Error&=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}*(\pm{1.96}) \\&=\sqrt{\frac{0.5*0.5}{1512}}*(\pm{1.96}) \\&=\pm0.0252 \\&=\pm2.5\ percentage\ points\end{align}$$

A 2.5 percentage point margin of error means that, if the sample was truly randomly selected, the population mean is likely to be within a confidence interval ranging between minus 2.5 percentage points and plus 2.5 percentage points of the sample mean 95 per cent of the time. Given their headline report of 51 to 49 per cent of the two-party preferred vote share in Labor's favour, Newspoll is actually saying there are 19 chances in 20 that the population vote share for Labor is between 48.5 and 53.5 per cent. It is giving the same chances that the population vote share for the Coalition is between 46.5 and 51.5 per cent.
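The same arithmetic can be sketched in a few lines of Python. The function names here are my own, chosen for illustration; the inputs (a sample of 1,512, a z-score of 1.96, and Labor on 51 per cent) are the figures quoted above.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Sampling error for a proportion p_hat from a simple random sample of n."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


def confidence_interval(p_hat, moe):
    """Lower and upper bounds implied by a point estimate and a margin of error."""
    return (p_hat - moe, p_hat + moe)


# The maximum sampling error occurs at a 50-50 split
max_moe = margin_of_error(0.5, 1512)

# Apply that maximum margin to the headline two-party preferred estimates
labor_low, labor_high = confidence_interval(0.51, max_moe)
coalition_low, coalition_high = confidence_interval(0.49, max_moe)

print(f"maximum sampling error: {max_moe * 100:.1f} percentage points")
print(f"Labor: {labor_low * 100:.1f} to {labor_high * 100:.1f} per cent")
print(f"Coalition: {coalition_low * 100:.1f} to {coalition_high * 100:.1f} per cent")
```

Note that this calculation is only meaningful under the assumption of a simple random sample, which is the point taken up next.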

There is some irony here. For this poll, Newspoll did not use a probability sample, where every member of the population of voting-aged Australians had an equal non-zero probability of being included in the sample. My probability of being polled by Newspoll was zero. I am not a member of the panel of respondents Newspoll uses. My probability of being polled by Essential is also zero, for much the same reason. And, if it was not a probability sample, it is impossible to find the sampling error (which is an artefact of probability sampling) or even to talk about confidence intervals.

But I get ahead of myself. I will come back to the current approach used by Newspoll and Essential in a minute. First, it is important to understand that when pollsters were using probability sampling, it was not perfect, and these imperfections necessitated adjustments.

The challenge of constructing a robust probability sample

Before online polling, the two main forms of polling were face-to-face polling and telephone polling.

National face-to-face polling is expensive. To manage travel costs, pollsters used clustered (multi-stage) sampling techniques, where they randomly selected several polling locations, and then randomly visited several households in each location. This meant the ideal of independent random selection was breached. The second person polled was not selected totally at random. While the second person was partly selected at random, in part, they were selected because they lived near the first person polled. It is also worth noting that today the national building stock includes many more high-rise apartments in security buildings than it once did, and polling people in these properties is challenging in practice.

When landline telephones became direct-dial and almost ubiquitous (sometime in the 1970s/1980s), pollsters used the pool of all possible telephone numbers or the telephone directory as a proxy for the Australian voting population. This practice was cheaper than face-to-face polling, and for a few decades at least, provided reasonably robust poll results. However, it was not perfect. Not every household had a phone. The probability of selection differed between multi-person and single-person households. People with active social lives or working shift-work were less likely to be polled than homebodies.

Pollsters addressed these problems by dividing their sample into mutually exclusive sub-cohorts (or strata) using criteria that are known to correlate with different propensities to vote in one way or another (such as gender, age, educational attainment, ethnicity, location, and/or income). The pollsters would then weight the various sub-cohorts (or strata) within the sample by their population proportion (using ABS census data). The pollsters argued that these weights would correct for any bias from having (say) too many old people in the response set.
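To make the weighting step concrete, here is a minimal Python sketch with two age strata. Every number in it is hypothetical (the population shares are not ABS figures, and the sample counts are invented), and real pollsters typically weight on several variables at once.

```python
# Hypothetical population shares for two age strata (not ABS data)
population_share = {"18-49": 0.55, "50+": 0.45}

# Hypothetical raw sample: too many older respondents
sample = {
    "18-49": {"n": 400, "labor": 220},  # 55 per cent Labor in this stratum
    "50+": {"n": 600, "labor": 270},    # 45 per cent Labor in this stratum
}

total_n = sum(s["n"] for s in sample.values())

# Weight = population share of the stratum / sample share of the stratum
weights = {
    group: population_share[group] / (sample[group]["n"] / total_n)
    for group in sample
}

# Unweighted estimate: every respondent counts equally
raw_estimate = sum(s["labor"] for s in sample.values()) / total_n

# Weighted estimate: each respondent counts in proportion to their stratum weight
weighted_estimate = (
    sum(weights[g] * sample[g]["labor"] for g in sample)
    / sum(weights[g] * sample[g]["n"] for g in sample)
)

print(f"weights: {weights}")
print(f"unweighted Labor share: {raw_estimate:.3f}")
print(f"weighted Labor share: {weighted_estimate:.3f}")
```

In this invented example, the over-sampled older stratum is weighted down and the under-sampled younger stratum is weighted up, which moves the estimate from 49.0 to 50.5 per cent.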

Both polling methods also had issues of non-response. Where a person refused to participate in a survey, the issue would be addressed by sub-cohort weightings. Another form of non-response is when a respondent says they "don't know" who they would vote for. Pollsters assumed these "don't know" respondents would likely vote in the same proportions as those respondents who did have a voting preference. To the extent that the assumption was true, this would reduce the effective sample size and increase the margin of error, which in turn increases the variance of the population estimate. Where the assumption was false, it would bias the population estimate. Either way, "don't know" respondents increase the error associated with the published population estimate.
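Both effects can be sketched with some simple arithmetic in Python. The figures are hypothetical: I assume 8 per cent of a 1,512-person sample are "don't know" respondents, that decided respondents split 51 per cent to Labor, and (for the bias case) that the undecided would actually split only 40 per cent to Labor.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Sampling error for a proportion from a simple random sample of n."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


n_all = 1512
dont_know_rate = 0.08                        # hypothetical share of undecided respondents
n_decided = round(n_all * (1 - dont_know_rate))

# Case 1: assumption true, but the estimate rests on fewer respondents
moe_all = margin_of_error(0.5, n_all)
moe_decided = margin_of_error(0.5, n_decided)

# Case 2: assumption false, the undecided break differently from the decided
labor_among_decided = 0.51                   # what the pollster would publish
labor_among_undecided = 0.40                 # hypothetical true split
true_labor_share = (
    (1 - dont_know_rate) * labor_among_decided
    + dont_know_rate * labor_among_undecided
)
bias = labor_among_decided - true_labor_share

print(f"decided respondents: {n_decided}")
print(f"margin of error, all respondents: {moe_all * 100:.2f} points")
print(f"margin of error, decided only: {moe_decided * 100:.2f} points")
print(f"bias if the undecided break differently: {bias * 100:.2f} points")
```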

While such adjustments mostly worked back then, it is important to know that the more adjustments that are made, the less confidence we can have in the resulting population estimates.
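One common way to put a number on this loss of confidence is Kish's approximation for the effective sample size of a weighted sample. This is a technique I am bringing in for illustration, not something the pollsters are known to publish, and the weights below are hypothetical: 400 respondents weighted up to 1.375 and 600 weighted down to 0.75.

```python
import math

# Hypothetical respondent-level weights after adjustment
weights = [1.375] * 400 + [0.75] * 600

n_nominal = len(weights)

# Kish's approximation: (sum of weights) squared / sum of squared weights
n_effective = sum(weights) ** 2 / sum(w * w for w in weights)


def margin_of_error(p_hat, n, z=1.96):
    """Sampling error for a proportion from a simple random sample of n."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


moe_nominal = margin_of_error(0.5, n_nominal)
moe_effective = margin_of_error(0.5, n_effective)

print(f"nominal sample size: {n_nominal}")
print(f"effective sample size: {n_effective:.0f}")
print(f"margin of error: {moe_nominal * 100:.2f} -> {moe_effective * 100:.2f} points")
```

The more unequal the weights, the smaller the effective sample size becomes, and the wider the margin of error that should really be attached to the estimate.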

In 2020, national face-to-face polling remains expensive. Telephone polling is much less reliable than it was only 15 years ago. There are several reasons for this (discussed in more detail in my previous post), but in summary:

with the rise of mobile phones and voice-over-internet-protocols (VOIP), far fewer households have traditional landlines
mobile phones are hard to poll randomly as the government does not publish a directory of mobile and VOIP numbers (and mobile and VOIP numbers are not structured to geography in the way landlines once were)
with the rise of annoying robocalls - and the constant requests for phone, SMS and email feedback [please rate the service you received this morning] - the cultural willingness of people to answer surveys has declined substantially
as response rates have declined, the cost of phone surveys has increased

With its inherently higher costs and modern building practice making face-to-face polling challenging, and the declining response rates and rising costs of telephone polling, the pollsters looked for another cost-effective approach. They have turned to non-probability quota sampling and online panel polls.

Non-probability quota sampling

Quota sampling depends on the pollster putting together a sample by design. Typically, such a sample would have the same characteristics as the population. If just over half the population is women, then just over half the quota sample would be women. And so on for a series of other demographic characteristics such as age groups, educational attainment, urban/regional/rural location, etc. Typically pollsters use the latest Australian Bureau of Statistics (ABS) census data to construct the control quotas used in building the sample.

However, sample members are not selected from the population through a random process. Typically they are selected from a larger panel which was constituted by the pollster through (for example) web-based advertising, snowballing (where existing panel members are asked to identify people they know who might want to be on a panel), and other pollster-managed processes. The art (not science) here is for the pollster to construct a sample that looks like it was selected at random from the larger population. Note: even if a quota is constructed at random from a larger panel, it is still not a probability sample of the population.
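A toy simulation in Python may help show why this matters. Everything in it is a made-up assumption: a population of 200,000 split 50-50 on vote, an opt-in panel that Labor voters join at 3 per cent and other voters at 2 per cent, and a single gender quota. It is a sketch of the mechanism, not a model of any real panel.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population of (gender, votes_labor) pairs
population = []
for _ in range(200_000):
    gender = "F" if rng.random() < 0.51 else "M"
    votes_labor = rng.random() < 0.50
    population.append((gender, votes_labor))

true_share = sum(v for _, v in population) / len(population)

# Hypothetical opt-in panel: joining depends on voting intention
panel = [p for p in population if rng.random() < (0.03 if p[1] else 0.02)]


def quota_sample(members, quotas):
    """Fill each gender quota by drawing at random from the panel."""
    chosen = []
    for gender, quota in quotas.items():
        eligible = [m for m in members if m[0] == gender]
        chosen.extend(rng.sample(eligible, quota))
    return chosen


# Quotas that match the population's gender split for a sample of 1,500
quota = quota_sample(panel, {"F": 765, "M": 735})
quota_share = sum(v for _, v in quota) / len(quota)
quota_female = sum(1 for g, _ in quota if g == "F") / len(quota)

# A simple random sample of the same size from the whole population
probability = rng.sample(population, 1500)
probability_share = sum(v for _, v in probability) / len(probability)

print(f"true Labor share: {true_share:.3f}")
print(f"quota sample: {quota_female:.2f} female, Labor share {quota_share:.3f}")
print(f"probability sample: Labor share {probability_share:.3f}")
```

Under these assumptions the quota sample matches the population on gender exactly, yet its Labor share is pulled towards the panel's share rather than the population's. No quota can correct for a characteristic that drives panel membership but is not itself a quota variable.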

Quota samples can be contrasted favourably with less robust non-probability sampling techniques such as:

self-selecting samples, where the respondents choose whether to participate or not in a poll, and
convenience samples, where the pollster/interviewer approaches respondents (typically close at hand), without any particular quotas to be met.

But quota sampling, even when done well, is less robust than good quality probability sampling. The sample frame for probability sampling is the entire population. The sample frame for quota sampling is a much smaller panel of respondents maintained by the pollster. Not all of the population has a chance of being polled with quota sampling. Does this make a difference? In my mind, there are two conflicting answers to this question.

Yes, there is an important distinction. By seeking to apply the ideal of a probability sample, pollsters rely on the axioms of mathematics to make a population estimate. With a quota sample, there is no axiomatic relationship between the sample mean and the population mean. Strictly speaking, it is not justifiable to make a population estimate from a survey of a quota sample. While quota sampling may work somewhat in practice, the process is one-step removed from being mathematically grounded, and it is much more susceptible to sample selection biases (which we will discuss further below).

But this might be less important in practice. If (as happened with telephone polling) response rates become very small, and there are no cost-effective methods for implementing a robust probability sample, quota sampling might be a pragmatic alternative to a very low response rate probability-sample polling. But this comes with a price: higher errors in poll estimates.

Nonetheless, quota sampling for internet polling is not without its challenges.

Issues with quota sampling and internet polling

The first thing to note is that, according to Granwal (2019), 88 per cent of the Australian population is an active user of the internet. While this proportion might appear high, it is not 100 per cent. Furthermore, Thomas et al. (2018) have identified particular cohorts that are under-represented in their internet use:

households with income under $35,000/year (quintile 5)
people whose only access to the internet is with mobile phones
people aged 65 years or more
people who did not complete secondary education
people with a disability
households with income between $35,000 and $60,000/year (quintile 4)
people not in the labour force
Indigenous Australians
people aged between 50 and 64 years

Of necessity, this suggests there will be selection bias issues with any sampling approach that depends on data collection via the internet.

Furthermore, research by Heen, Lieberman and Miethe (2014) suggests that some polling organisations do poorly at constructing samples that reflect sub-cohort proportions in the population against standard demographic characteristics. It is unclear to what extent this might be a problem with election polling in Australia.

Frequent survey responders are another problem with quota sampling from a limited panel. There are a couple of dimensions to this problem. The first is that frequent serial responders may respond differently to other responders (for example, their views may become frozen if they frequently answer the same question, or they may devote less time to question answering over time). The second is that parallel frequent responders may exist because they lie and cheat to be listed multiple times under different aliases on multiple pollsters' panels (Whitsett 2013). Parallel frequent responders may do so for political reasons (to deliberately skew poll reports) or they may do this for the financial rewards associated with survey completion.

In the context of Australian election polling, the existence of frequent responders has been suggested as an explanation for the underdispersion that has been seen at times from pollsters who have adopted quota sampling approaches.

While pollsters have used financial incentives as a method to increase participation rates, this may skew participation towards low-income cohorts. It could also lead to a cluster of "professional respondents" in the data - frequent serial respondents who complete surveys for the money. Handorf et al. (2018) note that the offering of incentives increases repeat-responders, which in turn reduces the quality of survey data. De Leeuw and Matthisse (2016) note that professional respondents "are more often female, slightly less educated and less often gainfully employed".

Drawing a number of the above themes together, Lee-Young (2016) suggests that pollsters using online quota sampling need to weed out a range of "dishonest" respondents:

professional competition entrants - the frequent responders noted above
survey speeders - who complete the survey in record time (and with insufficient consideration)
straight-liners - who give the same (Likert-scale) response to every question
block-voters - who give the same answer as each other, to advocate through polling

Another list, from Gut Check It (2016), suggests pollsters need to watch out for: professionals, rule-breakers, speeders, straight-liners, cheaters (survey bots and multiple alias responders), and posers (caught in social desirability bias).

Because a quota sample is constructed by the pollster rather than randomly selected from the population, these are not problems typically faced in probability sampling. Without careful attention to how the quota sample is constructed, pollsters run a significant risk of increased bias.

Conclusion

The question this post explores is whether the shift to non-probability sampling over the past fifteen years or so was a contributing factor to the polling failure before the 2019 Australian Federal Election. The short answer is: quite probably.

But we cannot blame pollsters for this and simply go back to the approaches that worked in the eighties, nineties, and noughties. Back then, it was relatively easy to construct telephone polls with robust probability sampling. However, the rise of mobile phones and VOIP technology, and changing community attitudes, have disrupted the pollsters. As a result, the pollsters can no longer construct cheap and workable phone polls and probability frames with high response rates. And by embracing quota sampling, the pollsters have told us they believe that the high-quality polling accuracy we had enjoyed in Australia since the mid-1980s is no longer possible (or perhaps, no longer affordable).

While pollsters are caught between Scylla and Charybdis, we the poll-consuming public need to come to terms with the increased likelihood of more poll misses in the future. Today's polls have more inherent error than we are used to.

Sunday, March 1, 2020