Saturday, July 18, 2015

Sealing wax, sample sizes and things

The latest Morgan poll for the period July 4/5 and 11/12, 2015 had the Coalition on 49 per cent and Labor on 51 per cent. It was based on an Australia-wide cross-section of 3110 Australian electors. As past data suggested a systemic house bias from Morgan towards Labor of around one percentage point, most aggregators would have treated this data point as something like a 50-50 result.

This Morgan poll has thrown some poll aggregators. Some have moved a percentage point or more in the Coalition's favour in a week (in what I thought was a fairly average week in politics, in as far as it was likely to influence voting behaviour). The only other poll out this week was Essential, which was unchanged on last week's rolling fortnight result (48-52, Coalition to Labor).

On 6 July, the Phantom Trend had the Coalition on 47.71. Come 13 July, that was 49.36 (an improvement for the Coalition of 1.65 percentage points).

On 9 July, the Poll Bludger had the Coalition on 47.8. Come 16 July, the Poll Bludger had the Coalition on 49.0 per cent (an improvement of 1.2 percentage points in a week).

More subdued is Kevin Bonham, who has moved from 47.5 per cent to 48.1 (a movement of 0.6 percentage points).

While I typically only update my aggregation on a fortnightly basis this far out from an election, my own movement over the week was from 47.4 to 48.0 (also a movement of 0.6 percentage points). But note, I apply an adjustment to the Morgan poll sample size in my aggregation. I usually treat Morgan polls as if they came from a sample of 1000 voters. If I turn this feature off, my aggregate poll result would have been 48.7 (a movement of 1.3 percentage points).
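To get a feel for why the assumed sample size matters, here is a minimal sketch of the sampling standard error of a proportion, assuming simple random sampling and a 50-50 result. The function name is illustrative only, and this is not the aggregation model itself; it just shows how much noisier a poll is assumed to be when it is treated as a sample of 1000 rather than 3110.

```python
import math


def standard_error(p, n):
    """Standard error of a proportion p from a simple random sample of size n."""
    return math.sqrt(p * (1.0 - p) / n)


# Reported Morgan sample size versus the assumed size of 1000
se_reported = standard_error(0.5, 3110)
se_assumed = standard_error(0.5, 1000)

print(f"n=3110: {se_reported * 100:.2f} percentage points")  # 0.90
print(f"n=1000: {se_assumed * 100:.2f} percentage points")   # 1.58
```

A poll with a larger assumed standard error pulls an aggregate around less, which is consistent with the smaller movement reported above when the adjustment is switched on.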

Anyway, all of this got me wondering about how I treat poll variance in my models. Variance is a measure of how much a statistic is likely to bounce around.

I decided to check whether the poll bounciness from each polling house was consistent with the typically reported sample sizes for the polling house. If I treat the smoothed aggregation as the population voting intention at any point in time, I can compare the performance of systemic-bias-adjusted poll results against this voting intention. I can then derive a notional sample size, based on the bounciness of the poll results, for each of the polling houses.

Algebraically, there are two relevant formulas for statistical variance (you will need JavaScript activated to see these formulas in your web browser using MathJax):
$Variance=\frac{1}{n}\sum(x-\mu)^2$
$Variance=\frac{pq}{N}$

Where $$n$$ is the number of polls for the polling house, $$x$$ is a systemic-bias-adjusted poll result, $$\mu$$ is the population voting intention at the time of that poll (proxied by a 91-term Henderson-smoothed poll aggregation), $$p$$ is the two-party preferred Coalition proportion (averaged for the period from the pollster's systemic-bias-adjusted results), $$q$$ is the two-party preferred Labor proportion (where $$q=1-p$$), and $$N$$ is the sample size. We can solve for $$N$$ by calculating the variance from the first equation and substituting it into the second equation, rearranged as follows.
$N=\frac{pq}{Variance}$
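The calculation can be sketched in a few lines of Python. This is a minimal illustration of the two formulas above, not the code behind the results that follow; the function name and the four data points are made up for the example.

```python
def implied_sample_size(adjusted_polls, smoothed_aggregate):
    """Return N = pq / Variance for one polling house.

    adjusted_polls: bias-adjusted Coalition TPP proportions (e.g. 0.48)
    smoothed_aggregate: the smoothed aggregate on each poll's date,
        in the same order as adjusted_polls
    """
    n = len(adjusted_polls)
    if n == 0 or n != len(smoothed_aggregate):
        raise ValueError("need two non-empty lists of equal length")
    # First formula: mean squared deviation from the proxy for mu
    variance = sum((x - mu) ** 2
                   for x, mu in zip(adjusted_polls, smoothed_aggregate)) / n
    if variance == 0:
        raise ValueError("zero variance: implied sample size is undefined")
    # p is averaged over the pollster's adjusted results; q = 1 - p
    p = sum(adjusted_polls) / n
    q = 1.0 - p
    # Second formula, rearranged for N
    return p * q / variance


# Made-up numbers, purely for illustration
polls = [0.49, 0.47, 0.485, 0.465]
aggregate = [0.48, 0.48, 0.475, 0.475]
print(round(implied_sample_size(polls, aggregate)))  # 2495
```

Each made-up poll sits one percentage point from the aggregate, so the variance is 0.0001 and the implied sample size is about 2495.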

The results, using the polling data since the last election, follow. Arguably, every house other than Morgan and Newspoll has too few polls to analyse sensibly. The analysis suggests that the Morgan poll is a touch more bouncy (it has a higher variance) than a typical Morgan sample of around 2900 respondents would suggest. It looks like Nielsen had an unlucky run of outlier polls (not surprising given we are only talking about a small number of polls). This run affected Nielsen's result, but with only 7 polls, the analysis should be treated with some caution.

| House | # polls (n) | Implied sample size (N) | Typically reported sample size |
| --- | --- | --- | --- |
| Galaxy | 9 | 1275.94 | 1400 |
| Ipsos | 8 | 1283.48 | 1400 |
| Morgan | 47 | 2416.09 | 2900 |
| Newspoll | 34 | 1235.65 | 1150 |
| Newspoll2 | 1 | - | 1600 |
| Nielsen | 7 | 668.01 | 1400 |
| ReachTEL | 21 | 3642.97 | 3400 |

By the way, so that Morgan was not disadvantaged in this exercise, I turned off the 1000 sample size adjustment I usually apply to Morgan. The smoothed poll aggregate was derived using the actual Morgan sample sizes as reported.

A chart of the 91-day smoothed aggregate follows. I used a smoothed aggregate so that individually anomalous poll results would not overly sway the calculation of population voting intention on a day.
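For readers unfamiliar with Henderson smoothing, the sketch below computes the symmetric weights from the standard closed-form Henderson formula and applies them to the interior of a series. It is an assumption that this matches the smoothing used here in detail; in particular, the treatment of the first and last 45 days (where a full 91-term window is not available) is not covered, and the function names are illustrative.

```python
def henderson_weights(terms):
    """Symmetric Henderson moving-average weights for an odd number of terms."""
    if terms < 5 or terms % 2 == 0:
        raise ValueError("terms must be an odd integer of at least 5")
    m = (terms - 1) // 2
    n = m + 2
    denominator = (8 * n * (n ** 2 - 1) * (4 * n ** 2 - 1)
                   * (4 * n ** 2 - 9) * (4 * n ** 2 - 25))
    weights = []
    for j in range(-m, m + 1):
        numerator = (315 * ((n - 1) ** 2 - j ** 2) * (n ** 2 - j ** 2)
                     * ((n + 1) ** 2 - j ** 2)
                     * (3 * n ** 2 - 16 - 11 * j ** 2))
        weights.append(numerator / denominator)
    return weights


def smooth_interior(series, terms):
    """Apply the weights where a full window exists; leave the ends as None."""
    weights = henderson_weights(terms)
    m = (terms - 1) // 2
    smoothed = [None] * len(series)
    for i in range(m, len(series) - m):
        smoothed[i] = sum(w * series[i - m + k]
                          for k, w in enumerate(weights))
    return smoothed


weights_91 = henderson_weights(91)
print(len(weights_91), round(sum(weights_91), 6))  # 91 1.0
```

The weights sum to one and are largest in the middle of the window, so a single anomalous poll only nudges the smoothed value on any given day.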

The other question that interested me was whether these poll results were normally distributed, or whether they had "fat" tails. To assess this, I performed a kernel density estimation. For this I used the Python statsmodels package.

In the following charts, using the derived variance above for the normal curve, and limiting our analysis to the two polling houses where we have enough data, it would appear that our poll results are close to normally distributed. But I cannot rule out the possibility of fatter tails. In plain English, it would appear that pollsters are perhaps a touch more likely to throw an outlier poll than you would expect from statistical theory.

Kernel density estimates are designed so that the area under the curve integrates to one. This lets the curve be read as a probability density: the probability of an outcome falling in a given range is the area under the curve over that range. In the above charts, the x-axis is expressed as a proportion rather than in percentage points; for example, 2 percentage points appears as 0.02 (just in case you were wondering about the relative scales of the x and y axes).
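To make the scaling point concrete, here is a hand-rolled Gaussian kernel density estimate using only the Python standard library. It is not the statsmodels implementation used for the charts; the bandwidth rule (a normal-reference rule of thumb), the function name and the ten deviations are all assumptions for illustration.

```python
import math
import statistics


def gaussian_kde(data, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of data at each grid point."""
    n = len(data)
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumption, not the charts' setting)
        bandwidth = 1.06 * statistics.stdev(data) * n ** (-0.2)
    scale = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [scale * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                        for xi in data)
            for x in grid]


# Made-up poll-minus-aggregate deviations, expressed as proportions
deviations = [-0.021, -0.012, -0.008, -0.003, 0.0,
              0.002, 0.005, 0.009, 0.014, 0.025]
step = 0.0005
grid = [-0.08 + step * i for i in range(321)]  # -0.08 to 0.08
density = gaussian_kde(deviations, grid)

# Riemann-sum approximation of the area under the curve
area = sum(density) * step
print(f"area = {area:.3f}, peak density = {max(density):.1f}")
```

Because the x-axis spans only a few hundredths, the density has to reach well above one for the area to come out at one, which is why the y-axis values look so large.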

I can combine the poll results from all houses as standardised z-scores, using the variance calculated above. In this combined chart of all poll results, the "fatter tails" are more evident. The z-scores are the differences divided by the standard deviation. The standard deviation ($$\sigma$$) is the square root of the variance ($$\sigma^2$$).
$Standard\ Deviation = \sqrt{Variance}; \quad \sigma = \sqrt{\sigma^2}$
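A minimal sketch of the standardisation step follows, assuming each house's deviations are divided by that house's own standard deviation before being pooled. The house names, deviations and variances are placeholders, not the data behind the chart.

```python
import math


def z_scores(deviations, variance):
    """Divide each (poll minus aggregate) deviation by sigma = sqrt(variance)."""
    sigma = math.sqrt(variance)
    return [d / sigma for d in deviations]


# Placeholder data: deviations as proportions, plus each house's variance
houses = {
    "House A": ([0.010, -0.020, 0.005], 0.0001),
    "House B": ([0.030, -0.015], 0.000225),
}

combined = []
for name, (deviations, variance) in houses.items():
    combined.extend(z_scores(deviations, variance))

print([round(z, 2) for z in combined])  # [1.0, -2.0, 0.5, 2.0, -1.0]
```

Once every house is on the same scale, the pooled z-scores can be compared against a standard normal curve with a mean of zero and a standard deviation of one.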

Because we are combining a number of small sample sizes, there is a reasonable likelihood this analysis is problematic.

The fatter the tail, the more likely that a polling aggregate will be sent off course from time to time. Perhaps this week was one of those times. As is so often the case, we will need to wait and see.

Update

I have been asked for quantile-quantile plots (affectionately known as QQ plots) to further explore the extent to which the bias-adjusted poll results are normally distributed.
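For anyone wanting to build a QQ plot by hand, the points can be sketched with the standard library alone. The plotting positions (i + 0.5) / n are one common convention and are an assumption here, as are the function name and the sample z-scores.

```python
from statistics import NormalDist


def qq_points(z_scores):
    """Pair sorted z-scores with the matching standard normal quantiles."""
    n = len(z_scores)
    ordered = sorted(z_scores)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Made-up z-scores for illustration
points = qq_points([0.3, -1.2, 2.5, -0.4, 0.1])
for theoretical, observed in points:
    print(f"{theoretical:6.2f} {observed:6.2f}")
```

If the poll results were perfectly normal, the points would fall on the 45-degree line; fat tails show up as the lowest points dipping below that line and the highest points rising above it.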