Sunday, March 11, 2018

What do you do when the data does not fit the model?


I have a couple of models in development for aggregating primary vote intention. I thought I would extend the centred logit model from four parties (Coalition, Labor, Greens and Other) to five parties (Coalition, Labor, Greens, One Nation and Other). It was not a significant change to the model. However, when I went to run it, it generated pages of warning messages, well into the warm-up cycle:
The current Metropolis proposal is about to be rejected because of the following issue: [boring technical reason deleted]
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine, but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
My model is highly constrained, and it has always produced a few of these warnings at start-up, but nothing like the number now being generated. Fortunately, the model was not completely broken, and it still produced an estimate.
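To make the structure concrete, here is a minimal Stan sketch of a centred-logit parameterisation for the five party groups. It is not my production model, and the prior is a placeholder; the point is simply that moving from four parties to five adds one more free logit, with the vote shares recovered by constraining the logits to sum to zero and applying a softmax.

    data {
      int<lower=2> n_parties;              // 5: Coalition, Labor, Greens, One Nation, Other
    }
    parameters {
      vector[n_parties - 1] free_logits;   // one fewer free parameter than parties
    }
    transformed parameters {
      // pin the last logit so the full set sums to zero (the "centring")
      vector[n_parties] centred_logits = append_row(free_logits, -sum(free_logits));
      simplex[n_parties] vote_share = softmax(centred_logits);   // shares sum to one
    }
    model {
      free_logits ~ normal(0, 1);          // placeholder prior only
    }

In the full model, the centred quantities evolve over time as the hidden population trend described below, rather than sitting under a fixed prior.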

When I looked at the output from the model, the issue was immediately evident: the Ipsos data was inconsistent with the assumptions in my model. In this chart, the first Ipsos poll result for One Nation is close to the results from the other pollsters, but the remaining three polls sit well below the estimates from the other pollsters. This difference delivers a large house-bias estimate for Ipsos, pushed right to the edge of my prior for house biases: normal(0, 0.015), i.e. a normal distribution with a mean of 0 and a standard deviation of 1.5 percentage points.



The model assumes that each polling house tracks around some hidden population trend, with each poll's deviation from that trend explained by two factors: a random amount associated with sampling variation, and a constant additive amount associated with that pollster's systemic house bias.
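In Stan terms, that assumption amounts to something like the simplified sketch below. It tracks a single party rather than the centred five-party vector, and the variable names and random-walk scale are illustrative only, but the house-bias prior is the normal(0, 0.015) mentioned above.

    data {
      int<lower=1> n_polls;
      int<lower=1> n_days;
      int<lower=1> n_houses;
      array[n_polls] int<lower=1, upper=n_days> poll_day;      // day each poll was taken
      array[n_polls] int<lower=1, upper=n_houses> poll_house;  // which pollster
      vector<lower=0, upper=1>[n_polls] y;                     // reported vote share
      vector<lower=0>[n_polls] sample_sigma;                   // sampling standard error
    }
    parameters {
      vector<lower=0, upper=1>[n_days] hidden_trend;  // latent population voting intention
      vector[n_houses] house_bias;                    // constant additive house effect
    }
    model {
      // gentle random walk for the hidden trend (scale is illustrative)
      hidden_trend[2:n_days] ~ normal(hidden_trend[1:(n_days - 1)], 0.0025);

      // house-bias prior: mean 0, standard deviation of 1.5 percentage points
      house_bias ~ normal(0, 0.015);

      // each poll = trend on its day + that pollster's bias + sampling noise
      y ~ normal(hidden_trend[poll_day] + house_bias[poll_house], sample_sigma);
    }

Under this structure, a pollster that sits persistently away from the trend has that gap absorbed into its house_bias term, which is why the later Ipsos results push the Ipsos estimate hard against the normal(0, 0.015) prior.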

If we assume the other pollsters are close to the trend, then a few explanations are possible (listed from most to least plausible in my view):
  • Ipsos changed its polling methodology between the first and subsequent polls, and the model's assumption of a constant additive systemic house bias is incorrect; or
  • the first poll is a huge outlier for Ipsos (well outside the typical sampling variation for a single poll), and the remaining three polls are more typical and incorporate a substantial house bias for Ipsos; or
  • the first Ipsos poll is the plausible one, and the last three are huge outliers.
Whatever the explanation, it makes the resolution of an estimate more challenging for Stan. And it gives me something more to ponder.

I should note that while the Ipsos results are low for One Nation, they are consistently above average for the Greens. That kind of consistent offset is in line with the model's assumptions.



There is some allure in removing the Ipsos data from the model; but there be dragons. As a general rule of thumb, removing outliers from the data requires a compelling justification, such as independent verification of a measurement error. I like to think of outliers as a disturbance in the data whose cause I don't yet know or understand.
