Wednesday, July 31, 2013

Further explorations in non-linearity

On the weekend I began exploring a non-linear model for aggregating polling. My test case produced nice looking graphs; but the results were in some large part an artifact of the priors I had chosen for the model (not good).

In the comments to that post, I suggested that part of the problem may have been how I had defined the model.

Thinking about this some more, I think the problem was in this statement: << beta-for-pollster * (poll-result - minimum-poll-result) >>. Rather than the minimum, it should be a central tendency of some type (mid-point, mean, median, etc.).

I have now redefined the model to use the midpoint for each polling house (defined as the (min+max)/2 for that house). The use of house-specific mid-points acknowledges that the raw x scores include the alpha house effect I am trying to estimate. The revised model is:

house-effect-for-pollster = alpha-for-pollster + beta-for-pollster * (poll-result - pollster-mid-point)

The priors to the beta effect have been completely freed-up, so that they are uninformative. As a result, this model works better internally than the previous model.

However, I have added a constraint such that the aggregation assumes the "fizziness" factors also sum to zero. This may be a bollocks assumption; and will need some further analysis.

So - with the caveat that this is still very early exploratory analysis - the results of this new model follow.

On a plain reading, the above charts suggest that the current Rudd effect may actually be better for Labor than other aggregations have found. However, I am not convinced this is anything more than an artifact of the model. I am particularly concerned about the inclusion of data from polling houses that do not have data points across the highs and lows of the Gillard period, and across the Rudd and Gillard periods. Such data don't fully occupy the model, which can result in artifacts.

If we limit the analysis to data from Nielsen and Newspoll for a comparison using data that does fully occupy the model, we get the following.

Because I am committed to presenting poll analysis in a transparent and unbiased fashion, the next two code snippets show how I package up the data for the model (in R), and the model itself (in JAGS).

    # - prepare the data ...
data$y <- data[ , y] / 100 # vote share between 0-1 data$variance <- data$y * (1 - data$y) / data$Sample #pq/n data$standardError <- sqrt(data$variance) data$samplePrecision <- 1 / data$variance HOUSES <- levels(data$House)
HOUSECOUNT <- length(levels(data$House)) NUMPOLLS <- length(data$y)
cat('Number of polls: '); print(NUMPOLLS)
midpoints <- rep(NA, HOUSECOUNT)
for(i in seq_len(HOUSECOUNT)) {
midpoints[i] <- (max(data[data$House==HOUSES[i], 'y']) + min(data[data$House==HOUSES[i], 'y'])) / 2
cat('Midpoints: '); cat(i); cat(' '); cat(HOUSES[i]); cat(': '); print(midpoints[i])
}

# - remember these dots
dotsFile <- paste('./files/', fPrefix, 'original.csv', sep='')
write.csv(data, file=dotsFile)

# - manage dates
day0 <- min(data$Date) - 1 # walk starts from earliest date if(endDate == TODAY) endDate <- max(data$Date)
PERIOD <- as.numeric(endDate - day0) # length of walk in days
DISCOUNTINUITYDAY <- as.numeric(as.Date(discontinuity) - day0)
cat('Discontinuity: '); print(DISCOUNTINUITYDAY)
tPrefix <- paste(format(day0+1, '%d-%b-%Y'), ' to ', format(endDate, '%d-%b-%Y'), sep='')
data$day <- as.numeric(data$Date - day0)

# - do the MCMC thing ...

parameters <- list(PERIOD = PERIOD,
HOUSECOUNT = HOUSECOUNT,
NUMPOLLS = NUMPOLLS,
DISCOUNTINUITYDAY = DISCOUNTINUITYDAY,
NEWSPOLL = which(levels(data$House) == 'Newspoll'), y = data$y,
x = data$y, day = data$day,        house = as.integer(data$House), samplePrecision = data$samplePrecision,
midpoints=midpoints
)

jags <- jags.model(textConnection(model),
data=parameters,
n.chains=4,
)

# - burn in
update(jags, n.iter=n.update) # burn-in the chains

# - capture results
jags.capture <- c('walk', 'alpha', 'beta', 'discontinuityValue')
coda.samples <- coda.samples(jags, jags.capture, n.iter=n.iter, thin=n.thin)
coda.matrix <- as.matrix(coda.samples)

model {
## Derived from Simon Jackman's original model

## -- observational model
for(poll in 1:NUMPOLLS) {
# note: x and y are the original polling series
houseEffect[poll] <- alpha[house[poll]] +
beta[house[poll]]*(x[poll]-midpoints[house[poll]])
mu[poll] <- walk[day[poll]] + houseEffect[poll]
y[poll] ~ dnorm(mu[poll], samplePrecision[poll])
}

## -- temporal model
for(i in 2:PERIOD) { # for each day under analysis ...
walk[i-1]+discontinuityValue, walk[i-1])
}
sigmaWalk ~ dunif(0, 0.01)            ## uniform prior on std. dev.
walkPrecision <- pow(sigmaWalk, -2)   ##   for the day-to-day random walk
walk[1] ~ dunif(0.4, 0.6)             ## uninformative prior
discontinuityValue ~ dunif(-0.2, 0.2) ## uninformative prior

## -- house effects model
for(i in 2:HOUSECOUNT) { ## vague normal priors for house effects
alpha[i] ~ dunif(-0.1,0.1)
beta[i] ~ dunif(-1,1)
}
alpha[NEWSPOLL] <- -sum(alpha[2:HOUSECOUNT]) ## sum to zero
beta[NEWSPOLL] <- -sum(beta[2:HOUSECOUNT]) ## sum to zero
}

1 comment:

1. Is there any reason that you are defining the bias as:

"house-effect-for-pollster = alpha-for-pollster + beta-for-pollster * (poll-result - pollster-mid-point)"