Wednesday, April 25, 2018

Simple poll aggregation models

I have played some more with my simplified poll aggregation models using Henderson moving averages (HMA) and Locally Weighted Scatter-plot Smoothing (LOWESS). I fit the smoothed curve taking account of house biases (that is, the systemic tendency for a pollster to favour one house or the other). The models are fit iteratively, with the objective of adjusting the estimated house effects to minimize the sum of the errors squared for the fit. I have also compared these models with the Hierarchical Bayesian model I use.

Cutting to the chase: the plot of the various aggregation models follows. Just under the plot heading you can see the end-point for each of the models (as at 22 April 2018). As usual, the input data for these models comes from the Wikipedia page on the Next Australian Federal Election.

The estimated house biases (the first six columns expressed in pro-Coalition percentage points) for each model (the rows) are in the next table. Note, these are relative house effects as the rows have been constrained to sum to zero. The "Iter" column is the number of iterations taken to produce this estimate. The "Sum Errors Squared" column is the sum of the errors squared, noting that within the model these are calculated from proportions (between 0 and 1) and not percentage points (between 0 and 100).

Essential Ipsos Newspoll ReachTEL Roy Morgan YouGov Iter Sum Errors Squared
HMA-181 -0.845100 -0.568599 -0.615695 -0.408452 0.278507 2.159338 13 0.008061
HMA-365 -0.832409 -0.485705 -0.589923 -0.401716 0.150488 2.159266 12 0.008920
LOWESS-91 -0.818754 -0.554349 -0.604249 -0.403321 0.216127 2.164546 13 0.008110
LOWESS-181 -0.826693 -0.475102 -0.577678 -0.413161 0.169364 2.123270 12 0.009222

This compares well with the Hierarchical Bayesian model:

The updated code follows.
# PYHTON: iterative data fusion: 
#   using Henderson Moving Averages (HMA)
#   and Locally Weighted Scatterplot Smoothing (LOWESS)
#   with a sum-to-zero adjustment for House biases 

import pandas as pd
import numpy as np
from numpy import dot
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import statsmodels.api as sm
lowess = sm.nonparametric.lowess

import sys
sys.path.append( '../bin' )
from mg_plot import mg_min_max_end
from Henderson import Henderson'../bin/markgraph.mplstyle')

# --- key constants
HMA_PERIODS = [181, 365] # days
LOWESS_PERIODS = [91, 181] # days

graph_dir = './Graphs/'
graph_leader = 'FUSION-'
intermediate_data_dir = "./Intermediate/"

# --- Functions
def note_house_effects(effects, Houses, mode, period, 
    iter_count, current_sum):
    ''' For each iteration we record the results. This function compiles
        the results into a single row DataFrame, which will be appended
        to the iteration history DataFrame
        effects:    is a column vector of house effects that was applied
        Houses:     is a list of Houses
        mode:       is a mode in 'HMA' or 'LOWESS'
        iter_count: is the iteration count as an integer
        current_sum: is the error squared sum as a float
        returns:    a Pandas DataFrame with one row
    house_effects = pd.DataFrame([effects.T[0]], 
        columns=Houses, index=[iter_count])
    house_effects['Iterations'] = iter_count
    house_effects['Model'] = '{}-{}'.format(mode, period)
    house_effects['Error Sq Sum'] = [current_sum]
    house_effects['Effects Sq Sum'] = dot(effects.T, effects)[0]
    return house_effects

def interpolate_and_smooth(x):
    ''' interpolate missing data in a series then apply 
        a 21-term moving average - with smaller moving averages 
        at the end of the series.
        x:          a Pandas Series
        returns:    a Pandas Series
    # set up smoothers - note centre weighting in 7- and 21-term MA
    smoother21 = np.array([1,1,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,1,1])
    smoother21 = smoother21 / np.sum(smoother21)
    smoother7 = np.array([1,2,3,3,3,2,1]) # 7-term MA
    smoother7 = smoother7 / np.sum(smoother7)
    smoother3 = np.array([1,1,1])
    smoother3 = smoother3 / np.sum(smoother3)
    x = x.interpolate()
    # calculate smoothers
    s21 = x.rolling(window=len(smoother21), min_periods=len(smoother21), 
        center=True).apply(func=lambda x: (x * smoother21).sum()) 
    s7 = x.rolling(window=len(smoother7), min_periods=len(smoother7), 
        center=True).apply(func=lambda x: (x * smoother7).sum()) 
    s3 = x.rolling(window=len(smoother3), min_periods=len(smoother3), 
        center=True).apply(func=lambda x: (x * smoother3).sum()) 
    # fix the end data from smoothed to less smoothed to unsmoothed 
    s = s7.where(s21.isnull(), other=s21) 
    s = s3.where(s.isnull(), other=s) 
    s = x.where(s.isnull(), other=s) 

def estimate_hidden_states(ydf, mode, period, n_days):
    ''' This function takes the house-effect adjusted y values and 
        estimates a hidden vote share for each day under analysis using 
        moving averages to give a smooth result.
        ydf:        is a DataFram of y values, 
                    with cols: Day and adjusted_y
        mode:       is a MODALITY string - either 'HMA' or 'LOWESS'
        period:     in days - the span for the moving average
        returns:    a pandas Series indexed by days

    # --- plot known data points and interpolate the in-between days
    #     where more than one poll on a day, average those polls.
    hidden_state = pd.Series(np.array([np.nan] * n_days))
    for day in ydf['Day'].unique():
        result = ydf[ydf['Day'] == day]
        hidden_state[day] = result['adjusted_y'].mean()
    # --- apply the HMA or LOWESS smoothing 
    if mode == 'HMA':
        # HMA requires data points for all x values
        # therefore we interpolate first
        hidden_state = interpolate_and_smooth(hidden_state)
        hidden_state = Henderson(hidden_state, period)
    elif mode == 'LOWESS':
        # LOWESS does not require data points for all x values
        hidden_state = pd.Series(
            lowess(hidden_state, hidden_state.index, frac=period/n_days, 
        hidden_state = interpolate_and_smooth(hidden_state)
    return hidden_state

def calculate_house_effects(Houses, ydf, hidden_state):
    ''' For a curve generated by the estimate_hidden_states function, 
        calculate the zero-sum constrained house effects for each pollster
        Houses:     is a list of pollsters
        ydf:        is a pandas DataFrame of y values/attributes, 
                    with columns y, Day and Firm
        returns:    a column vector of house effects
    new_effects = []
    for h in Houses:
        count = 0
        sum = 0
        result = ydf[ydf['Firm'] == h]
        for index, row in result.iterrows():
            sum += hidden_state[row['Day']] - row['y']
            count += 1.0
        new_effects.append(sum / count)
    new_effects = pd.Series(new_effects)
    new_effects = new_effects - new_effects.mean() # sum to zero
    effects = new_effects.values
    effects.shape = len(effects), 1 # it's a column vector
    return effects

def sum_error_squared(hidden_state, ydf):
    ''' For a curve generated by the estimate_hidden_states function, 
        calculate the sum of the errors-squared for the y observations
        hidden_state: a pandas Series indexed by days
        ydf:        is a pandas DataFrame of y values/attributes, 
                    with columns Day and adjusted_y
        returns:    a float
    dayValue = hidden_state[ydf['Day']]
    dayValue.index = ydf.index
    e_row = (dayValue - ydf['adjusted_y']).values # row vector
    return dot(e_row, e_row.T)

def get_minima(history):
    ''' Return the minimum sum of errors-squared from the iteration 
        history DataFrane
        history:    pandas DataFrame of all iterations to now
        returns:    minimum value for the sum of errors squared
    return history['Error Sq Sum'].min()

def get_details(search, history, Houses):
    ''' Find the details in the iteration history DataFrame for 
        a specific search value. The search term is the value of 
        sum errors squared being sought (typically a minimum)
        search:     the value of sum errors squared being sought
        history:    pandas DataFrame of all iterations to now
        Houses:     is a list of pollsters
        returns:    (iter_num, effects) - found selected effects in history
    effects = history[history['Error Sq Sum'] == search][Houses].T.values
    effects.shape = (len(effects), 1) # effects is a column vector
    iter_num = history[history['Error Sq Sum'] == search]['Iterations']
    return (iter_num, effects)

def curve_fit(Houses, H, mode, period, ydf, n_days):
    ''' Iteratively fit curves to the data, then adjust the data to
        better reflect the house effects for each pollster. Stop when
        the changes being made become minimal. 
        Houses:     is a list of Houses
        H:          is a House Effects dummy var matrix
        mode:       is a MODALITY string - either 'HMA' or 'LOWESS' - 
                    which reflects the type of curve we will fit
        period:     in days - the span for the moving average
        ydf:        pandas DataFrame of y variables with cols 'y', 'Day' 'Firm'
        n_days:     number of days under analysis
        returns:    (iter_count, history, y)
    # --- initialisation regardless of mode 
    effects = np.zeros(shape=(len(Houses), 1)) # start at zero
    history = pd.DataFrame()
    previous_sum = np.inf
    y = ydf['y'].values
    y.shape = (len(y), 1) # column vector
    iter_count = 0;
    # --- iterative fitting process ...
    #   note: this is only a quick and dirty approximation
    print('--> About to iterate: {}-{}'.format(mode, period))
    while True: 
        iter_count += 1

        # --- calculate new hidden states, 
        #     update estimate of house effects 
        #     and calculate error squared
        ydf['adjusted_y'] = y + dot(H, effects) # matrix arithmetic 
        hidden_state = estimate_hidden_states(ydf, mode, period, n_days)
        effects = calculate_house_effects(Houses, ydf, hidden_state)
        current_sum = sum_error_squared(hidden_state, ydf)
        if iter_count > 1:
            minima = get_minima(history)
            minima = np.inf
        # Note: minima does not include current_sum 

        # --- remember where we have been - puts current_sum into history
        house_effects = note_house_effects(effects, Houses, 
            mode, period, iter_count, current_sum)
        history = history.append(house_effects)
        print('--\n', house_effects)
        # --- exit when we are no longer making much difference
        margin = 0.000000000001
        if np.abs(current_sum - minima) < margin or np.abs(
                current_sum - previous_sum) < margin:
            # near enough to a minima
        # --- end loop tidy-ups
        previous_sum = current_sum

    # --- exit
    return (iter_count, history, y)

def chart(sdf=None, title=None, y_label=None, annotation='', file_prefix=None):    
    ax = sdf.plot()
    ax.text(0.01, 0.99, annotation,
        ha='left', va='top', fontsize='xx-small', 
        color='#333333', transform = ax.transAxes) 

    fig = ax.figure
    fig.set_size_inches(8, 4)
    fig.text(0.99, 0.01, '',
        ha='right', va='bottom', fontsize='x-small', 
        fontstyle='italic', color='#999999') 
    fig.savefig(graph_dir+file_prefix+'.png', dpi=125) 
    return (fig, ax)

# --- collect the model data
# the XL data file was extracted from the Wikipedia
# page on the next Australian Federal Election
workbook = pd.ExcelFile('./Data/poll-data.xlsx')
df = workbook.parse('Data')

# drop pre-2016 election data
df['MidDate'] = [pd.Period(date, freq='D') for date in df['MidDate']]
df = df[df['MidDate'] > pd.Period('2016-07-04', freq='D')] 

# convert dates to days from start
start = df['MidDate'].min()  # day zero
df['Day'] = df['MidDate'] - start # day number for each poll
n_days = df['Day'].max() + 1

# treat later Newspoll as a seperate series 
# for change to one nation treatment
df['Firm'] = df['Firm'].where((df['MidDate'] < pd.Period('2017-12-01', freq='D')) |
    (df['Firm'] != 'Newspoll'), other='Newspoll2')

df = df.sort_values(by='Day')
df.index = range(len(df)) # reindex, just to be sure

# --- do for a number of different HMAs and LOWESS functions
Adjustments = pd.DataFrame()
Hidden_States = pd.DataFrame()

for mode in MODALITIES:

    if mode == 'HMA':
    elif mode == 'LOWESS':

    for period in PERIODS:

        # --- initialisation - in preparation to fit iteratively
        ydf = pd.DataFrame([df['TPP L/NP'] / 100.0, df['Day'], df['Firm']], 
            index=['y', 'Day', 'Firm']).T
        H = pd.get_dummies(df['Firm']) # House Effects dummy var matrix
        Houses = H.columns
        H = H.as_matrix()
        # --- undertake the analysis ...
        (iter_count, history, y) = curve_fit(Houses, H, mode, period, ydf, n_days)
        # --- record the minima generating house effects 
        minima = get_minima(history)
        (iter_num, effects) = get_details(minima, history, Houses)
        ydf['adjusted_y'] = y + dot(H, effects)
        hidden_state = estimate_hidden_states(ydf, mode, period, n_days)
        Hidden_States['{}-{}'.format(mode, period)] = hidden_state
        sum = sum_error_squared(hidden_state, ydf)
        effects = calculate_house_effects(Houses, ydf, hidden_state)
        house_effects = note_house_effects(effects, Houses, 
            mode, period, iter_count, sum)
        Adjustments = Adjustments.append(house_effects)
        # get back to something useful
        hidden_state *= 100.0
        hidden_state.index = [(start + x).to_timestamp().date() 
        for x in hidden_state.index]
        fax = chart(sdf=hidden_state, 
            title='{}-{}: Coalition TPP Vote Share'.format(mode, period), 
            y_label='Percent Coalition TPP Vote', 
            file_prefix=graph_leader+'{}-{}'.format(mode, period))

        fax = mg_min_max_end(series=hidden_state, fax=fax,
            '{}-{} !annotated.png'.format(mode, period))
        print('\n-- FOUND --\n', house_effects, '\n-----------\n')

# --- get an Adjustments summary in pro-Coalition percentage points
Adjustments.index = Adjustments['Model']
AdjustmentsX = Adjustments[Houses] * -100 # Note: bias = -treatment * 100%
#AdjustmentsX['Total'] = AdjustmentsX.sum(axis=1)
AdjustmentsX['Iter'] = Adjustments['Iterations']
AdjustmentsX['Sum Errors Squared'] = Adjustments['Error Sq Sum']

# --- Plot
Hidden_States *= 100.0 # from proportions back to percent
Bayes_TPP = pd.read_csv(intermediate_data_dir+
    header=0, index_col=0, quotechar='"', sep=',', 
    na_values = ['na', '-', '.', '']) 
Hidden_States['Bayes'] = Bayes_TPP['median']
Hidden_States.index = [(start + x).to_timestamp().date() 
        for x in Hidden_States.index]

# allow us to annotate the end points
endpoints = Hidden_States[-1:].copy().round(1)
endpoints = 'Endpoints: ' + '; '.join([x+': '+str(y)+'%' 
    for x,y in zip(endpoints.columns, endpoints[0:1].values[0])])

# and plot ...
fig, ax = chart(sdf=Hidden_States, title='Coalition TPP Poll Aggregates [Sum(HE)==0]', 
    y_label='Percent Coalition TPP Vote', annotation=endpoints, 

combination = ', '.join(Hidden_States.columns.values)
Hidden_States['Average'] = Hidden_States.mean(axis=1)

fax = chart(sdf=Hidden_States['Average'], 
    title='Combined Coalition Aggregate TPP Vote Share', 
    y_label='Percent Coalition TPP Vote', annotation=combination, 
    file_prefix=graph_leader+'!Comparative !combined')

fax = mg_min_max_end(series=Hidden_States['Average'], fax=fax,
    filename=graph_dir+graph_leader+'!Comparative !combined !annotated.png')

No comments:

Post a Comment