Sunday, November 6, 2016

How are seat swings distributed around the national swing?

If we take all nine Federal elections since 1993 (that is the Federal elections in 1993, 1996, 1998, 2001, 2004, 2007, 2010, 2013 and 2016) we can look at the distribution of individual seat swings around the national swing for each election. This information is useful for Monte Carlo simulations of future elections. The summary statistics for this analysis follow.

    count    1342.000000
    mean       -0.014416
    std         3.224886
    min       -14.250000
    25%        -2.067500
    50%        -0.080000
    75%         1.990000
    max        17.350000

Let's start with a normal distribution with a mean of -0.014 and a standard deviation of 3.22 percentage points (as suggested by the raw data). It yields a plot as follows (where the normal curve is fitted to the data in red).


However, there is a potential problem with this normal distribution. There are too many outliers for a normal distribution. In the area between -3 and +3 standard deviations from the mean we would expect to see 99.73 per cent of all observations if the data was normally distributed. We would only expect to see 0.27 per cent of the observations outside of this range. We actually have 0.82 per cent of our observations outside of this range.

We also have more observations bunched in the middle of the distribution. For a normal distribution, you would expect to find 68.27 per cent of the observations between -1 and +1 standard deviations from the mean. We have 70.19 per cent of our observations in the middle of the distribution.

Not surprisingly therefore, the excess kurtosis statistic for the observations is positive: 1.15 (using Fisher’s definition of kurtosis, where the normal curve has a kurtosis of zero). From our sample, it would appear that the distribution of seat swings around the national swing is a touch leptokurtic. More observations than normal are clustered around the mean, but there is also a higher probability than normal of substantial outliers occurring from time to time (which is sometimes referred to as a fat tail risk distortion).

Given the fat tails, I tried a couple of other distributions to see if they would provide a better fit than the Gaussian normal distribution. My attempt to fit a Cauchy distribution - see next chart - yielded a poorer fit (as measured by the sum of squared errors). 


However, Student's t-distribution (next chart) was an improvement on the normal distribution in terms of fit. When it comes to a Monte Carlo simulation of election outcomes for individual seats around a national swing, the t-distribution looks the most promising. The parameters for this t-distribution are: df=10.5318772642; loc= -0.0531408340653; and scale=2.89669817523. I suspect it is simply an artifact of the raw data that the location parameter is not zero.


The relationship between the t and normal distributions can be seen in the next chart.


It is worth considering the outliers that lie beyond three standard deviations from the mean. The complete list follows. The critical column is the Adjusted Swing column, which is the swing for the seats minus the national swing (to Labor). There are interesting stories with many of these unusually large swings.

             Division StateAb   Swing  AdjSwing
1993-56       Calwell     Vic   11.37      9.83
1993-80   Maribyrnong     Vic   11.51      9.97
1993-87         Wills     Vic   12.27     10.73
1996-108        Oxley     Qld  -19.31    -14.25
1996-86         Wills     Vic   12.29     17.35
1998-112     Wide Bay     Qld   15.32     10.71
2010-53        Fowler     NSW  -13.81    -11.23
2010-82      Kingston      SA    9.49     12.07
2013-53        Fowler     NSW    8.04     11.65
2013-92         Lyons     TAS  -13.51     -9.90
2016-21          Burt      WA    13.2     10.07

The python code for this analysis is pretty rough. A fair bit of effort went into munging the raw data from the Australian Electoral Commission (AEC) into something that I could work with. The data format changed a number of times over the years. Swings prior to the 2004 election were always reported from the perspective of the Labor party. From the 2004 election, they are reported from the perspective of the government of the day (going into the election). State abbreviations were not used in the earlier data files from the AEC. There is also quite a bit of arcane code dedicated to statistical tests.

import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
from scipy import stats  

plt.style.use('../bin/markgraph.mplstyle')

# --- get data

# - a quick and dirty labeling function
def state_label(df):
    states = ['New South Wales', 'Victoria', 'Queensland', 'Western Australia', 
        'South Australia', 'Tasmania', 'Australian Capital Territory', 
        'Northern Territory'] # note: state order is important
    state_ab = ['NSW', 'Vic', 'Qld', 'WA', 'SA', 'Tas', 'ACT', 'NT']

    df['StateAb'] = None
    df.iloc[0, 0] = 'New South Wales'
    for name, ab in zip(states, state_ab):
        position = df[df['Division'] == name].index[0]
        df.loc[position:, 'StateAb'] = ab

    return(df)

# - get the very old data - from ugly tab separated files
# Note: some seriously ugly munging occurring here
# Note: edited the 2001 data file to remove trailing tabs
# Note: Swings in this series are from the perspective of the Labor Party
# Note: the National Swing in the 2001 data file appears incorrect
d = {} # dictionary of divisions
old_years =      ['1993', '1996', '1998', '2001']
old_nat_swings = [1.54,   -5.06,  4.61,   -1.93] # positive = to Labor
start_rows = [17, 17, 19, 19]
previous = None

for year, start, swing in zip(old_years, start_rows, old_nat_swings):
    print('Loading: ' + year + ' ================')

    # get raw data
    d[year] = pd.read_table('./Data/' + year + '-V1_9.TXT', header=start, 
        index_col=None, quotechar='"', sep='\t', na_values = ['na', '-', '.', ''])
    
    # munge the data into shape
    d[year] = state_label(d[year])
    d[year] = d[year].dropna(axis=1, how='all') # drop empty columns
    d[year] = d[year].dropna() # drop empty rows
    d[year] = d[year][~d[year]['Division'].str.contains(' Total')] # drop state totals
    
    # some swing checks ...
    current = (d[year]['ALP'].astype(float).sum() /  
        d[year]['Total'].astype(float).sum() * 100.0)
    if previous :
        print('Swing: ', current - previous)
    previous = current
    
    # get relative national swing for each seat
    d[year]['AdjSwing'] = d[year]['Swing'].astype(float) - swing # positive Labor
    d[year] = d[year][['Division', 'StateAb', 'Swing', 'AdjSwing']]
    d[year].index = year + '-' + pd.Series(range(len(d[year]))).astype(str)     

# - get the more recent data - in CSV files
# Note: Swings in this series are from the perspective of the govt. of the day
new_years =      ['2004',  '2007',  '2010',  '2013',  '2016']
new_nat_swings = [-1.79,   5.44,    -2.58,   -3.61,   3.13] # positive = to Labor
new_suffix =     ['12246', '13745', '15508', '17496', '20499']
reversables =    [True,    True,    False,   False,   True]

for year, suffix, swing, r in zip(new_years, new_suffix, new_nat_swings, reversables):
    print('Loading: ' + year + ' ================')
    
    # get the raw data
    d[year] = pd.read_csv('./Data/HouseTppByDivisionDownload-'+suffix+'.csv', 
    header=1, index_col=None, quotechar='"',sep=',', na_values = ['na', '-', '.', ''])
    
    # munge the data into shape
    d[year].rename(columns={'DivisionNm':'Division'}, inplace=True)
    if r:
        d[year]['Swing'] = -d[year]['Swing'] # from the perspective of swing to Labor
        
    # some swing checks ...
    current = (d[year]['Australian Labor Party Votes'].astype(float).sum() /  
        d[year]['TotalVotes'].astype(float).sum() * 100.0)
    if previous :
        print('Swing: ', current - previous)
    previous = current
    
    # get relative national swing for each seat
    d[year]['AdjSwing'] = d[year]['Swing'].astype(float) - swing # positive = Labor bias
    d[year] = d[year][['Division', 'StateAb', 'Swing', 'AdjSwing']]
    d[year].index = year + '-' + pd.Series(range(len(d[year]))).astype(str) # re-index

# - combine the DataFrames
combined = pd.DataFrame()
for year in sorted(d) :
    combined = combined.append(d[year])

# --- some summary statistics
print(combined['AdjSwing'].describe())
print('Kurtosis: ', combined['AdjSwing'].kurtosis())

# - fit a normal distribution
print('Fit a distribution ================')
# let's histogram the original data
BINS = 100
y, x = np.histogram(combined['AdjSwing'], bins=BINS, normed=True)
x = (x + np.roll(x, -1))[:-1] / 2.0

# let's try the normal distribution
mean, stdev = stats.norm.fit(combined['AdjSwing'])
pdf_gauss = stats.norm.pdf(x, mean, stdev)
sse_gauss = np.sum(np.power(y - pdf_gauss, 2.0)) # sum squared error
print('SSE for normal PDF: ', sse_gauss)

# let's try the Cauchy distribution
loc, scale = stats.cauchy.fit(combined['AdjSwing'])
pdf_cauchy = stats.cauchy.pdf(x, loc, scale)
sse_cauchy = np.sum(np.power(y - pdf_cauchy, 2.0)) # sum squared error
print('SSE for Cauchy PDF: ', sse_cauchy)

# let's try the t distribution
shape, loc, scale = stats.t.fit(combined['AdjSwing'])
pdf_t = stats.t.pdf(x, shape, loc, scale)
sse_t = np.sum(np.power(y - pdf_t, 2.0)) # sum squared error
print('SSE for t PDF: ', sse_t)
print('t parameters df, loc, scale: ', shape, loc, scale)

# - a quick look at the outliers
print('A quick look at outliers ================')
outliers = combined[combined['AdjSwing'].abs() > stdev * 3]
proportion = len(outliers) / len(combined) 
print(outliers)
print("3 * stdev outlier proportion: ", proportion, 
    '; which compares with a theoretical 0.0027')
outliers = combined[combined['AdjSwing'].abs() > stdev * 2]
proportion = len(outliers) / len(combined) 
print("2 * stdev outlier proportion: ", proportion, 
    '; which compares with a theoretical 0.0455')
outliers = combined[combined['AdjSwing'].abs() > stdev]
proportion = len(outliers) / len(combined) 
print("1 * stdev outlier proportion: ", proportion, 
    '; which compares with a theoretical 0.3137')

# --- and plot ...

# - plot data fitted to normal 
ax = combined['AdjSwing'].hist(bins=BINS, normed=True, color='cornflowerblue')
ax.plot(x, pdf_gauss, color='r')
ax.set_title("Seat swings distributed around the national TPP swing (93-16)")
ax.set_xlabel('Seat Swing minus National Swing (percentage points to Labor)') 
ax.set_ylabel('Probability')
ax.axvline(0, color='#999999', linewidth=0.5)

fig = ax.figure
fig.text(0.55, 0.75, 'Normal distribution', fontsize='small', color='red')
fig.tight_layout(pad=1)
fig.text(0.99, 0.01, 'marktheballot.blogspot.com.au', ha='right', va='bottom',
    fontsize='x-small', fontstyle='italic', color='#999999')

fig.savefig('./graphs/national_swing_hist_normal.png', dpi=125)
plt.close() 

# - plot data fitted to Cauchy 
ax = combined['AdjSwing'].hist(bins=BINS, normed=True, color='cornflowerblue')
ax.plot(x, pdf_cauchy, color='r')
ax.set_title("Seat swings distributed around the national TPP swing (93-16)")
ax.set_xlabel('Seat Swing minus National Swing (percentage points to Labor)') 
ax.set_ylabel('Probability')
ax.axvline(0, color='#999999', linewidth=0.5)

fig = ax.figure
fig.tight_layout(pad=1)
fig.text(0.55, 0.75, 'Cauchy distribution', fontsize='small', color='red')
fig.text(0.99, 0.01, 'marktheballot.blogspot.com.au', ha='right', va='bottom',
    fontsize='x-small', fontstyle='italic', color='#999999')

fig.savefig('./graphs/national_swing_hist_cauchy.png', dpi=125)
plt.close() 

# - plot data fitted to t 
ax = combined['AdjSwing'].hist(bins=BINS, normed=True, color='cornflowerblue')
ax.plot(x, pdf_t, color='r')
ax.set_title("Seat swings distributed around the national TPP swing (93-16)")
ax.set_xlabel('Seat Swing minus National Swing (percentage points to Labor)') 
ax.set_ylabel('Probability')
ax.axvline(0, color='#999999', linewidth=0.5)

fig = ax.figure
fig.tight_layout(pad=1)
fig.text(0.55, 0.75, "Student's t distribution", fontsize='small', color='red')
fig.text(0.99, 0.01, 'marktheballot.blogspot.com.au', ha='right', va='bottom',
    fontsize='x-small', fontstyle='italic', color='#999999')

fig.savefig('./graphs/national_swing_hist_t.png', dpi=125)
plt.close() 

# - plot data fitted to normal and t 
ax = combined['AdjSwing'].hist(bins=BINS, normed=True, color='cornflowerblue')
ax.plot(x, pdf_gauss, color='blue')
ax.plot(x, pdf_t, color='red')
ax.set_title("Seat swings distributed around the national TPP swing (93-16)")
ax.set_xlabel('Seat Swing minus National Swing (percentage points to Labor)') 
ax.set_ylabel('Probability')
ax.axvline(0, color='#999999', linewidth=0.5)

fig = ax.figure
fig.tight_layout(pad=1)
fig.text(0.55, 0.70, "Normal distribution", fontsize='small', color='blue')
fig.text(0.55, 0.75, "Student's t distribution", fontsize='small', color='red')
fig.text(0.99, 0.01, 'marktheballot.blogspot.com.au', ha='right', va='bottom',
    fontsize='x-small', fontstyle='italic', color='#999999')

fig.savefig('./graphs/national_swing_hist_normal+t.png', dpi=125)
plt.close() 

No comments:

Post a Comment