count 1342.000000 mean -0.014416 std 3.224886 min -14.250000 25% -2.067500 50% -0.080000 75% 1.990000 max 17.350000

Let's start with a normal distribution with a mean of -0.014 and a standard deviation of 3.22 percentage points (as suggested by the raw data). It yields a plot as follows (where the normal curve is fitted to the data in red).

However, there is a potential problem with this normal distribution. There are too many outliers for a normal distribution. In the area between -3 and +3 standard deviations from the mean we would expect to see 99.73 per cent of all observations if the data was normally distributed. We would only expect to see 0.27 per cent of the observations outside of this range. We actually have 0.82 per cent of our observations outside of this range.

We also have more observations bunched in the middle of the distribution. For a normal distribution, you would expect to find 68.27 per cent of the observations between -1 and +1 standard deviations from the mean. We have 70.19 per cent of our observations in the middle of the distribution.

Not surprisingly therefore, the excess kurtosis statistic for the observations is positive: 1.15 (using Fisher’s definition of kurtosis, where the normal curve has a kurtosis of zero). From our sample, it would appear that the distribution of seat swings around the national swing is a touch leptokurtic. More observations than normal are clustered around the mean, but there is also a higher probability than normal of substantial outliers occurring from time to time (which is sometimes referred to as a fat tail risk distortion).

Given the fat tails, I tried a couple of other distributions to see if they would provide a better fit than the Gaussian normal distribution. My attempt to fit a Cauchy distribution - see next chart - yielded a poorer fit (as measured by the sum of squared errors).

However, Student's t-distribution (next chart) was an improvement on the normal distribution in terms of fit. When it comes to a Monte Carlo simulation of election outcomes for individual seats around a national swing, the t-distribution looks the most promising. The parameters for this t-distribution are: df=10.5318772642; loc= -0.0531408340653; and scale=2.89669817523. I suspect it is simply an artifact of the raw data that the location parameter is not zero.

The relationship between the t and normal distributions can be seen in the next chart.

It is worth considering the outliers that lie beyond three standard deviations from the mean. The complete list follows. The critical column is the Adjusted Swing column, which is the swing for the seats minus the national swing (to Labor). There are interesting stories with many of these unusually large swings.

Division StateAb Swing AdjSwing 1993-56 Calwell Vic 11.37 9.83 1993-80 Maribyrnong Vic 11.51 9.97 1993-87 Wills Vic 12.27 10.73 1996-108 Oxley Qld -19.31 -14.25 1996-86 Wills Vic 12.29 17.35 1998-112 Wide Bay Qld 15.32 10.71 2010-53 Fowler NSW -13.81 -11.23 2010-82 Kingston SA 9.49 12.07 2013-53 Fowler NSW 8.04 11.65 2013-92 Lyons TAS -13.51 -9.90 2016-21 Burt WA 13.2 10.07

The python code for this analysis is pretty rough. A fair bit of effort went into munging the raw data from the Australian Electoral Commission (AEC) into something that I could work with. The data format changed a number of times over the years. Swings prior to the 2004 election were always reported from the perspective of the Labor party. From the 2004 election, they are reported from the perspective of the government of the day (going into the election). State abbreviations were not used in the earlier data files from the AEC. There is also quite a bit of arcane code dedicated to statistical tests.

import pandas as pd import numpy as np from pandas import Series, DataFrame import matplotlib.pyplot as plt from scipy import stats plt.style.use('../bin/markgraph.mplstyle') # --- get data # - a quick and dirty labeling function def state_label(df): states = ['New South Wales', 'Victoria', 'Queensland', 'Western Australia', 'South Australia', 'Tasmania', 'Australian Capital Territory', 'Northern Territory'] # note: state order is important state_ab = ['NSW', 'Vic', 'Qld', 'WA', 'SA', 'Tas', 'ACT', 'NT'] df['StateAb'] = None df.iloc[0, 0] = 'New South Wales' for name, ab in zip(states, state_ab): position = df[df['Division'] == name].index[0] df.loc[position:, 'StateAb'] = ab return(df) # - get the very old data - from ugly tab separated files # Note: some seriously ugly munging occurring here # Note: edited the 2001 data file to remove trailing tabs # Note: Swings in this series are from the perspective of the Labor Party # Note: the National Swing in the 2001 data file appears incorrect d = {} # dictionary of divisions old_years = ['1993', '1996', '1998', '2001'] old_nat_swings = [1.54, -5.06, 4.61, -1.93] # positive = to Labor start_rows = [17, 17, 19, 19] previous = None for year, start, swing in zip(old_years, start_rows, old_nat_swings): print('Loading: ' + year + ' ================') # get raw data d[year] = pd.read_table('./Data/' + year + '-V1_9.TXT', header=start, index_col=None, quotechar='"', sep='\t', na_values = ['na', '-', '.', '']) # munge the data into shape d[year] = state_label(d[year]) d[year] = d[year].dropna(axis=1, how='all') # drop empty columns d[year] = d[year].dropna() # drop empty rows d[year] = d[year][~d[year]['Division'].str.contains(' Total')] # drop state totals # some swing checks ... current = (d[year]['ALP'].astype(float).sum() / d[year]['Total'].astype(float).sum() * 100.0) if previous : print('Swing: ', current - previous) previous = current # get relative national swing for each seat d[year]['AdjSwing'] = d[year]['Swing'].astype(float) - swing # positive Labor d[year] = d[year][['Division', 'StateAb', 'Swing', 'AdjSwing']] d[year].index = year + '-' + pd.Series(range(len(d[year]))).astype(str) # - get the more recent data - in CSV files # Note: Swings in this series are from the perspective of the govt. of the day new_years = ['2004', '2007', '2010', '2013', '2016'] new_nat_swings = [-1.79, 5.44, -2.58, -3.61, 3.13] # positive = to Labor new_suffix = ['12246', '13745', '15508', '17496', '20499'] reversables = [True, True, False, False, True] for year, suffix, swing, r in zip(new_years, new_suffix, new_nat_swings, reversables): print('Loading: ' + year + ' ================') # get the raw data d[year] = pd.read_csv('./Data/HouseTppByDivisionDownload-'+suffix+'.csv', header=1, index_col=None, quotechar='"',sep=',', na_values = ['na', '-', '.', '']) # munge the data into shape d[year].rename(columns={'DivisionNm':'Division'}, inplace=True) if r: d[year]['Swing'] = -d[year]['Swing'] # from the perspective of swing to Labor # some swing checks ... current = (d[year]['Australian Labor Party Votes'].astype(float).sum() / d[year]['TotalVotes'].astype(float).sum() * 100.0) if previous : print('Swing: ', current - previous) previous = current # get relative national swing for each seat d[year]['AdjSwing'] = d[year]['Swing'].astype(float) - swing # positive = Labor bias d[year] = d[year][['Division', 'StateAb', 'Swing', 'AdjSwing']] d[year].index = year + '-' + pd.Series(range(len(d[year]))).astype(str) # re-index # - combine the DataFrames combined = pd.DataFrame() for year in sorted(d) : combined = combined.append(d[year]) # --- some summary statistics print(combined['AdjSwing'].describe()) print('Kurtosis: ', combined['AdjSwing'].kurtosis()) # - fit a normal distribution print('Fit a distribution ================') # let's histogram the original data BINS = 100 y, x = np.histogram(combined['AdjSwing'], bins=BINS, normed=True) x = (x + np.roll(x, -1))[:-1] / 2.0 # let's try the normal distribution mean, stdev = stats.norm.fit(combined['AdjSwing']) pdf_gauss = stats.norm.pdf(x, mean, stdev) sse_gauss = np.sum(np.power(y - pdf_gauss, 2.0)) # sum squared error print('SSE for normal PDF: ', sse_gauss) # let's try the Cauchy distribution loc, scale = stats.cauchy.fit(combined['AdjSwing']) pdf_cauchy = stats.cauchy.pdf(x, loc, scale) sse_cauchy = np.sum(np.power(y - pdf_cauchy, 2.0)) # sum squared error print('SSE for Cauchy PDF: ', sse_cauchy) # let's try the t distribution shape, loc, scale = stats.t.fit(combined['AdjSwing']) pdf_t = stats.t.pdf(x, shape, loc, scale) sse_t = np.sum(np.power(y - pdf_t, 2.0)) # sum squared error print('SSE for t PDF: ', sse_t) print('t parameters df, loc, scale: ', shape, loc, scale) # - a quick look at the outliers print('A quick look at outliers ================') outliers = combined[combined['AdjSwing'].abs() > stdev * 3] proportion = len(outliers) / len(combined) print(outliers) print("3 * stdev outlier proportion: ", proportion, '; which compares with a theoretical 0.0027') outliers = combined[combined['AdjSwing'].abs() > stdev * 2] proportion = len(outliers) / len(combined) print("2 * stdev outlier proportion: ", proportion, '; which compares with a theoretical 0.0455') outliers = combined[combined['AdjSwing'].abs() > stdev] proportion = len(outliers) / len(combined) print("1 * stdev outlier proportion: ", proportion, '; which compares with a theoretical 0.3137') # --- and plot ... # - plot data fitted to normal ax = combined['AdjSwing'].hist(bins=BINS, normed=True, color='cornflowerblue') ax.plot(x, pdf_gauss, color='r') ax.set_title("Seat swings distributed around the national TPP swing (93-16)") ax.set_xlabel('Seat Swing minus National Swing (percentage points to Labor)') ax.set_ylabel('Probability') ax.axvline(0, color='#999999', linewidth=0.5) fig = ax.figure fig.text(0.55, 0.75, 'Normal distribution', fontsize='small', color='red') fig.tight_layout(pad=1) fig.text(0.99, 0.01, 'marktheballot.blogspot.com.au', ha='right', va='bottom', fontsize='x-small', fontstyle='italic', color='#999999') fig.savefig('./graphs/national_swing_hist_normal.png', dpi=125) plt.close() # - plot data fitted to Cauchy ax = combined['AdjSwing'].hist(bins=BINS, normed=True, color='cornflowerblue') ax.plot(x, pdf_cauchy, color='r') ax.set_title("Seat swings distributed around the national TPP swing (93-16)") ax.set_xlabel('Seat Swing minus National Swing (percentage points to Labor)') ax.set_ylabel('Probability') ax.axvline(0, color='#999999', linewidth=0.5) fig = ax.figure fig.tight_layout(pad=1) fig.text(0.55, 0.75, 'Cauchy distribution', fontsize='small', color='red') fig.text(0.99, 0.01, 'marktheballot.blogspot.com.au', ha='right', va='bottom', fontsize='x-small', fontstyle='italic', color='#999999') fig.savefig('./graphs/national_swing_hist_cauchy.png', dpi=125) plt.close() # - plot data fitted to t ax = combined['AdjSwing'].hist(bins=BINS, normed=True, color='cornflowerblue') ax.plot(x, pdf_t, color='r') ax.set_title("Seat swings distributed around the national TPP swing (93-16)") ax.set_xlabel('Seat Swing minus National Swing (percentage points to Labor)') ax.set_ylabel('Probability') ax.axvline(0, color='#999999', linewidth=0.5) fig = ax.figure fig.tight_layout(pad=1) fig.text(0.55, 0.75, "Student's t distribution", fontsize='small', color='red') fig.text(0.99, 0.01, 'marktheballot.blogspot.com.au', ha='right', va='bottom', fontsize='x-small', fontstyle='italic', color='#999999') fig.savefig('./graphs/national_swing_hist_t.png', dpi=125) plt.close() # - plot data fitted to normal and t ax = combined['AdjSwing'].hist(bins=BINS, normed=True, color='cornflowerblue') ax.plot(x, pdf_gauss, color='blue') ax.plot(x, pdf_t, color='red') ax.set_title("Seat swings distributed around the national TPP swing (93-16)") ax.set_xlabel('Seat Swing minus National Swing (percentage points to Labor)') ax.set_ylabel('Probability') ax.axvline(0, color='#999999', linewidth=0.5) fig = ax.figure fig.tight_layout(pad=1) fig.text(0.55, 0.70, "Normal distribution", fontsize='small', color='blue') fig.text(0.55, 0.75, "Student's t distribution", fontsize='small', color='red') fig.text(0.99, 0.01, 'marktheballot.blogspot.com.au', ha='right', va='bottom', fontsize='x-small', fontstyle='italic', color='#999999') fig.savefig('./graphs/national_swing_hist_normal+t.png', dpi=125) plt.close()

## No comments:

## Post a Comment