Accurate Prediction of Electoral Outcomes
Dhruv Madeka

TL;DR
This paper introduces new probabilistic models and scoring methods for predicting electoral outcomes, aiming to improve forecast accuracy and evaluation by leveraging diffusion processes and online learning techniques.
Contribution
It presents novel diffusion and online learning models for election prediction, along with density-based scoring functions for better forecast evaluation.
Findings
Diffusion model effectively captures poll uncertainty over time.
Online learning combined with new scoring functions improves forecast accuracy.
Density-based scoring functions provide a comprehensive assessment of forecast quality.
Abstract
We present novel methods for predicting the outcome of large elections. Our first algorithm uses a diffusion process to model the time uncertainty inherent in polls taken with substantial calendar time left to the election. Our second model uses Online Learning along with a novel ex-ante scoring function to combine different forecasters along with our first model. We evaluate different density based scoring functions that can be used to better judge the efficacy of forecasters. We also propose scoring functions which take into account the entire density of the forecast rather than just a point estimate of the value. Finally, we consider this framework as a way to improve and judge different models performing a prediction on the same task.
1. Motivation
Models of the presidential election are many and varied, each with its own focus, and there is a vast literature on methods to forecast presidential elections. Models include those based on fundamental factors [5], Bayesian methods [7], and prediction markets [4]. Other models, including the popular FiveThirtyEight [15], combine polls with other data through hybrid models. However, each of these methods suffers from the same flaw: they do not incorporate the time uncertainty of the outcome. Using the same distribution at every point in time ignores the fact that a measurement made with greater calendar time to the election has more uncertainty than one made closer to the election (in filtration terms, a large amount of uncertainty can still be realized).
Consider Figure 1, which shows the FiveThirtyEight probability time series. They report a 90% probability in August and a 55% probability in October. If a forecaster truly believes that the probabilities move this much, then he should report 50%. (Footnote 1: If we consider a Gaussian $X \sim N(\mu, \sigma^2)$ and pick a level $K$, then as we increase the variance, $\sigma^2 \to \infty$, the probability $P(X > K) \to \frac{1}{2}$. A rigorous proof and better elucidation of this can be seen in [16].) The greater the calendar time to the election (or event), the more time there is for uncertainty to be realized. The absence of this time component of uncertainty renders the model unstable. We propose a simple model which utilizes a Brownian Motion with volatility. The Brownian Motion is a continuous-time stochastic process whose variance grows linearly with time. As a result, the greater the calendar time, the more uncertainty there is in the final realization. Inspired by the seminal CAPM model in finance, we propose a CAPM model for elections, treating each of the 50 states as a stock and the national popular vote as the market.
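The limiting argument in the footnote can be checked numerically. The sketch below is illustrative only: the 5-point lead and the list of standard deviations are hypothetical numbers, and `prob_above` is a helper name introduced here, not something from the paper.

```python
import math


def prob_above(level, mean, std):
    """P(X > level) for X ~ Normal(mean, std**2), via the complementary error function."""
    z = (level - mean) / (std * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


# Hypothetical setting: a candidate leads by 5 points today, and we ask for the
# probability that the final spread ends above 0 as the uncertainty grows.
for std in (1.0, 5.0, 25.0, 125.0):
    print(f"std={std:6.1f}  P(win)={prob_above(0.0, 5.0, std):.4f}")
```

As the standard deviation grows, the reported probability drifts from near certainty toward one half, which is the behaviour the footnote describes.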
Second, we address the notion of comparing forecasters. The conventional method used, the Brier Score, fails to capture different aspects of forecasts that distinguish a skilled forecaster from an unskilled one. We propose the use of two methods to compare forecasters according to the density reported rather than just the probability. However, all of these comparisons are ex-post, namely they rely on the realization of the event. Ideally, we would like to be able to compare forecasters before the event so that we may trade the prediction markets on the event. We use an ex-ante method to compare forecasters in terms of a trading strategy, and then apply this combination to the US Election Betting Market. Our online mixture performs better than most experts.
2. Preliminaries
2.1. Brownian Motion
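We work on a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \geq 0}, \mathbb{P})$. We define a Brownian Motion in the following way:
Definition 1.
Any continuous-time stochastic process $\{W_t\}_{t \geq 0}$ is a Brownian motion if it has the following properties [6]:
- For $0 \leq t_0 < t_1 < \dots < t_n$, the increments
$$W_{t_1} - W_{t_0}, \; W_{t_2} - W_{t_1}, \; \dots, \; W_{t_n} - W_{t_{n-1}}$$
are independent
- $W_t - W_s$ is distributed $N(0, t - s)$ for $0 \leq s < t$, where $N(\mu, \sigma^2)$ is the Normal Distribution with mean $\mu$ and variance $\sigma^2$
- the process has almost surely continuous paths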
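The defining properties can be illustrated with a small simulation. This is a minimal sketch under assumptions made here (a unit time horizon, 100 steps per path, 4000 paths, a fixed seed); it builds each path from independent Gaussian increments and checks empirically that the variance at time $t$ is close to $t$.

```python
import math
import random
import statistics


def simulate_brownian_path(horizon, n_steps, rng):
    """Discretized Brownian path on [0, horizon], started at 0."""
    dt = horizon / n_steps
    path = [0.0]
    for _ in range(n_steps):
        # Independent increment with mean 0 and variance dt.
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path


rng = random.Random(0)
paths = [simulate_brownian_path(1.0, 100, rng) for _ in range(4000)]

# Index 50 corresponds to t = 0.5 and index 100 to t = 1.0.
var_half = statistics.pvariance([p[50] for p in paths])
var_full = statistics.pvariance([p[100] for p in paths])
print(f"Var[W_0.5] ~ {var_half:.3f}, Var[W_1.0] ~ {var_full:.3f}")
```

The empirical variance roughly doubles when the horizon doubles, which is the linear growth in uncertainty that the model relies on.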
2.2. Proper Scoring Functions
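Following [12], consider $(\Omega, \mathcal{A}, \mathcal{P})$, where $\mathcal{P}$ defines a convex class of probability measures on $(\Omega, \mathcal{A})$. A probabilistic forecast is any probability measure $P \in \mathcal{P}$. A scoring function is any extended real-valued function $S : \mathcal{P} \times \Omega \to \bar{\mathbb{R}}$ such that $S(P, \cdot)$ is integrable for all $P$ in $\mathcal{P}$. We define the expected score under $Q$ as:
$$S(P, Q) = \int S(P, \omega) \, dQ(\omega)$$
A scoring function is said to be proper relative to $\mathcal{P}$ if:
$$S(Q, Q) \geq S(P, Q)$$
for all $P, Q \in \mathcal{P}$. It is strictly proper if the equality holds if and only if $P = Q$.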
2.3. Construction of Proper Scoring Functions
We know, from [2], that a characterization of a Bregman Divergence is that:
[TABLE]
But, for a scoring function, where the forecaster reports Q, and his true probability measure is P, we have that:
[TABLE]
Hence, it is always optimal for a forecaster to report his true density function P. Thus, a proper scoring function is equal to the negative of some Bregman Divergence. But we know [3] that to construct a Bregman Divergence, we only need to pick a strictly convex function f and write:
$$D_f(p, q) = f(p) - f(q) - \langle \nabla f(q), \, p - q \rangle$$
where $\nabla f(q)$ denotes the sub-gradient of $f$ at $q$.
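To make the construction concrete, the sketch below evaluates the Bregman divergence for two standard choices of strictly convex function on scalars. The function names are introduced here for illustration. With $f(x) = x^2$ the divergence is the squared error that underlies the Brier Score, and with the negative binary entropy it is the Kullback-Leibler divergence that underlies the log score.

```python
import math


def bregman_divergence(f, grad_f, p, q):
    """D_f(p, q) = f(p) - f(q) - grad_f(q) * (p - q) for scalar arguments."""
    return f(p) - f(q) - grad_f(q) * (p - q)


def square(x):
    return x * x


def square_grad(x):
    return 2.0 * x


def neg_entropy(x):
    """Negative entropy of a Bernoulli(x) distribution, for 0 < x < 1."""
    return x * math.log(x) + (1.0 - x) * math.log(1.0 - x)


def neg_entropy_grad(x):
    return math.log(x / (1.0 - x))


def bernoulli_kl(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


p, q = 0.7, 0.4
print(bregman_divergence(square, square_grad, p, q), (p - q) ** 2)
print(bregman_divergence(neg_entropy, neg_entropy_grad, p, q), bernoulli_kl(p, q))
```

In both cases the divergence is zero only when the reported value equals the true one, which is the property that makes the corresponding scores strictly proper.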
3. Models
3.1. CAPM Model
Our model follows the intuition of the Nobel Prize-winning Capital Asset Pricing Model introduced by Treynor [17] [18] and Sharpe [14]. We treat each state as a stock, with the national popular vote playing the role of the market. To calibrate this model, we consider over 1000 polls obtained from RealClearPolitics and model each state’s popular vote as a GAM with the national polls as the single factor. Once each state has been calibrated, we treat the market as a continuous-time stochastic process which can take any number of values over the remaining period. Simulating from this model, together with state-specific noise for each state, we obtain a number of possible scenarios for each state. Computing the electoral votes won by each candidate then gives a winner for each scenario.
Consider an election with two candidates. At each time, we track the popular vote spread of the first candidate in each state, together with the national popular vote spread at that time. For each state, we assume that the spread follows the model:
[TABLE]
where is some well-behaved function (in our case a non-parametric regression).
Note that, unlike the traditional CAPM model, our equation is in levels rather than returns. This is mainly done to avoid the noise created by multiple polls which are very close to each other. We also make the assumption that the $\beta$ in Equation (2) does not depend on time, although a Kalman filter type methodology could easily be used to incorporate this into the model. We model the national popular vote in the following way:
[TABLE]
where denotes a canonical Brownian motion defined on the space , where denotes the time of the election. We further assume that the white noise .
We use Ordinary Least Squares regression to calibrate the $\beta$ for each state, and use the standard deviation of the residuals as the volatility of that state's noise. The data methodology is reviewed in Section 6. We use the standard deviation of the national spread as our best estimate of the national volatility. To forecast, we simulate 10,000 paths of the Brownian motion and, for each path, 50 state-specific noises. Doing this gives a popular vote estimate for each state in each simulation. Converting the popular vote in each state to electoral votes enables us to obtain a winner for each simulated election.
We define the probability of winning each state (i) as:
[TABLE]
We then count the total number of electoral votes for candidate 1 in each simulation and define the probability of candidate 1 winning the election as:
[TABLE]
The time series of probabilities can be seen in Figure 21. The simulations can be seen in Figure 22.
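The simulation procedure of this section can be sketched as follows. This is not the paper's implementation: the linear state model `alpha + beta * national + noise`, the driftless Gaussian move of the national spread over the remaining time, the three hypothetical states with their electoral votes, and all numeric values are assumptions made here for illustration.

```python
import math
import random


def ols_fit(xs, ys):
    """Ordinary least squares for y = alpha + beta * x; returns (alpha, beta, residual_std)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    alpha = mean_y - beta * mean_x
    residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
    residual_std = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return alpha, beta, residual_std


def simulate_election(states, national_now, national_vol, time_left, n_sims, rng):
    """Monte Carlo estimate of the win probability and of each state's win probability."""
    total_ev = sum(s["ev"] for s in states.values())
    wins = 0
    state_wins = {name: 0 for name in states}
    for _ in range(n_sims):
        # Terminal national spread: today's level plus a Brownian increment
        # whose variance grows linearly with the time left.
        national_final = national_now + national_vol * math.sqrt(time_left) * rng.gauss(0.0, 1.0)
        electoral_votes = 0
        for name, s in states.items():
            spread = s["alpha"] + s["beta"] * national_final + rng.gauss(0.0, s["sigma"])
            if spread > 0.0:
                electoral_votes += s["ev"]
                state_wins[name] += 1
        if electoral_votes > total_ev / 2.0:
            wins += 1
    return wins / n_sims, {name: count / n_sims for name, count in state_wins.items()}


# Hypothetical calibrated states (in practice alpha, beta, sigma would come from ols_fit).
HYPOTHETICAL_STATES = {
    "A": {"ev": 10, "alpha": 8.0, "beta": 1.0, "sigma": 2.0},
    "B": {"ev": 10, "alpha": -8.0, "beta": 1.0, "sigma": 2.0},
    "C": {"ev": 9, "alpha": 0.0, "beta": 1.0, "sigma": 2.0},
}

p_near, state_probs_near = simulate_election(HYPOTHETICAL_STATES, 3.0, 1.0, 1.0, 5000, random.Random(1))
p_far, _ = simulate_election(HYPOTHETICAL_STATES, 3.0, 1.0, 400.0, 5000, random.Random(1))
print(f"1 day left: {p_near:.3f}   400 days left: {p_far:.3f}")
```

With the same 3-point national lead, the estimated win probability is high when one day remains and much closer to one half when 400 days remain, which is the time effect the model is designed to capture.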
3.2. Bayesian Methods
3.2.1. Robust Regression
For our first advanced model, we assume a similar CAPM structure. However, we additionally postulate that:
[TABLE]
Where and . Finally, we simulate as a Brownian Motion, and draw from the predictive distribution of for the state-specific noise. An example of this can be seen in Figure 23. The simulations can be seen in Figure 24.
3.2.2. Hierarchical Regression
A potential model is to assume the graphical structure in Figure 2.
We can sample our state-level noise from this model, and use the Bachelier Process to simulate the market.
4. Evaluation of Forecasters
The traditional way of evaluating a forecaster is the Brier Score, which is defined, for a sequence of realizations and a sequence of probabilities, as:
[TABLE]
The Brier Score is proper [12] and the lower the value the better. A perfect forecaster would have a Brier Score of 0.
However, we consider a simple example, where two forecasters are asked to give a probability an event , whose true probability is , will happen every day. Forecaster A provides a constant probability of [math] while forecaster B provides a constant probability of . If both provide the same constant probability for 100 time periods, Brier(A) = 1, while Brier(B) = 2.25. It is not clear whether providing a zero probability for a low probability event is optimal. Thus, we need better tail event behaviour. This motivates the log-likelihood.
[TABLE]
As seen in Figure 20, the Log Likelihood diverges to infinity at the tails, while the Brier/Selten Scores become flat.
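The difference in tail behaviour can be seen in a small example. The sketch below assumes the common mean-squared-error convention for the Brier Score and the negative mean log-likelihood for the log score (lower is better for both); the 100-day sequence with a single occurrence and the two constant forecasts are hypothetical.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes):
    """Negative mean log-likelihood; infinite if a realized outcome was given probability 0."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        likelihood = p if y == 1 else 1.0 - p
        total += math.inf if likelihood == 0.0 else -math.log(likelihood)
    return total / len(probs)


# Hypothetical rare event: it occurs once in 100 days.
outcomes = [0] * 99 + [1]
always_zero = [0.0] * 100
small_prob = [0.05] * 100

print("Brier:", brier_score(always_zero, outcomes), brier_score(small_prob, outcomes))
print("Log:  ", log_loss(always_zero, outcomes), log_loss(small_prob, outcomes))
```

Under the Brier Score the forecaster who always reports zero is barely distinguishable from the one who reports a small positive probability, while the log score penalizes the zero forecast without bound the first time the event occurs.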
4.1. Density Comparison
Both the Brier and the log-likelihood scores, however, provide an incomplete picture. Consider the US Election of 2016, where the two most publicized forecasters were Nate Silver and Dr. Sam Wang. Nate Silver gave Hillary Clinton a roughly 70% chance to win, while Dr. Wang gave her a much higher probability. While the realization was 232 EV for Clinton, this raises a subtle question: what if Clinton had won in a Reagan-esque landslide? In that case, the Brier and Log Likelihood scores would say that Dr. Wang was the better forecaster. However, a look at the histogram begs to differ:
We propose two methods to evaluate forecasters based on their histogram.
4.1.1. Selten Score
The first, the Selten Score [13], is equivalent to taking the Brier Score in each bin of the histogram. (Footnote 2: Named after Nobel Laureate Reinhard Selten. Though it was inspired by his argument for the singularity of the Brier Score, we could not find a reference which uses the Brier Score in this way.) We treat each bin independently. We know from [11] that the maximum of a sum of functions is equivalent to the sum of maximums. As a result, the Selten Score is proper.
For bins in the Histogram, each assigned a probability , with a realization , we have:
[TABLE]
We interpret this as a score that rewards assigning a high probability to the correct bin and penalizes the forecaster for having too spread out a distribution.
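A minimal sketch of a per-bin Brier computation is given below. The exact normalization and sign convention of the paper's formula are not recoverable here, so this version is an assumption: it sums the squared error in every bin and treats lower values as better.

```python
def selten_score(bin_probs, realized_bin):
    """Sum over bins of (p_i - 1{i == realized_bin})**2; lower is better in this sketch."""
    return sum(
        (p - (1.0 if i == realized_bin else 0.0)) ** 2
        for i, p in enumerate(bin_probs)
    )


# Hypothetical four-bin histograms with the outcome falling in bin 2.
sharp_correct = [0.0, 0.0, 1.0, 0.0]
sharp_wrong = [1.0, 0.0, 0.0, 0.0]
uniform = [0.25, 0.25, 0.25, 0.25]

for name, hist in [("sharp, correct", sharp_correct), ("sharp, wrong", sharp_wrong), ("uniform", uniform)]:
    print(name, selten_score(hist, 2))
```

Expanding the square gives the sum of squared bin probabilities, minus twice the probability of the realized bin, plus one, which matches the interpretation above: mass on the correct bin is rewarded and spread is penalized.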
4.1.2. CDF Score
The Selten Score takes no account of the topology of the different bins. To account for this topology, we propose a different scoring function, which looks at the Brier Score above or below each level. The Daily Kos methodology, which uses Binomial Models as priors, lends itself to producing spikes in the EV histogram (see Figure 12). Thus, even though the results are similar to the Princeton Election Consortium (see Figure 13), the Selten Score (see 11) penalizes it, as the topology of the bins is not factored into the calculation. Two forecasters who gave the entire probability mass to a single bin (say 226 and 538) would have the same Selten score, even though the result was 227 EV. We seek a scoring function that takes this topology into account.
[TABLE]
A proof for the propriety of this scoring function is given in Appendix A.
The results for each model on each scoring function are given in Appendix C.
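The example of the two point-mass forecasters can be worked through numerically. The sketch below assumes a discrete form of the CDF Score, namely the squared difference between the forecast CDF and the step function of the outcome, summed over every electoral vote level (the ranked probability score). The paper's exact formula may differ in normalization.

```python
from itertools import accumulate


def cdf_score(bin_probs, realized_bin):
    """Sum over levels of (F(level) - 1{level >= realized_bin})**2; lower is better."""
    forecast_cdf = accumulate(bin_probs)
    return sum(
        (cdf_value - (1.0 if level >= realized_bin else 0.0)) ** 2
        for level, cdf_value in enumerate(forecast_cdf)
    )


def point_mass(n_bins, location):
    """Histogram that puts all probability on a single bin."""
    return [1.0 if i == location else 0.0 for i in range(n_bins)]


N_LEVELS = 539  # electoral vote totals 0..538
near_miss = point_mass(N_LEVELS, 226)
far_miss = point_mass(N_LEVELS, 538)

print(cdf_score(near_miss, 227), cdf_score(far_miss, 227))
```

Both forecasters miss the realized bin, so a bin-by-bin score treats them identically, whereas this score grows with the distance between the forecast mass and the outcome.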
4.1.3. Comparison of Different Scoring Functions
As seen in Figure 20, the behaviour of different scoring functions becomes very important in the tails. While the Selten/Spherical/Brier scores become flat, the log score diverges, which makes it hypersensitive to low-probability events, while the Selten/Spherical/Brier scores are undersensitive. Once again, we see the benefit of the CDF Score, as it moves linearly in the tails.
5. Trading Score and Online Learning
All of the scoring functions given above are ex-post, i.e., they require the realization of an event. However, in many cases, such as election modelling, we would like to judge the efficacy of forecasters before the realization of the event so that we may combine them to create an optimal mixture of forecasts.
For this purpose, we propose the trading score. Consider 2 forecasters predicting a binary event, each of whom posts a time series of forecasts. Assume that each day each forecaster takes a position that is proportional to the distance of their forecast from a reference level. This can be the betting market (see Appendix C for results) or the average of the 2 forecasters. Given that each forecaster takes a position every day (bought at the betting market or at the average), we can re-evaluate the value of this position every day and treat the cumulative P&L as an online scoring function. Finally, on the realization of the event, we either settle at the betting market (which converges to 0 or 1) or at the realization itself (0 or 1). See Appendix B for a proof.
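One way to read the trading score is sketched below. The details are assumptions made here, not the paper's specification: the position on each day is taken to be exactly the forecast minus the reference price (a proportionality constant of one), in units of a contract that pays 1 if the event occurs, and the numbers are hypothetical.

```python
def trading_pnl(forecasts, reference_prices, settlement):
    """Total P&L when each day's position (forecast - price) is bought at that day's price
    and every position is valued at the same settlement price."""
    return sum((f - r) * (settlement - r) for f, r in zip(forecasts, reference_prices))


def mark_to_market(forecasts, reference_prices):
    """Running P&L: on each day, value all earlier positions at that day's reference price."""
    return [
        trading_pnl(forecasts[:t], reference_prices[:t], price_now)
        for t, price_now in enumerate(reference_prices)
    ]


# Hypothetical two-day example: the forecaster is more bullish than the market.
forecasts = [0.8, 0.8]
market = [0.6, 0.7]

running = mark_to_market(forecasts, market)   # ex-ante: needs no realization
final_if_event = trading_pnl(forecasts, market, 1.0)
final_if_no_event = trading_pnl(forecasts, market, 0.0)
print(running, final_if_event, final_if_no_event)
```

The running P&L is available before the event is realized, which is what allows it to be used as an ex-ante loss when combining forecasters.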
Using the trading score, with the betting market as a reference, we employ the weighted majority algorithm, with the trading score and the quadratic loss (Brier score) as the loss function in each case. At each date, we take a position relative to our prediction and then backtest this position over all available time periods. Finally, we look at the entire profit and loss of the algorithm that buys and sells the betting market at its predicted price.
For experts, our weights are initialized , for , and updated as follows:
[TABLE]
with the prediction at each round being:
[TABLE]
where $L$ denotes the cumulative loss of each expert. [9] derive a bound on the regret up to time $T$ and show that for
[TABLE]
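A common form of this kind of forecaster is the exponentially weighted average, sketched below. It is offered as an illustration rather than as the paper's exact update: the learning rate `eta`, the quadratic loss, and the two hypothetical experts are assumptions, and the standard regret bound quoted in the comment holds for convex losses bounded in [0, 1].

```python
import math


def exponentially_weighted_forecast(expert_forecasts, outcomes, eta, loss):
    """Each round, predict the weighted average of the experts, with weights
    proportional to exp(-eta * cumulative loss); then update the cumulative losses."""
    n_experts = len(expert_forecasts[0])
    cumulative = [0.0] * n_experts
    predictions = []
    for forecasts, outcome in zip(expert_forecasts, outcomes):
        best = min(cumulative)  # subtracting the minimum avoids underflow in exp
        weights = [math.exp(-eta * (c - best)) for c in cumulative]
        total = sum(weights)
        predictions.append(sum(w * f for w, f in zip(weights, forecasts)) / total)
        for i, f in enumerate(forecasts):
            cumulative[i] += loss(f, outcome)
    return predictions, cumulative


def quadratic_loss(forecast, outcome):
    return (forecast - outcome) ** 2


# Hypothetical experts: one reports 0.9 every day, the other 0.1; the event always occurs.
T, N = 50, 2
expert_forecasts = [[0.9, 0.1] for _ in range(T)]
outcomes = [1] * T
eta = math.sqrt(8.0 * math.log(N) / T)

predictions, expert_losses = exponentially_weighted_forecast(
    expert_forecasts, outcomes, eta, quadratic_loss
)
mixture_loss = sum(quadratic_loss(p, y) for p, y in zip(predictions, outcomes))
regret = mixture_loss - min(expert_losses)
# Standard bound for this choice of eta: regret <= sqrt((T / 2) * ln N).
print(predictions[0], predictions[-1], regret, math.sqrt(T / 2.0 * math.log(N)))
```

The mixture starts at the simple average of the experts and moves toward the better one as losses accumulate, without needing to know in advance which expert that is.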
6. Data Methodology for the 2016 Presidential Election
The data for both the national and state polls is obtained from RealClearPolitics. The record for each poll consists of the following attributes:
- Name of the Pollster
- State (= US for National Polls)
- Sample Size
- Sample Type [Registered Voters, Likely Voters, All]
- Trump %
- Clinton %
- Johnson %
- Stein %
Using this data, we assume WLOG that candidate 1 is Clinton and use her spread as the primary variable to model. We choose to model only the Democratic and Republican candidates, as the leading independent, Gary Johnson, shows his highest poll % in a New Mexico poll conducted by the Albuquerque Journal [1], where Clinton still leads by 11%. It should be noted, though, that Johnson winning New Mexico could create many more scenarios of an electoral deadlock, in which case the state popular vote is no longer an accurate way to forecast the winner of the election. A more detailed methodology would allow for a fat tail where this is possible.
6.1. Missing States
Numerous states are either not featured in the poll database or contain fewer than 4 points, making any inference with them extremely dubious.
The following states lacked sufficient data to do any analysis:
- Alabama
- Alaska
- Hawaii
- Kentucky
- Montana
- Nebraska
- North Dakota
- Oklahoma
- South Dakota
- Tennessee
- West Virginia
- Wyoming
- Washington D.C.
As a result, to augment the data, we used the state’s popular vote spread for the candidate's party in each election going back to 1976, along with the national popular vote spread for that year.
6.2. Asynchronicity in State and National Polls
As a result of sparse data in each state, there may not be state and national polls available in each time period, or alternatively there may be multiple national polls within the same period. To deal with this, we calibrate a non-parametric regression with a Gaussian kernel to the national polls. The chosen bandwidth is 5.0, which is determined empirically.
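A Gaussian-kernel regression of this kind can be sketched as a Nadaraya-Watson estimator. The choice of that particular estimator, the use of days as the time unit for the bandwidth, and the poll values are assumptions made here for illustration.

```python
import math


def gaussian_kernel_smooth(poll_days, poll_spreads, query_day, bandwidth=5.0):
    """Kernel-weighted average of poll spreads around query_day (Nadaraya-Watson).

    Assumes at least one poll lies close enough to query_day for its weight to be non-zero.
    """
    weights = [
        math.exp(-0.5 * ((query_day - day) / bandwidth) ** 2) for day in poll_days
    ]
    return sum(w * s for w, s in zip(weights, poll_spreads)) / sum(weights)


# Hypothetical national polls: (day, spread in points). Two polls fall on day 3.
poll_days = [0.0, 3.0, 3.0, 10.0, 14.0]
poll_spreads = [4.0, 6.0, 2.0, 5.0, 3.0]

# The smoothed series is defined on every day, including days with no national poll,
# so each state poll can be paired with a national value at its own date.
smoothed = [gaussian_kernel_smooth(poll_days, poll_spreads, day) for day in range(15)]
print([round(value, 2) for value in smoothed])
```

Because every poll receives the same treatment apart from its distance in time, this corresponds to the simple, unweighted-by-quality aggregation discussed in the text.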
This methodology may be criticized for its naivety, and in fact most other public methodologies [15] weight each poll by some measure of its quality. However, a recent paper [10] surveys the long history of aggregating polls and questions whether weighting by accuracy measures is relevant when the error arises from multiple sources (not just sampling error). The paper posits that it may be just as appropriate, if not more so, to aggregate using a simple weighted average.
6.3. Modeller Data
In order to compile the experts, we scraped the forecasts made publicly available by the following forecasters:
- FiveThirtyEight (Now/Plus/Polls’ Models)
- Huffington Post
- New York Times
- Princeton Election Consortium
- Daily Kos
Additionally, we use the CAPM and Option Market Models we have created as our 5th and 6th experts. We use the data provided by [8].
7. Results
As we see in Figure 5, the CAPM and the Online Learning Model performed better than everyone except FiveThirtyEight, the Betting Market, the CAPM and the Option Market Model. The additional noise in the CAPM Model also helped it perform very well. In a state average, we see that predicting OH, FL, IA, NC and AZ correctly from early on made the CAPM perform very well. An Electoral Vote weighted average shows that the CAPM performed best. (Footnote 3: It is not clear that advanced Bayesian methods are helpful here. The probabilities provided by the Student T regression as well as the Hierarchical Model were very similar to the CAPM Model. If we study the histogram in Figure 24, it is not clear that fat-tailed distributions are relevant here, unless we believe there are non-trivial probabilities of a landslide.)
Finally, the log score penalizes the Princeton Election Consortium, as they reported 100% probabilities for Clinton winning. The same applies to Daily Kos in the state-wise and EV-weighted averages.
We see the flaw in the Selten Score (see 11), as the DK histogram in Figure 12 has spikes as an artifact of the modelling technique, which the Selten Score ignores. The CDF Score (see 14), on the other hand, shows that the two are really comparable.
The Online Learning Algorithm performs very well (see 16), but is unable to overcome the fact that most experts were sure about Clinton and that there was a strong discontinuity in the betting market towards the end. In terms of scoring functions, it performed much better than most algorithms, and was comparable to the CAPM Model.
The trading score performed much better than a conventional quadratic score (see Figure 18) but was more bullish than the quadratic score (see Figure 17). As a result, the final P&L was slightly worse. The MSE (see Figure 19) for predicting the betting market was lower for the Online Algorithms than for any modeller except 538. (Footnote 4: It is not clear that the betting markets, like all of us, are not just following 538.) We once again see that there are some issues to be sorted out: while the trading score does solve the ex-ante, ex-post problem, it does not provide a good reference when no market is available. Martingality of prices, by the No-Arbitrage Theorems, implies that the average is not a good reference. (Footnote 5: By linearity of expectation, taking an equally weighted average should be ideal, and by the Martingale property the best guess of the future value is the current value.)
The option methodology does not have a clean way to select securities. Trump’s rhetoric allowed us to interpret the USDMXN exchange rate correctly, but in general no such clean selection rule exists. This would be an ideal task for further machine learning research.
Finally, to improve the Online Learning Algorithms, better scoring functions are needed. The CDF score seems like a great candidate, though data for experts is very hard to get. We hope that this paper will motivate the use of density scoring versus simple Brier or Log-Likelihood Scoring. Currently very few forecasters report the density and fewer keep a catalogue of it. Providing this data will allow for a better aggregation of the different forecasters, and better model building.
Appendix A CDF Score is Proper
Consider a forecaster whose true measure is $P$ and who reports $Q$. Then his expected score is:
[TABLE]
Differentiating with respect to the reported distribution, setting the derivative to zero, and assuming regularity conditions on the densities, we get:
[TABLE]
Now, we consider a Cramer type of divergence function:
[TABLE]
[TABLE]
which shows that the CDF score is proper.
Appendix B The Trading Score is Proper
Consider a two-date, one-period model. The position at time $t$ is bought at the prevailing price. At time $T$, we settle at the realization, which gives us
[TABLE]
Hence, the profit and loss of this trading score corresponds to a strictly proper scoring function. A simple induction argument proves this for N time periods. For the position taken at a given period, we consider maximizing the profit and loss.
[TABLE]
Now, we can do the same for , setting , by the tower property of the conditional expectation [9].
Appendix C Results of Models
Appendix D Behaviour of Different Scoring Functions
Appendix E Simulation Results
References
- [1] Albuquerque Journal. https://www.abqjournal.com/857961/clinton-trump-in-tight-race-in-new-mexico.html, 2016.
- [2] A. Banerjee, X. Guo, and H. Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Transactions on Information Theory, 51(7):2664–2669, 2005.
- [3] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.
- [4] J. Berg, R. Forsythe, F. Nelson, and T. Rietz. Results from a dozen years of election futures markets research. Handbook of Experimental Economics Results, 1:742–751, 2008.
- [5] P. Hummel and D. Rothschild. Fundamental models for forecasting elections.
- [6] I. Karatzas and S. Shreve. Brownian Motion and Stochastic Calculus, volume 113. Springer Science & Business Media, 2012.
- [7] D. A. Linzer. Dynamic Bayesian forecasting of presidential elections in the states. Journal of the American Statistical Association, 108(501):124–134, 2013.
- [8] M. Lott and J. Stossel. Election Betting Odds. Betfair data compiled at ElectionBettingOdds.com by Maxim Lott and John Stossel.
