Election Modeling

05-29-2012 , 04:32 AM
Introduction
I got tired of waiting for Nate Silver to release his own election model, so I started building my own. I don’t do this for a living and have a new baby, a full-time job, and a part-time other job, hence limited time. However, using the 80/20 principle I’ve found a good, intuitive way to aggregate polls and estimate each candidate’s chance of winning each of a dozen swing states.

My model is currently informed only by polling data (taken from RCP) and how far away we are from the election. Attempting to include economic or demographic data would mean a lot more work without much added value, since that information is already captured in the polls.

I use a logistic model to translate aggregated polling data into likelihood of winning for each state. The shape of the logistic graph is informed both by how far we are from election day and how much polling has gone on in a state (the curve gets sharper with more polling data and the closer we get to election day).

Translating a candidate’s probability of winning each state into an overall chance of winning the election is more difficult than it might appear on its face, as individual state outcomes are not independent of each other. I’ve run Monte Carlo simulations using each state’s probabilistic outcome, with those outcomes correlated to varying degrees. I am still exploring the best way to estimate interstate correlations and I’m also investigating other possible solutions.

Summary
1) Aggregate state-level polling data, weighting newer, bigger polls over older, smaller ones.
2) Translate aggregated polls into a likelihood of winning for each candidate.
3) Simulate election outcomes using each candidate’s chances in each state.
4) I think a discussion of election modeling will do well in its own thread. The model is based entirely on math, not politics; I didn't even know what the outcome would be as I was building it. If you wish to yell about birth certificates, dogs on roofs, the stupid story du jour, etc. etc. (as I frequently do) the general election thread is here

Last edited by goofball; 05-29-2012 at 04:53 AM.
05-29-2012 , 04:44 AM
State Poll Aggregation:
RCP posts poll data as it comes in, so unless and until I find a better source I’ll use them. I calculate the poll average weighting each poll by age and sample size. Older, smaller polls are less meaningful. At the state level sample size weights work linearly: a 1000 person poll is weighted twice as much as a 500 person poll. Age weights work on a half-life of 30 days (this may need to change as the election draws near). A poll from today counts twice as much as a poll from 30 days ago, which counts twice as much as a poll from 60 days ago. I may build in a distinction between likely and registered voter polls but I haven’t yet.
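The weighting described above can be sketched in a few lines of Python. This is only an illustration, not the actual spreadsheet: the function names and the three example polls are made up, and it assumes the sample-size weight and the age weight are simply multiplied together.

```python
# Hypothetical sketch of the poll weighting: linear in sample size,
# halving for every 30 days of age. The example polls are made up.
HALF_LIFE_DAYS = 30.0

def poll_weight(sample_size, age_days, half_life=HALF_LIFE_DAYS):
    """Sample size times an age discount that halves every half_life days."""
    return sample_size * 0.5 ** (age_days / half_life)

def weighted_average(polls):
    """polls: list of (candidate_share, sample_size, age_days) tuples."""
    weights = [poll_weight(n, age) for _, n, age in polls]
    weighted_sum = sum(w * share for w, (share, _, _) in zip(weights, polls))
    return weighted_sum / sum(weights)

# Illustrative data only: (share in %, sample size, age in days)
example_polls = [(48.0, 1000, 0), (46.0, 500, 30), (50.0, 1000, 60)]
average = weighted_average(example_polls)
```

The same function would be run once per candidate per state, with "Other" taken as the remainder.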

Taking Florida as an example, see the spreadsheet below:



The “Other” column is just Other = (100 – Obama – Romney).

The last piece to address is the effective sample. It is the sum of the sample sizes times the poll weights and indicates the effective global sample of the polling done in the state. The more big, recent polls have been done, the higher the effective sample will be. The effective sample is also used in the translation of aggregate poll to winning %, which I’ll be discussing next.
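As a rough sketch of the effective sample, assuming "poll weights" here means the 30-day half-life age weights (that reading is my assumption, and the poll lists are made up):

```python
# Hypothetical sketch: effective sample as the sum of sample sizes, each
# discounted by its age weight (assumed 30-day half-life).
def effective_sample(polls, half_life=30.0):
    """polls: list of (sample_size, age_days) pairs."""
    return sum(n * 0.5 ** (age / half_life) for n, age in polls)

# Made-up examples: a state with fresh polling vs. one with stale polling.
recent_heavy = effective_sample([(1000, 0), (800, 15)])
stale_light = effective_sample([(1000, 60), (800, 90)])
```

A state with the same raw number of respondents ends up with a much smaller effective sample when its polls are old.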

See the aggregate results for a dozen swing states below.
05-29-2012 , 04:52 AM
State Winning Percentages
After estimating the most accurate possible poll, the next step is to ask what that poll tells us about each candidate’s chance to win the state. I use a logistic model for each state, informed by how far away we are from the election and the state’s effective sample size. For example, for a fully sampled poll (defined as n = 10,000) done today, the relationship between polling advantage and chance of winning looks like:



However on Election Day it looks like:



This is steeper (and implies smaller errors) than a typical poll one might read on election day; however, it’s important to remember that a model using many aggregated polls should have a much lower standard error than any one poll.

The logistic function is calibrated as follows. Polls currently conducted have two main sources of error:
1) The poll sample doesn’t necessarily represent the population (sample size)
2) The poll doesn’t know what will happen between now and election day (cone of uncertainty)

Error source #1 can be calculated. We discount polls against what we have defined as a full sample (10k) using the square root of the ratio of the sample / 10,000 (quite a bit went into selecting square root ratio over log-ratio or standard ratio if anyone is interested).

Quantifying the error from Source #2 is more difficult. The current model assumes a linear flow of information (we learn, on average, as much going from 60 to 59 days out as we do from 9 to 8 days). We then calibrate the logistic using the assumption that a candidate who is 1 point down in a state in a full 10k sample today has a 45% chance to win that state. Plugging everything in we get the following results:



In summary:
1) The aggregate poll is translated into a likelihood of winning using a logistic curve that gets sharper as sample size increases or the election grows closer.
2) Error comes from two primary sources (1) sample size and (2) cone of uncertainty.
3) The logistic is calibrated assuming the error from (2) is zero on Election Day, and assuming a candidate ahead by 1 point in a full sample today has a 55% chance to win the state.
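
The calibration summarized above can be sketched like this. The 45%/55% anchor, the 10k full sample and the square-root discount come from the post; everything else is an assumption for illustration. In particular, time_sharpening is a hypothetical stand-in for how the curve steepens toward Election Day (its exact form isn't spelled out here), and capping the effective sample at 10k is also assumed.

```python
import math

FULL_SAMPLE = 10_000
# Calibration point: 1 point down in a full sample today -> 45% to win,
# so the slope per point of margin is ln(0.55 / 0.45).
BASE_SLOPE = math.log(0.55 / 0.45)

def win_probability(margin, effective_sample, time_sharpening=1.0):
    """Logistic chance of winning a state given a polling margin in points.

    time_sharpening is a hypothetical multiplier (1.0 = today) for the
    steepening of the curve as the election approaches.
    """
    # Assumption: effective samples above the full sample get no extra credit.
    capped = min(effective_sample, FULL_SAMPLE)
    sample_discount = math.sqrt(capped / FULL_SAMPLE)
    slope = BASE_SLOPE * time_sharpening * sample_discount
    return 1.0 / (1.0 + math.exp(-slope * margin))
```

With a full sample and time_sharpening left at 1.0, a margin of -1 comes out at 45% and +1 at 55%; a smaller effective sample flattens the curve toward 50%.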
05-29-2012 , 04:57 AM
Electoral College Simulations
I plugged the calculated winning percentages into a Monte Carlo simulator. If we were to assume no correlation between state outcomes, President Obama would have a 91% chance to win. This is not correct, though: state outcomes are clearly dependent, so I built state-to-state correlation into the simulation. Assuming middling state-to-state correlation drops President Obama’s chance to win to 73%, while using a very high correlation reduces it to 66%. I’m currently using a global factor (every state is equally correlated with every other state), which needs refining, but figuring out specific state-level correlations is difficult and involved. I’m also exploring other methods but have yet to make much progress.

Summary:
1) President Obama’s chance to win re-election based on current polling data is probably between 65% and 75%.
2) Florida and North Carolina are currently by far the closest states. Even if Mr. Romney wins both those states (and AZ, and MO) he’ll still need to pick off at least 3 states from President Obama (the most likely being CO, OH, and VA).
3) Lots of polling is being done in FL and OH.
4) Figuring out the correlation in outcomes between states is difficult.
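
One way to sketch a simulation with a single global correlation factor is below. To be clear, this is not necessarily how my simulator does it: it swaps in a one-factor normal model (a shared national swing plus independent state noise), and the state probabilities and electoral votes are made-up toy numbers, not model output.

```python
import math
import random
from statistics import NormalDist

def simulate(states, rho, trials=20_000, seed=0):
    """Estimate a candidate's chance of winning a majority of electoral votes.

    states: list of (win_probability, electoral_votes), probabilities
            strictly between 0 and 1.
    rho:    one global correlation (0 to 1) applied to every pair of states.
    """
    rng = random.Random(seed)
    # A state is won when its normal draw falls below this threshold, which
    # keeps each state's marginal win probability unchanged for any rho.
    thresholds = [NormalDist().inv_cdf(p) for p, _ in states]
    votes_to_win = sum(ev for _, ev in states) // 2 + 1
    wins = 0
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)  # swing common to all states
        votes = 0
        for threshold, (_, ev) in zip(thresholds, states):
            local = rng.gauss(0.0, 1.0)  # state-specific noise
            z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * local
            if z < threshold:
                votes += ev
        if votes >= votes_to_win:
            wins += 1
    return wins / trials

# Toy example only: five hypothetical states, candidate favored in each.
toy_states = [(0.70, 10), (0.60, 20), (0.55, 15), (0.80, 10), (0.65, 12)]
independent = simulate(toy_states, rho=0.0)
correlated = simulate(toy_states, rho=0.9)
```

In this toy setup the favored candidate's overall chance drops as rho rises, which is the same direction as the 91% to 73% to 66% pattern above.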
05-29-2012 , 04:57 AM
Next Steps
I’ve got some things on my to-do list:
1) Add a distinction between likely and registered voter models.
2) Add some feature of momentum (especially as we approach election day). For example, I’m not sure I believe the president is still ahead in NC.
3) Refine interstate outcome correlations.
4) Build other things into the poll weights (discount outliers? Add house effects?)
5) Others?
05-29-2012 , 06:13 AM
Looks interesting. I appreciate the effort. Don't know much about models, but I'm taking a class on them and polling this summer, so maybe I can offer some insight in a couple of months.
05-29-2012 , 10:28 AM
Looks kinda cool. GJ imo.
05-29-2012 , 10:39 AM
I'd be interested to see the LV/RV breakdown wrt polling.

I also think that even with an aggregate of polls, your election day range of error is way too tight. I don't see how a three point poll average advantage could 100% guarantee victory. Hell, eyeballing, a one point advantage on election day implies 85% chance of victory. Have you looked at past elections to see if this holds?
05-29-2012 , 10:39 AM
A couple of things. First of all kudos for the effort, but I think some of your methodology isn't going to give you a good result of the situation on the ground today.

First of all, when Romney was polled vs. Obama in April and March he wasn't even the nominee and was getting whacked on a daily basis by Newt and Santorum. Many of those candidates' supporters probably weren't saying "Romney" when the pollsters called in as high a number as today, when he is their only choice.

Secondly, I haven't taken statistics in 40 years, but I doubt as you double sample size from 1,000 to 2,000 you double the accuracy of the data, but I could be wrong about that.

For example, your methodology gave you practically double the leads in WI and MI for Obama than the RCP averages which use more recent polling data.

Likely voters, especially from multiple sources, would be better than all voters or registered voters, but there isn't enough of that data at the state-by-state level yet.

I guess my overall criticism is data from 30 to 90 days ago is pretty meaningless especially when Romney wasn't even the nominee at the time. It is ok for showing trends, but useless when it comes to gauging current sentiment.
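
On the sample-size question above, a quick check under textbook assumptions (simple random sampling, a 50/50 race; the function name is just for illustration): the standard error of a polled share shrinks with the square root of the sample size, so doubling from 1,000 to 2,000 improves precision by a factor of about 1.41 rather than 2.

```python
import math

def standard_error(share, sample_size):
    """Standard error of a polled proportion under simple random sampling."""
    return math.sqrt(share * (1.0 - share) / sample_size)

se_1000 = standard_error(0.5, 1000)
se_2000 = standard_error(0.5, 2000)
ratio = se_1000 / se_2000  # how much tighter the larger poll is
```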
05-29-2012 , 11:59 AM
Subscribed.

05-29-2012 , 12:04 PM
Quote:
Originally Posted by goofball
Electoral College Simulations
I plugged the calculated winning percentages into a Monte Carlo simulator. If we were to assume no correlation between state outcomes, President Obama would have a 91% chance to win. This is not correct, though: state outcomes are clearly dependent, so I built state-to-state correlation into the simulation.
Can you explain this a bit? I don't see why, e.g., the voters in AZ care about what voters in NM or CO are doing. IOW, the likelihood of Romney winning AZ is unaffected by whether or not Romney wins CO and NM.
05-29-2012 , 12:06 PM
Very cool!
05-29-2012 , 12:13 PM
Quote:
Originally Posted by gusmahler
Can you explain this a bit? I don't see why, e.g., the voters in AZ care about what voters in NM or CO are doing. IOW, the likelihood of Romney winning AZ is unaffected by whether or not Romney wins CO and NM.
The voters in AZ care about the same things as the voters in NM and CO. The sum of the campaigns, the news cycle, etc., is unpredictable, but it has similar effects in different states.

It's the degree to which whatever swings the outcome in one state matters in another state. Imagine that Romney loses Texas. Whatever made that possible means he's going to lose a whole bunch of other states. Or suppose Obama loses California. He's not going to lose CA and win Florida. So the outcomes of states have some correlation.
05-29-2012 , 12:19 PM
I don't know if it helps even at all, but when discussing Ohio, I think it should be known that depending on where you poll, you can get completely different results.

If you were to poll greater Columbus, Cleveland, and Cincinnati (Three of the most populated cities), it would lean greatly to Obama (and interestingly R Paul).

If you polled all other, it would lean toward Romney.

And, if you did some type of combination, again, it would really matter how much the big three cities were weighted.

If I were to actually deem Ohio as an Obama state, it wouldn't be because of the current polling numbers, it would be because the population outside those three cities may not stand behind Romney (and thus, just not vote) as much as they would get out for a more polarizing figure.
05-29-2012 , 12:21 PM
Quote:
Originally Posted by Chips Ahoy
It's the degree to which whatever swings the outcome in one state matters in another state. Imagine that Romney loses Texas. Whatever made that possible means he's going to lose a whole bunch of other states. Or suppose Obama loses California. He's not going to lose CA and win Florida. So the outcomes of states have some correlation.
But if Obama were to F up so badly that he loses CA, wouldn't that already show in the FL polls?
05-29-2012 , 12:46 PM
Quote:
Originally Posted by gusmahler
But if Obama were to F up so badly that he loses CA, wouldn't that already show in the FL polls?
A source of uncertainty in the model is that events that happen between the polls and the election could cause the electorate to favor one candidate over the other. Many events that could cause CA voters to deviate in a pro-Romney way from CA poll results would also cause FL voters to do the same.
05-29-2012 , 12:52 PM
Quote:
Originally Posted by gusmahler
But if Obama were to F up so badly that he loses CA, wouldn't that already show in the FL polls?
goofball is building a simulator that takes % chances of winning each state and turns it into a national result by "running it twice a whole bunch". Sometimes the simulation will have Obama losing CA (today, it won't as the election gets closer). He's saying it's a better simulator if in the simulated universe where Obama loses CA that means he loses elsewhere too.
05-29-2012 , 01:50 PM
Just a suggestion for your model...

AFAIK, undecided voters in most POTUS elections break for the challenger by large margins.... Obama still finds himself south of 50% in most of the swing states, and though he may have a lead in a bo-mr formula, using a formula based off the <>50% mark for a portion of your weighting might help your accuracy.
05-29-2012 , 01:51 PM
Quote:
Originally Posted by Chips Ahoy
goofball is building a simulator that takes % chances of winning each state and turns it into a national result by "running it twice a whole bunch". Sometimes the simulation will have Obama losing CA (today, it won't as the election gets closer). He's saying it's a better simulator if in the simulated universe where Obama loses CA that means he loses elsewhere too.
OK, I get it. Thanks.
05-29-2012 , 02:01 PM
i don't know the packages in excel, but i know stata has a command to let you generate pseudo-random numbers that have pre-determined statistical correlations. this might be something you want to play with to try to account for the error that state outcomes are not IID and one candidate doing better in michigan probably means that they are also doing better in ohio

you could also think about some sort of random effects model to capture between group differences and cluster states that you think will be correlated in how their outcomes move
05-29-2012 , 02:52 PM
Quote:
Originally Posted by pokerbobo
Just a suggestion for your model...

AFAIK, undecided voters in most POTUS elections break for the challenger by large margins.... Obama still finds himself south of 50% in most of the swing states, and though he may have a lead in a bo-mr formula, using a formula based off the <>50% mark for a portion of your weighting might help your accuracy.
I recall Nate shooting this down as a myth.
05-29-2012 , 03:14 PM
I'm excited. I'm forecasting to make approximately $15K in profit on panic based "Obama's gonna take yer assault rifles, here's a $650 AK-47 I'll sell you for $3K" sales.

Either that or we get a mormon.

Nice.
05-29-2012 , 04:20 PM
Quote:
Originally Posted by jackaaron2012
If I were to actually deem Ohio as an Obama state, it wouldn't be because of the current polling numbers, it would be because the population outside those three cities may not stand behind Romney (and thus, just not vote) as much as they would get out for a more polarizing figure.
You don't think Obama is a polarizing enough figure to get white religious conservatives in southern Ohio to get off their asses and vote?

A warm bucket of spit could be running on the R side and Obama would drive turn out.
05-29-2012 , 06:07 PM
I wonder if using economic data might be a "leading indicator" of the polling data and therefore be useful this far out from the election? Also, if you built a confidence interval just using national data how does that compare to your state by state analysis?
05-29-2012 , 10:28 PM
Quote:
Originally Posted by 13ball
I'd be interested to see the LV/RV breakdown wrt polling.

I also think that even with an aggregate of polls, your election day range of error is way too tight. I don't see how a three point poll average advantage could 100% guarantee victory. Hell, eyeballing, a one point advantage on election day implies 85% chance of victory. Have you looked at past elections to see if this holds?
Remember the graphs provided are for an aggregate poll of effective sample size 10,000 - iow imagine 10 polls were released on election day all showing a margin of 3 for Romney in Arizona. I think Romney losing Arizona in that case would be a big, big surprise.

If you look more at the scale of individual polls, the logistic gives someone who trails by 1 point on election day in an n = 500 sample a 37% chance to win, and someone who trails by 3 points a 16% chance to win.

      