“Lies, Damn Lies, and Statistics”
Well, as we await the final results in this country that looks more like a banana republic with each passing day, I wanted to take this time to release a short piece that I wrote for work on polling and how it applies to politics generally, not just to the 2020 US election. I’ve edited it for clarity and tried to explain everything as plainly as I can, without jargon, for those unfamiliar with stats.
Ultimately, I will explain why the polls seem to keep missing so drastically, and how to look at them moving forward.
Why Poll in the First Place?
Imagine I want to know how Americans feel about a particular topic. Do Americans prefer option A or option B? Maybe I am trying to gauge if there is enough of a market for a product I have recently envisioned. Whatever the reason, I need to know how Americans as a whole think.
But as you can imagine, calling and asking 320 million Americans their opinion on something would take a whole lot of money and an absurd amount of time. If I am trying to gauge the size of a market, by the time I finished this survey, the people I had asked first may have gone out and bought a substitute: my market might be way smaller than I think!
Obviously you cannot simply poll an entire population once the population gets big enough. Instead, we use samples.
Samples — How they work (or don’t work)
A sample is a subset of the population. In other words, rather than asking every single person their opinions on the issue, I ask a small number of them and then extrapolate the results from that small number to the population at large.
For instance, instead of asking all 320 million Americans if they prefer my product or a competitor’s, I might ask 3200 of them.
Now, two questions may emerge:
- 3200 seems pretty small compared to 320 million…how big does a sample need to be?
- Are there any assumptions in this process of “sampling” that we should be aware of?
Let’s address these questions in order.
Samples — Does Size Matter?
The point of sampling is to try to get a result that tells you, with reasonable certainty, that the real result (the result you’d get if you had asked the entire population) is “over there”. In other words, I may not be able to go “42.756% of my sample is interested in my product, so that means exactly 42.756% of America wants my product!” Instead, I would go “42.756% of my sample is interested in my product, and I am highly confident that somewhere between 38 and 45% of the population is interested in my product.”
When we take samples, there are two key things to bear in mind:
- Confidence level (usually reported next to a poll as the “CI”, short for confidence interval)
- Margin of error
We need to know how confident we need to be in our results (confidence level) and how precise our extrapolations need to be (margin of error).
When you see “45% voting A with 4.4% MOE at 95% CI”, the pollsters are (roughly) saying that, based on their sample, they are pretty sure that between 40.6% and 49.4% of the population will vote A.
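To make that arithmetic concrete, here is a minimal sketch using the standard normal-approximation formula for the margin of error of a proportion. The sample size of 500 is my own assumption (the example above does not state one), chosen because it happens to produce a margin of about 4.4% at 45% support. The function name and the z-score of 1.96 for 95% confidence are illustrative choices, not something taken from any particular pollster.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error for a sample proportion.

    p: share of the sample choosing the option (0 to 1)
    n: number of respondents (assumed to be a simple random sample)
    z: z-score for the confidence level (1.96 is the usual value for 95%)
    """
    return z * math.sqrt(p * (1 - p) / n)

p_hat = 0.45   # 45% of the sample says they will vote A
n = 500        # hypothetical sample size, not from the original example

moe = margin_of_error(p_hat, n)
low, high = p_hat - moe, p_hat + moe

print(f"MOE: {moe:.1%}, interval: {low:.1%} to {high:.1%}")
# prints: MOE: 4.4%, interval: 40.6% to 49.4%
```

Note that the margin depends on the sample size, not on the size of the population, which is why a few hundred respondents can say something about hundreds of millions of people.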
But as you can imagine, this is a pretty big swing. And, a key note: margin of error is usually applied to each candidate’s percentage. So if a poll says “someone is up by 10”, you have to subtract two times the margin of error from that lead to see if the lead is statistically significant (and not just because you happened to sample a greater percentage of supporters of A than there are in the population). With a 4.4% margin of error, a 10-point lead could be as small as 1.2 points, and a 6-point lead could be no lead at all.
So when you see people debating over whether or not the polls got it wrong, make sure to remember that if a poll’s lead isn’t statistically significant, it doesn’t really tell us who is ahead, since the lead could just be an artifact of who happened to end up in the sample.
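The rule of thumb above (subtract twice the margin of error from the lead) can be sketched in a few lines. This is only the rough rule described in this piece, not a formal significance test; a proper test would use the standard error of the difference between the two shares. The function name and the example numbers are hypothetical.

```python
def lead_is_significant(share_a, share_b, moe):
    """Rule of thumb: a lead only counts if it survives subtracting 2 x MOE.

    All arguments are in percentage points, e.g. 50, 40, 4.4.
    """
    lead = abs(share_a - share_b)
    return lead > 2 * moe

# A 10-point lead with a 4.4-point margin of error: 10 > 8.8
print(lead_is_significant(50, 40, 4.4))  # prints: True

# A 6-point lead with the same margin of error: 6 < 8.8
print(lead_is_significant(48, 42, 4.4))  # prints: False
```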
Okay, so now, how do we extrapolate from a sample with any confidence?
Well, size does matter. SurveyMonkey has an easy-to-understand calculator you can play with to see this. If I want to be pretty sure (typically a 95% confidence level) of a result with a 5% margin of error, I only need 385 respondents! But if I want to be really sure (typically a 99% confidence level) of a result with a 1% margin of error, I need over 16 thousand respondents!
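If you would rather see where those numbers come from than trust a calculator, here is a rough sketch of the usual sample-size formula for a proportion, n = z² · p(1 − p) / e². It assumes a very large population, a simple random sample, and the worst case p = 0.5. The z-scores (1.96 and 2.576) are the standard values for 95% and 99% confidence; the function name is mine, and I am not claiming this is exactly what any online calculator does internally.

```python
import math

# Standard z-scores for the two confidence levels used in the text.
Z_SCORES = {0.95: 1.96, 0.99: 2.576}

def required_sample_size(moe, confidence=0.95, p=0.5):
    """Respondents needed for a given margin of error (as a fraction).

    Uses p = 0.5 by default, which gives the largest (safest) sample size.
    """
    z = Z_SCORES[confidence]
    return math.ceil(z ** 2 * p * (1 - p) / moe ** 2)

print(required_sample_size(0.05, confidence=0.95))  # prints: 385
print(required_sample_size(0.01, confidence=0.99))  # prints: 16590
```

Because the margin of error sits squared in the denominator, cutting it from 5% to 1% multiplies the required sample by 25 before the higher confidence level is even taken into account.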
Depending on your desired confidence and precision in the extrapolation from sample to population, size does matter (at least in statistics).
Samples — Hidden Assumptions and how Politics Breaks Them
Something to bear in mind: there are two key assumptions that undergird calculations of sample size:
- Every unit in the population has an equal likelihood of being selected
- The data you collect from the unit is accurate
Now, you are going to notice something looking at these: neither of them seems to hold up when it comes to political polling!
For instance, we know that different demographics primarily use different media to access information (internet vs TV vs radio), perhaps explaining why polls might give very different results for radio vs internet respondents.
So we know demographics can skew who ends up in a sample. And honesty with pollsters? Ho boy…
Social desirability bias, the idea that people will give pollsters an answer they think the pollsters want instead of the answer they actually believe, is a well-known phenomenon. Jonathan Pie describes this phenomenon exceptionally well in his 2016 post-election video.
A question pollsters need to grapple with: “When are people going to be honest with you, and in which way are they likely to lie?”
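To see how much damage breaking those two assumptions can do, here is a toy simulation. Every number in it is made up for illustration: a true 52% support for A, A supporters answering the pollster 60% of the time versus 80% for B supporters, and 5% of responding A supporters claiming to back B. None of these rates come from real polling data.

```python
import random

def simulate_poll(n_contacted=100_000, true_share_a=0.52,
                  response_rate_a=0.6, response_rate_b=0.8,
                  shy_rate_a=0.05, seed=0):
    """Return the share of respondents who SAY they support A.

    All rates are hypothetical. response_rate_* breaks the "equal chance of
    selection" assumption; shy_rate_a breaks the "accurate data" assumption.
    """
    rng = random.Random(seed)
    responded = 0
    said_a = 0
    for _ in range(n_contacted):
        supports_a = rng.random() < true_share_a
        response_rate = response_rate_a if supports_a else response_rate_b
        if rng.random() >= response_rate:
            continue  # this person never answers the pollster
        responded += 1
        # A supporters tell the truth unless they fall in the "shy" group.
        if supports_a and rng.random() >= shy_rate_a:
            said_a += 1
    return said_a / responded

biased = simulate_poll()
fair = simulate_poll(response_rate_a=1.0, response_rate_b=1.0, shy_rate_a=0.0)
print(f"Poll with broken assumptions: {biased:.1%}")
print(f"Poll with both assumptions holding: {fair:.1%}")
```

With these made-up rates, the biased poll should land around 43% in expectation (0.52 × 0.6 × 0.95 divided by the overall response share of 0.696), roughly nine points below the true 52%. The sample here is huge, so the margin of error is tiny, and the poll is still badly wrong: a bigger sample does nothing to fix a sample that is skewed.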
So, if different people have a different probability of being polled and it isn’t even clear they will give you an honest answer, how do we use polls to project an election? In theory, this is where the men are separated from the boys, but the “men” tend to do pretty damn badly.
One More Hurdle — Projecting the Population
One of the biggest problems pollsters face is projecting the electorate. This new level of uncertainty isn’t just “how well does my sample project onto the population” but “so…what does the population actually look like?”
Breaking down demographics allows for specificity, but you could be looking at subpopulations that end up muddying the waters. (In this election, including Latinos as a single bloc while splitting white working class and white college-educated seems to be a mistake, as Latinos seem to need a split as well.)
And will the electorate be 33% Democrat/33% Republican/33% Independent? Or 40/30/30? Or 25/35/40? Being able to project an electorate properly should also distinguish winners from losers, but everyone but Trafalgar (how? idk) looks like a loser in 2020.
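Here is a small sketch of why the assumed electorate matters so much. It takes one fixed set of poll results by party and reweights it under the three electorate mixes mentioned above. The support levels within each party (92% of Democrats, 6% of Republicans, 50% of Independents backing candidate A) are invented for illustration, and the function name is hypothetical.

```python
def projected_share(support_by_group, electorate_mix):
    """Weighted average of group-level support, weighted by electorate mix."""
    total_weight = sum(electorate_mix.values())
    weighted = sum(support_by_group[group] * weight
                   for group, weight in electorate_mix.items())
    return weighted / total_weight

# Hypothetical share of each group supporting candidate A.
support_for_a = {"Democrat": 0.92, "Republican": 0.06, "Independent": 0.50}

# The three electorate guesses from the text (Dem / Rep / Ind).
scenarios = {
    "33/33/33": {"Democrat": 33, "Republican": 33, "Independent": 33},
    "40/30/30": {"Democrat": 40, "Republican": 30, "Independent": 30},
    "25/35/40": {"Democrat": 25, "Republican": 35, "Independent": 40},
}

results = {name: projected_share(support_for_a, mix)
           for name, mix in scenarios.items()}
for name, share in results.items():
    print(f"{name}: candidate A projected at {share:.1%}")
```

With these made-up inputs, the same raw answers project candidate A at anywhere from about 45% to about 54%, depending only on the guess about who shows up. That spread is bigger than the margin of error most polls advertise.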
So Now What?
Well, polling seems to be largely bunk. Misses in 2014 (US midterms), 2015 (UK general election), 2016 (Brexit referendum and US elections), 2019 (Australian elections), and now 2020. How do we move on from here? Many people are looking at innovative approaches using party registration and other ways of finding “revealed preferences”. But ideally we should have at least one major pollster that understands how to run effective polls and has a model that can correct for sampling biases and make decent projections. But as with everything else in this decrepit and decaying country, we are left wanting.
Have a good day, friends.