Oversampling (Or why there’s no Democratic conspiracy in the polls)
It has now been a little over two weeks since the largely unexpected election of Donald Trump. The election was unexpected for a number of reasons, one of which is that the polls were well off the mark. There wasn’t a single poll in Michigan or Wisconsin showing Trump ahead, and Clinton was ahead of (or statistically tied with) Trump in many of the states that she lost.
One common complaint I’ve heard throughout the internet, mostly among people who think Clinton “rigged the election”, is that the polls were wrong due to oversampling. To people who do not know statistics, “oversampling” sounds like a conspiracy. It seems to imply that the Clinton campaign intentionally sampled too many black people and Hispanics in order to make it look like she had a greater chance of winning.
However, this is a fundamental misunderstanding of what oversampling is. Oversampling isn’t a way for pollsters to blind themselves to demographics and how they vote. In fact, it’s precisely the opposite.
Oversampling is performed to obtain more accurate information about demographic groups that make up a small proportion of the population. For example, imagine a national pollster surveying every state to measure support for Clinton and Trump. If the survey consists of 1,000 people sampled at random, then on average, about 2 of them will be from Wyoming. There is no way to use such a sample to draw meaningful inferences about the voting preferences of the entire state of Wyoming. If pollsters are interested in Wyoming, then they need to deliberately oversample people from Wyoming. Having 30 or 40 individuals from Wyoming could potentially be enough to draw inferences about its voting preferences.
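To put rough numbers on the Wyoming example, here is a minimal sketch in Python. The population figures are my own approximate assumptions, and the margin-of-error formula is the usual textbook normal approximation for a proportion (which is itself unreliable at very small sample sizes), so treat the output as illustrative rather than exact.

```python
import math

# Assumed, approximate 2016 population figures (not taken from any poll).
WYOMING_POP = 585_000
US_POP = 323_000_000


def expected_count(sample_size, group_pop, total_pop):
    """Expected number of respondents from a group in a simple random sample."""
    return sample_size * group_pop / total_pop


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated from n people.

    Uses the normal approximation with the worst case p = 0.5.
    """
    return z * math.sqrt(p * (1 - p) / n)


expected_wy = expected_count(1000, WYOMING_POP, US_POP)
print(f"Expected Wyoming respondents in a sample of 1000: {expected_wy:.1f}")

# Compare a sample of 2 with oversampled groups of 30 and 40.
for n in (2, 30, 40):
    print(f"n = {n:>2}: margin of error is about +/- {100 * margin_of_error(n):.0f} points")
```

With only 2 respondents the margin of error is so wide that the estimate is useless. With 40 it is still wide (around 15 points under these assumptions), but it is at least narrow enough to say something about the state.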
In practice, oversampling is often used to survey “interesting” demographic groups that would not be captured in regular polls. Hispanics who do not speak English are an interesting demographic group because they make up large portions of particular regions of the country, like Arizona and parts of Florida, and because their voting preferences can sway an election. However, they are rarely included in polls because most polls are English-only. To capture non-English-speaking Hispanics, it makes more sense to intentionally call more of them and ask questions in Spanish than to indiscriminately call everyone and hope that some of them end up in the sample.
The most common criticism of oversampling is one that seemingly makes sense. If a rare demographic group with high Clinton support is oversampled, doesn’t that skew the sample in favor of Clinton supporters? This criticism is addressed by a technique called poststratification. Rather than define poststratification in the abstract, I will give an example that makes it immediately apparent how it works.
Imagine you want to study support for a political candidate among men and women in the population. You call 100 people, all of whom respond; 35 of them are men and 65 of them are women. You know that in reality, approximately half the population is male and half is female, so you can weight the men’s responses more heavily (and the women’s less heavily) so that the sample represents the population.
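The reweighting in this example can be sketched in a few lines of Python. The 35/65 split and the 50/50 population shares come from the example above; the number of supporters in each group is a hypothetical figure I made up purely for illustration.

```python
# Sample from the example: 35 men and 65 women responded.
# The supporter counts are hypothetical, chosen only to illustrate the method.
sample = {
    "men": {"n": 35, "supporters": 14},    # 40% support among sampled men
    "women": {"n": 65, "supporters": 39},  # 60% support among sampled women
}

# Known population shares (roughly half men, half women).
population_share = {"men": 0.5, "women": 0.5}

total_n = sum(group["n"] for group in sample.values())

# Raw estimate: ignores the fact that women are overrepresented.
raw_estimate = sum(group["supporters"] for group in sample.values()) / total_n

# Weight for each respondent = population share / sample share of their group.
weights = {
    name: population_share[name] / (group["n"] / total_n)
    for name, group in sample.items()
}

# Poststratified estimate: each group's support rate, weighted by its
# true share of the population rather than its share of the sample.
poststratified = sum(
    population_share[name] * group["supporters"] / group["n"]
    for name, group in sample.items()
)

print(f"Raw estimate:            {raw_estimate:.2f}")
print(f"Weight per man:          {weights['men']:.2f}")
print(f"Weight per woman:        {weights['women']:.2f}")
print(f"Poststratified estimate: {poststratified:.2f}")
```

With these made-up numbers, the raw sample says 53% support, but that figure is inflated because women (who are more supportive in this hypothetical) are overrepresented. After weighting each man by about 1.43 and each woman by about 0.77, the estimate drops to 50%. The same logic explains why oversampling a group does not skew the final number: the group’s responses are weighted back down to its true share of the population.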
Similarly, if the Clinton campaign wants to measure support among Florida Hispanics over the age of 65, the Census Bureau provides population frequencies that can be used to adjust the poll, so that the polled Florida Hispanics over the age of 65 are weighted to match their true share of the population.
Poststratification is fantastic because it allows deliberately non-representative polls to provide meaningful information about the population of interest. Andrew Gelman, Professor of Statistics and Political Science at Columbia University, used poststratification on a highly non-representative Xbox poll of the 2012 presidential election to accurately predict an Obama victory. This is particularly remarkable not just because of the nature of the poll, but because it accurately forecasted support among demographic groups that are not common Xbox users, like women aged 65 and older.
The polls badly missed Trump’s victories in Midwestern states like Michigan and Wisconsin, and it will be a while until we figure out exactly why. However, it is unlikely that oversampling is the primary culprit, since oversampling exists to provide highly useful information about important demographic groups, and poststratification corrects for the imbalance it introduces.