Last week we entered a new paradigm: whether you respond to a survey (R) may depend on the outcome (Y), even after controlling for covariates (X). This is called Missing Not At Random (MNAR), in contrast to Missing At Random (MAR). Michael Bailey's post pushes beyond MAR. In contrast, replies to the post focus on improving the plausibility of MAR by adding additional covariates Z to the set of adjustment covariates X.
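In symbols (using the R, Y, X above), the distinction is:

```latex
% MAR: within levels of the covariates X, responding carries no information about Y
\Pr(R = 1 \mid X, Y) = \Pr(R = 1 \mid X) \qquad \text{(MAR)}

% MNAR: response probability still depends on the outcome Y, even given X
\Pr(R = 1 \mid X, Y) \neq \Pr(R = 1 \mid X) \qquad \text{(MNAR)}
```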
In the comment section of our last post, Andrew pointed to his and Gustavo's response: Challenges in Adjusting a Survey That Overrepresents People Interested in Politics. They begin:
We agree with the general point of Bailey (2023) that random sampling is a distant benchmark for real-world polls…
Rod Little reminds us that random sampling is still relevant in the world beyond opinion polling:
might be defensible for opinion polling, but not for the field of survey sampling as a whole… government statistical agencies… survey research organizations…. strive to conduct high-quality probability surveys.
Andrew and Gustavo continue with a focus on adjustment for differences between sample and population:
Adjusting for party identification or voting history can be challenging because these variables are not tabulated in the census…
For one way to address this, see our post on 2 flavors of calibration. There we discussed Kuriwaki et al. 2024, who add a variable Z to the auxiliary data X using known totals of Z (e.g., past election results).
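To make the idea concrete, here is a minimal sketch in Python of calibrating weights to a known total of Z. The single binary Z and all the numbers are invented for illustration, and this is bare poststratification on one variable, not the estimator of Kuriwaki et al. 2024:

```python
import numpy as np

# Hypothetical respondent data: z is past vote (1 = voted for party A),
# w0 are the base weights before calibration. All values are made up.
rng = np.random.default_rng(0)
z = rng.binomial(1, 0.40, size=1000)   # sample underrepresents party A voters
w0 = np.ones(1000)

# Known population share of Z, e.g. from past election results (assumed).
pop_share = 0.48

# One-variable calibration: rescale weights within each level of Z
# so the weighted share matches the known population total.
sample_share = np.average(z, weights=w0)
w = w0.copy()
w[z == 1] *= pop_share / sample_share
w[z == 0] *= (1 - pop_share) / (1 - sample_share)

print(np.average(z, weights=w))        # now matches 0.48
```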
Then Andrew and Gustavo pivot to talking about interest in politics, which we can call Z_2 (it's all the rage in training theory):
It is not clear, though, what to do about this oversampling of people who are interested in politics, given that the distribution of this variable is not known in the general population.
One option for population data on this variable comes from the American National Election Study (ANES), which asks (see the 2020 ANES questionnaire):
(This seems to be a different question from the one Andrew and Gustavo use?)
The ANES is similar to the organizations that Rod Little says still “strive to conduct high-quality probability surveys”. The 2020 ANES methodology report says the response rate was 36.7%. (Why does their data quality page only report response rates through 2000?) While this is lower than in past years, Sharon Lohr reminds us:
starting with a probability sample has several advantages even when response rates are low:
– The sampling frame for a probability sample is well defined, and many frames used in practice have high coverage….
– The probability sample often has more information available that can be used for weighting, imputation, or other types of nonresponse modeling…
The 2020 ANES methodology report says the sampling frame was the US Postal Service Computerized Delivery Sequence File. Their nonresponse adjustment included single- vs. multi-family dwelling, whether the address has a telephone number, and census division.
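As a sketch of how frame covariates like these can feed a nonresponse adjustment, here is a response-propensity version in Python. The ANES report describes its own weighting procedure; the data, variable values, and logistic model below are invented for illustration, not a reconstruction of what the ANES did:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000

# Hypothetical frame covariates for all sampled addresses:
multi_family = rng.binomial(1, 0.3, n)     # single vs. multi-family dwelling
has_phone = rng.binomial(1, 0.7, n)        # address has a telephone number
division = rng.integers(0, 9, n)           # 9 census divisions

X = np.column_stack([multi_family, has_phone,
                     np.eye(9)[division]]) # one-hot encode the division

# Simulated response indicator that depends on the frame covariates:
logit = -0.5 + 0.4 * has_phone - 0.3 * multi_family
responded = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Fit a response-propensity model; respondents get weight 1 / propensity:
model = LogisticRegression(max_iter=1000).fit(X, responded)
p_hat = model.predict_proba(X)[:, 1]
w_nr = 1.0 / p_hat[responded == 1]
```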
It is unclear how severely the ANES's nonresponse mechanism impacts its ability to provide reliable population data. So adjustment for Z_2 (political interest) may not be as good as adjustment for Z (past vote), but perhaps it is still worth doing?
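If one did use the ANES to estimate the population distribution of Z_2, the adjustment might look like the following sketch (Python, with invented shares and an invented opt-in sample). Note that it treats the ANES estimates as known totals, which glosses over exactly the nonresponse concern above:

```python
import numpy as np

# Hypothetical population shares of political interest (Z_2), as if
# estimated from a weighted ANES tabulation -- these values are made up.
pop_shares = {"high": 0.25, "medium": 0.45, "low": 0.30}

# Hypothetical opt-in poll that overrepresents the politically interested:
rng = np.random.default_rng(2)
z2 = rng.choice(["high", "medium", "low"], p=[0.55, 0.35, 0.10], size=2000)
w = np.ones(len(z2))

# Poststratify: rescale weights so weighted shares of Z_2 match the targets.
sample_shares = {lvl: np.mean(z2 == lvl) for lvl in pop_shares}
for lvl, target in pop_shares.items():
    w[z2 == lvl] *= target / sample_shares[lvl]

print(np.average(z2 == "high", weights=w))  # weighted share now 0.25
```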
Andrew and Gustavo conclude:
we should recognize the potential importance of going beyond conventional adjustment variables.