We started our Survey Statistics adventure with this big mountain: not everyone can be in our sample (“unit nonresponse”). Beyond that mountain is another mountain: not everyone in our sample answers all survey questions (“item nonresponse”). Here “nonresponse” covers both not being sampled or asked and refusing to answer. Either way, the result is missing data.
For a visual, I like Figure 10.4 from Groves.
Multilevel Regression and Poststratification (MRP) aims to address unit nonresponse. Suppose we want to estimate E[Y], the population mean, but we only observe Y for respondents. For example, suppose Y is voting Republican, and respondents are more or less Republican than the population. If we have population data on X, e.g., a bunch of demographic variables, then we can estimate E[Y|X] and aggregate: E[Y] = E[E[Y|X]]. So if our sample has the wrong distribution of X, at least we fix that with some calibration.
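To make the aggregation concrete, here’s a minimal poststratification sketch in Python. The single demographic cell variable and the pandas-based estimator are my own toy setup (real MRP would replace the raw cell means with predictions from a multilevel regression):

```python
import pandas as pd

# Toy poststratification: estimate E[Y] by reweighting cell means E[Y | X]
# to the population distribution of X.

# Hypothetical respondent data: one demographic variable X, outcome Y.
sample = pd.DataFrame({
    "X": ["young", "young", "old", "old", "old"],
    "Y": [0, 1, 1, 1, 0],
})

# Hypothetical population cell counts (e.g., from a census).
pop_counts = pd.Series({"young": 700, "old": 300})

cell_means = sample.groupby("X")["Y"].mean()   # estimates E[Y | X]
weights = pop_counts / pop_counts.sum()        # population distribution of X
estimate = (cell_means * weights).sum()        # E[Y] = E[E[Y | X]]
print(estimate)                                # 0.55 here
```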
But what if some of the X are missing? From Bayesian Data Analysis, p. 451:
The paradigmatic setting for missing data imputation is regression, where we are interested in the model p(y|X, θ) but have missing values in the matrix X.
Andrew has blogged about MRP and item nonresponse, recommending one big joint model for Y and X, or: “construct some imputed datasets, and go on and do MRP with those.” More from Bayesian Data Analysis, p. 451:
First model X, y together…At this point, the imputer takes the surprising step of discarding the inferences about the parameters, keeping only the completed datasets Xs…
This line really helped me understand imputation, especially the words “surprising step”. Because really, we go to all this trouble to model everything, and then… why aren’t we done? We would be done if we really believed in this one big joint model. But maybe we want to be more careful, especially about how we model E[Y|X]. So we throw away some of our work and just keep the imputed Xs.
What’s more, we keep multiple versions of these imputed Xs, because we want to reflect our uncertainty about them. Then we combine the analyses run on each version. For more about Multiple Imputation (MI) see, e.g., Stef van Buuren’s book Flexible Imputation of Missing Data.
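As a sketch of that combining step (Rubin’s rules, as in van Buuren’s book), with made-up per-dataset estimates standing in for real analyses:

```python
import numpy as np

# Rubin's rules: pool an estimate and its variance across m completed datasets.
# Hypothetical per-dataset estimates of some coefficient, and their variances.
estimates = np.array([0.52, 0.48, 0.55, 0.50, 0.47])
variances = np.array([0.010, 0.012, 0.011, 0.009, 0.010])

m = len(estimates)
pooled_est = estimates.mean()            # combined point estimate
within_var = variances.mean()            # average within-imputation variance
between_var = estimates.var(ddof=1)      # between-imputation variance
total_var = within_var + (1 + 1 / m) * between_var
print(pooled_est, np.sqrt(total_var))
```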
Ok, so this sounds sensible! Implementation time. Here’s where I get stuck:
- Scale: you’ve got thousands of X predictors (in hundreds of batches) and hundreds of thousands of survey responses. Everything can be missing.
- Cross-validation: Kuh et al. (2023) say cross-validation may not be suitable for evaluating the MRP model for E[Y|X], but people do it anyway (Wang & Gelman 2014). Jaeger et al. (2020) remind us to do imputation (which uses the Y) within each cross-validation replicate. They investigate whether we can get away with imputation without Y, as a step before cross-validation.
So we’ve got a scale problem, made even worse if we do imputation during cross-validation.
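For concreteness, here’s a minimal sketch of refitting the imputation inside each cross-validation replicate, per Jaeger et al.’s reminder. Everything here (the fake data, mean imputation via scikit-learn) is a placeholder; an imputation model that uses Y would likewise have to be refit inside each fold:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Fake data: covariates X with missing entries, continuous outcome y.
n, p = 500, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
X[rng.random(size=X.shape) < 0.2] = np.nan   # poke 20% holes in X

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Refit the imputer on training data only, inside each replicate,
    # so nothing leaks from the held-out fold.
    imputer = SimpleImputer(strategy="mean").fit(X[train_idx])
    X_train, X_test = imputer.transform(X[train_idx]), imputer.transform(X[test_idx])
    model = LinearRegression().fit(X_train, y[train_idx])
    scores.append(mean_squared_error(y[test_idx], model.predict(X_test)))
print(np.mean(scores))
```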
Two recent papers in Statistical Methods in Medical Research look into getting away with a single, deterministic imputation of missing Xs that doesn’t use Y:
- D’Agostino McGowan et al. (2024): The “Why” behind including “Y” in your imputation model. See arXiv for access.
- Sisk et al. (2023): Imputation and missing indicators for handling missing data in the development and deployment of clinical prediction models: A simulation study.
Let:
- Z = observed covariates
- X = unobserved covariates
- Y = outcome
D’Agostino McGowan et al. (2024) look at continuous Y and linear models for E[Y|X,Z]. Sisk et al. (2023) look at binary Y and logistic models for E[Y|X,Z]. Both consider:
- deterministic imputations
  - with the outcome: Xhat(Z, Y), estimating E[X | Z, Y]
  - or without: Xhat(Z), estimating E[X | Z]
- random imputations
  - with the outcome: X ~ p(x | z, y) (this is the deluxe version of imputation that Andrew recommends)
  - or without: X ~ p(x | z)
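To pin the taxonomy down, here’s a toy sketch of the variants for a single continuous X. The linear-regression imputers and the data-generating process are my own placeholder choices, not the papers’ exact setups:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Toy data: Z fully observed, X partially missing, Y observed.
n = 1000
Z = rng.normal(size=n)
X = 0.5 + 0.8 * Z + rng.normal(size=n)
Y = 1.0 + 2.0 * X + 3.0 * Z + rng.normal(size=n)
miss = rng.random(n) < 0.3
obs = ~miss

# Deterministic, WITHOUT the outcome: Xhat(Z), estimating E[X | Z].
imp_noY = LinearRegression().fit(Z[obs].reshape(-1, 1), X[obs])
Xhat_noY = imp_noY.predict(Z[miss].reshape(-1, 1))

# Deterministic, WITH the outcome: Xhat(Z, Y), estimating E[X | Z, Y].
ZY = np.column_stack([Z, Y])
imp_Y = LinearRegression().fit(ZY[obs], X[obs])
Xhat_Y = imp_Y.predict(ZY[miss])

# Random versions add a draw around the conditional mean, e.g.:
resid_sd = (X[obs] - imp_Y.predict(ZY[obs])).std()
X_draw = Xhat_Y + rng.normal(scale=resid_sd, size=Xhat_Y.shape)
```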
Let’s see how their recommendation does with a linear MRP outcome model E[ Y | Z, X ] = b0 + b1 X + b2 Z + b3 X Z.
Suppose we have a perfect imputation model for E[X | Z] and a perfect outcome model. Then plugging in the imputed value, we’d have E[Y | Z, E[X | Z]], which is just E[Y | Z] (because me telling you Z is the same as me telling you Z and some function of Z).
Then we can iterate the expectation to get E[Y | Z] = E[ E[Y | Z, X] | Z ] = b0 + b1 E[X | Z] + b2 Z + b3 E[X | Z] Z. So regressing Y on the imputed E[X | Z], on Z, and on their interaction gets back the parameters of our true MRP outcome model.
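A quick simulation check of this claim, using the true E[X | Z] as the “perfect” imputation (the data-generating numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# True model: E[Y | Z, X] = b0 + b1*X + b2*Z + b3*X*Z.
Z = rng.normal(size=n)
X = 0.5 + 0.8 * Z + rng.normal(size=n)   # so E[X | Z] = 0.5 + 0.8*Z
b0, b1, b2, b3 = 1.0, 2.0, 3.0, -1.0
Y = b0 + b1 * X + b2 * Z + b3 * X * Z + rng.normal(size=n)

# "Perfect" deterministic imputation without the outcome.
Xhat = 0.5 + 0.8 * Z

# OLS of Y on (1, Xhat, Z, Xhat*Z) should recover (b0, b1, b2, b3).
D = np.column_stack([np.ones(n), Xhat, Z, Xhat * Z])
coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
print(coef)   # close to [1.0, 2.0, 3.0, -1.0]
```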
But if the model is logistic, this doesn’t quite go through: the logistic link is nonlinear, so averaging logit⁻¹(b0 + b1 X + …) over X | Z is not the same as evaluating it at E[X | Z]. Indeed, Sisk et al. (2023) report “minimal bias”, unlike D’Agostino McGowan et al. (2024), who show unbiasedness in the linear case.
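A quick numerical illustration of why the logistic case breaks (the distribution of X | Z here is arbitrary):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(3)

# For one fixed Z, compare averaging the logistic curve over X | Z
# with evaluating it at E[X | Z]: a Jensen's-inequality-style gap.
x = rng.normal(loc=1.0, scale=1.5, size=1_000_000)   # draws of X | Z
print(expit(x).mean())    # E[ logit^-1(X) | Z ]  ~ 0.67
print(expit(x.mean()))    # logit^-1( E[X | Z] )  ~ 0.73
```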
So where does this leave us? The scale issue is serious. With nonresponse bias worsening, we want to adjust for a lot of covariates X. This is in tension with handling missing covariates via one big joint model for Y and X (or via imputation during cross-validation). I appreciate these papers that look into what practitioners are often doing!