This post has 3 questions for you below.
Multilevel Regression and Poststratification (MRP) aims to address nonresponse bias. Suppose we want to estimate E[Y], the population mean, but we only observe Y for respondents. For example, suppose Y is voting Republican, and respondents are more (or less) Republican than the population. If we have population data on X, e.g. a bunch of demographic variables, then we can estimate E[Y|X] from the sample and aggregate over the population distribution of X: E[Y] = E[E[Y|X]]. So even if our sample has the wrong distribution of X, we at least fix that with some calibration.
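To make the aggregation concrete, here is a minimal R sketch of plain poststratification (no regularization yet); the data frames survey and poststrat and the demographic variables are hypothetical stand-ins:

# survey: one row per respondent, with binary outcome y and demographics
# poststrat: one row per demographic cell, with population count N
fit <- glm(y ~ age_group + sex + education, family = binomial(), data = survey)

# estimated E[Y | X = x_j] for each poststratification cell j
cell_pred <- predict(fit, newdata = poststrat, type = "response")

# E[Y] = sum_j (N_j / N) * E[Y | X = x_j]
sum(poststrat$N * cell_pred) / sum(poststrat$N)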
With nonresponse worsening, we want to adjust for a lot of covariates X (including their interactions!). Estimates from such big models will be unstable without a lot of data and/or regularization.
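For a sense of scale, a purely illustrative example (using the hypothetical survey data frame from above): with a handful of demographics plus all their two-way interactions, the design matrix quickly has hundreds of columns, far more than an unregularized fit can support.

# all main effects plus every two-way interaction of a few demographics
ncol(model.matrix(~ (age_group + sex + education + race + state)^2, data = survey))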
As Andrew writes about MRP and RPP:
5. The multilevel part of MRP comes because you want to adjust for lots of cells j in your poststrat
…
But it’s not crucial that the theta_j’s be estimated using multilevel regression. More generally, we can use any regularized prediction method that gives reasonable and stable estimates while including a potentially large number of predictors.
Hence, regularized prediction and poststratification. RPP. It doesn’t sound quite as good as MRP but it’s the more general idea.
As Andrew writes about the lasso:
Lasso (“least absolute shrinkage and selection operator”) is a regularization procedure that shrinks regression coefficients toward zero…Somehow it took the Stanford school of open-minded non-Bayesians to regularize in the way that Bayesians always could—but didn’t…I was fitting uncontrolled regressions with lots of predictors and not knowing what to do…
“Over the past few months I [Tibshirani] have been consumed with a new piece of work [with Richard Lockhart, Jonathan Taylor, and Ryan Tibshirani] that’s finally done. Here’s the paper, the slides (easier to read), and the R package.”
QUESTION 1: These links are broken. [They’re no longer broken! I found the correct links and went in and updated them. — AG]
As Andrew writes about the “bet on sparsity” principle:
sparse models can be faster to compute, easier to understand, and yield more stable inferences.
Though this “easier to understand” may be kind of fake, as Andrew says:
lasso (or alternatives such as horseshoe) are fine, but I don’t think they really give you a more interpretable model. Or, I should say, yes, they give you an interpretable model, but the interpretability is kinda fake, because had you seen slightly different data, you’d get a different model. Interpretability is bought at the price of noise—not in the prediction, but in the chosen model.
And:
In most settings I actually find it difficult to directly interpret more than one coefficient in a regression model.
Same.
QUESTION 2: Ok, so forgetting about the interpretability stuff, let’s say I am doing a giant Regularized Prediction and Poststratification (RPP). Do you know of papers doing RPP with sparsifying priors?
Implementation time.
Starting with the lasso: Tibshirani’s selectiveInference R package gives a logistic example here (our Y is binary too). I extended their example using rstanarm and brms, with Laplace priors of scale 1/lambda (see ESL p. 72):
mod_rstanarm <- stan_glm(
  formula = formula_all,
  data = data,
  family = binomial(),
  prior = laplace(0, 1.25)  # Laplace prior on the coefficients, scale = 1/lambda
)
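The corresponding brms call is not shown above, so here is a minimal sketch of one way it could be written, assuming the same formula_all and data objects and the same scale (double_exponential is Stan’s name for the Laplace distribution):

library(brms)

mod_brms <- brm(
  formula = formula_all,
  data = data,
  family = bernoulli(),
  # Laplace (double exponential) prior on the coefficients, scale = 1/lambda
  prior = prior(double_exponential(0, 1.25), class = "b")
)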
Here are my results comparing the point estimates:
QUESTION 3: Why does Bayesian inference with Laplace priors (and the same likelihood and data) give posterior means more similar to selectiveInference than to lasso? Is this expected behavior, or do you suspect my code has a bug?
(I asked about this here, in Andrew’s post about a post-selection inference question from Richard Artner.)
Moving away from lasso to other sparsifying priors, Piironen and Vehtari (2017) conclude:
the (regularized) horseshoe consistently outperforms Lasso in terms of predictive accuracy…A clear advantage for Lasso, on the other hand, is that it is hugely faster
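For reference, the regularized horseshoe is available directly in brms; a minimal sketch, reusing the same setup as the Laplace fit above, with purely illustrative hyperparameters:

mod_horseshoe <- brm(
  formula = formula_all,
  data = data,
  family = bernoulli(),
  # regularized horseshoe of Piironen & Vehtari (2017);
  # par_ratio is a guess at the expected fraction of nonzero coefficients
  prior = prior(horseshoe(df = 1, par_ratio = 0.1), class = "b")
)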
brms even stopped supporting the lasso prior in 2023. On the other hand, here is a more skeptical take on the horseshoe:
In the end while these modifications of the horseshoe population model can help they typically don’t achieve any better performance in practice than a Cauchy population model or the binary mixture normal population model, both of which are much easier to configure and accurately fit out of the box. Consequently it’s hard to argue for the practical utility of the horseshoe model.
So back to QUESTION 2: which sparsifying priors should we use for RPP?