on a regression problem. I knew that the target I wanted to build a predictive model for was a count (i.e. 0, 1, 2, …). Consequently, I immediately thought of choosing a Generalized Linear Model (GLM) with a relevant discrete distribution, such as the Poisson or the Negative Binomial distribution. But things did not go as smoothly as expected: I mistook haste for speed.
Zero-inflated data
First of all, let us have a look at a dataset suitable for this post. I have chosen the results of the NextGen National Household Travel Survey [1]. The variable of interest, named “BIKETRANSIT”, is the number of “days in last 30 days cycling used”, i.e. an integer between 0 and 30 (30 corresponding to daily users). Here is a histogram of the variable in question.
We can clearly see that the count data is zero-inflated: most of the respondents have not used a bike a single day over the last 30 days. I have also noticed an interesting pattern: there tend to be more people reporting bike use on exactly 5, 10, 15, 20, 25, or 30 days than on the adjacent numbers, probably because respondents prefer round numbers when they are unsure of the precise count. Whatever the reason, in this post we will focus primarily on the issue of zero inflation by comparing models designed for zero-inflated count data.
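As a quick reference, here is a minimal sketch of how such a histogram can be produced, assuming the survey data has already been loaded into a pandas DataFrame named df (the variable name is mine):
import matplotlib.pyplot as plt

# histogram of the number of days cycled over the last 30 days
plt.hist(df["BIKETRANSIT"].dropna(), bins=range(0, 32))
plt.xlabel("BIKETRANSIT (days cycled over the last 30 days)")
plt.ylabel("number of respondents")
plt.show()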
Several survey fields have been selected as independent variables to explain the number of bike days (e.g., age, gender, worker category, education level, household size, and district characteristics). I intentionally excluded features that count the number of days spent on other activities (such as using taxis or shared bikes), since some of them are highly correlated with the outcome of interest. I want the model to remain realistic: predicting bike usage over 30 days based on taxi, car, or public transport usage over the same period would not provide meaningful insights.
Poisson regression limits
Before introducing the zero-inflated models, I would like to illustrate the limits of Poisson regression, which I first considered for this dataset. I have not looked at the Negative Binomial distribution in this section. Poisson regression assumes that the dependent random variable Y follows a Poisson distribution, conditional on the independent variables X and the parameters β.
P(Y = y | X, β) = λ(X, β)^y · e^(−λ(X, β)) / y!   for y = 0, 1, 2, …
So, let’s take a look at some empirical distributions of Y∣X,β. Since I included many features, it is difficult to find a large number of observations with exactly the same values of X. To address this, I used a clustering algorithm — AgglomerativeClustering from scikit-learn [2] — to group observations with similar feature profiles.
First, I preprocessed the data so that it can feed both the regression models and the clustering algorithm. I do not want to spend too much time explaining all the preprocessing steps, as this is not the focus of this post; the full preprocessing code is available in the repo [8]. In brief, I encoded the categorical features using one-hot encoding, and applied several steps to the other features: imputing missing values, clipping outliers, and applying transformation functions where appropriate. Finally, I performed the clustering on the transformed dataset.
from sklearn.cluster import AgglomerativeClustering
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    [
        # scale the data: numerical features such as age have much larger ranges
        # than the one-hot encoded features, and the clustering relies on distances
        ("scaler", StandardScaler()),
        # 100 clusters so that the biggest groups contain many observations
        ("cluster", AgglomerativeClustering(n_clusters=100)),
    ]
)
# X_train_preprocessed is the numerical dataframe obtained after encoding the categorical features
cluster_id = pipe.fit_predict(X_train_preprocessed)
Then, for each cluster, I estimated the Poisson parameter λ with its unbiased estimator, the sample mean of the observations in the group.
λ̂ = (y₁ + … + yₙ) / n, the sample mean of the n observations in the group.
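In code, this boils down to a per-cluster aggregation; a minimal sketch, assuming y_train is a NumPy array of targets aligned with X_train_preprocessed (the variable name is mine):
import numpy as np

# unbiased estimate of the Poisson parameter: the sample mean within each cluster
lam_hat_per_cluster = {
    c: y_train[cluster_id == c].mean() for c in np.unique(cluster_id)
}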
I then plotted the empirical histograms along with the probability mass functions of the fitted Poisson distributions for several groups of observations. To assess the quality of the fit, I computed both the cross-entropy and the entropy, noting that entropy serves as a lower bound for cross-entropy according to Gibbs’ inequality [3]. A good model should produce a cross-entropy value close to the entropy (though slightly larger).
For this analysis, I focused on three of the largest groups, since parameter estimation is more reliable with larger sample sizes. This is particularly important here because the data is skewed due to zero inflation, making it necessary to collect many observations. Among the groups, two contain bike users, while one group (228 respondents) reported no bike usage at all. For this last group, no Poisson distribution was fitted, as the Poisson parameter must be strictly greater than zero. Finally, I used a vertical log scale in the plots to account for the zero inflation.
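For a given group, the two quantities can be computed as in the following minimal sketch, where y_group is assumed to hold the counts observed in that group (the function and variable names are mine):
import numpy as np
from scipy.stats import poisson

def entropy_and_cross_entropy(y_group: np.ndarray) -> tuple[float, float]:
    # empirical distribution of the counts observed in the group
    values, counts = np.unique(y_group, return_counts=True)
    freq = counts / counts.sum()
    # entropy of the empirical distribution (the lower bound from Gibbs' inequality)
    entropy = -np.sum(freq * np.log(freq))
    # cross-entropy between the empirical distribution and the fitted Poisson
    lam_hat = y_group.mean()
    cross_entropy = -np.sum(freq * poisson.logpmf(values, lam_hat))
    return entropy, cross_entropy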

I find it difficult to evaluate the quality of the fitted distribution by looking only at the entropy and the cross-entropy. However, I can see that the histogram and the fitted probability mass function differ a lot. This is why I then considered the Zero-Inflated Poisson (ZIP) distribution.
Models adapted to zero-inflated data
Models designed for zero-inflated data aim to capture both the high probability of zeros and the relatively low probabilities of other events. I explored two main families of such models:
- “Zero-inflated models […] model the zeros using a two-component mixture model. […] The probability of the variable being zero is determined by both the main distribution and the mixture weight”. “A zero-inflated model can only increase the probability of P(x = 0)” [5]. For notation, I use the following setup (slightly different from Wikipedia and other sources). Let X1 be a hidden variable following a Bernoulli distribution. In my notation, the probability of success is p (whereas Wikipedia uses 1-π). Let X2 be another hidden variable following a distribution that allows zeros with nonzero probability. For my use case, I assume X2 is discrete. The observed variable is then defined as X = X1 · X2, which leads to the following probability mass function: P(X = 0) = (1 − p) + p·P(X2 = 0), and P(X = k) = p·P(X2 = k) for k > 0.
We can notice that X1 and X2 are only partially hidden: when X = 0, we cannot know the values of X1 and X2, but as soon as X > 0, both variables are known (X1 = 1 and X2 = X).
- Hurdle models model the observable “random variable […] using two parts, the first of which is the probability of attaining the value 0, and the second part models the probability of the non-zero values” [5]. Unlike zero-inflated models, the second component must follow a distribution in which the probability of zero is exactly zero. Using the same notation as before, X1 models whether the observation is zero or non-zero (typically via a Bernoulli distribution), and X2 follows a distribution that assigns no probability mass to zero. Consequently, the probability mass function is P(X = 0) = 1 − p, and P(X = k) = p·P(X2 = k) for k > 0. A short simulation contrasting the two constructions is sketched right after this list.
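To make the difference between the two families concrete, here is a minimal simulation sketch, using a Poisson distribution for X2 in the zero-inflated case and a zero-truncated Poisson in the hurdle case (the parameter values are arbitrary):
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 10_000, 0.3, 2.0

# zero-inflated construction: X = X1 * X2, where X2 can itself be zero
x1 = rng.binomial(1, p, size=n)
x2 = rng.poisson(lam, size=n)
zip_sample = x1 * x2

# hurdle construction: X1 decides zero vs non-zero, X2 is zero-truncated
def zero_truncated_poisson(lam: float, size: int) -> np.ndarray:
    # rejection sampling: redraw until every value is strictly positive
    out = rng.poisson(lam, size=size)
    while (mask := out == 0).any():
        out[mask] = rng.poisson(lam, size=mask.sum())
    return out

hurdle_sample = np.where(
    rng.binomial(1, p, size=n) == 1, zero_truncated_poisson(lam, n), 0
)

# proportion of zeros: (1 - p) + p * exp(-lam) for the zero-inflated sample,
# exactly (1 - p) in expectation for the hurdle sample
print((zip_sample == 0).mean(), (hurdle_sample == 0).mean())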
Zero-Inflated Poisson model
Let us have a look at the Zero-Inflated Poisson model [4]. The ZIP probability mass function is:
P(X = 0) = (1 − p) + p·e^(−λ)
P(X = k) = p · e^(−λ) · λᵏ / k!   for k = 1, 2, …
It is now possible to extend the previous histograms and Poisson-fitted probability mass functions by adding the ZIP-fitted probability mass functions. To do this, estimators of the two parameters, p and λ, are required. I used the method of moments to derive these estimators: the first two moments provide a system of two equations with two unknowns, which can then be solved.
E[X] = p·λ
E[X²] = p·λ·(1 + λ)
So the parameter estimators are:
λ̂ = m₂/m₁ − 1   and   p̂ = m₁² / (m₂ − m₁)
where m₁ = (1/n) Σᵢ yᵢ and m₂ = (1/n) Σᵢ yᵢ² are the first two sample moments.
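A minimal sketch of these estimators in code, assuming y_group holds the observed counts of one group (the names are mine):
import numpy as np

def zip_method_of_moments(y_group: np.ndarray) -> tuple[float, float]:
    # first two sample moments
    m1 = np.mean(y_group)
    m2 = np.mean(y_group ** 2)
    # solve m1 = p * lam and m2 = p * lam * (1 + lam) for p and lam
    lam_hat = m2 / m1 - 1
    p_hat = m1 ** 2 / (m2 - m1)
    return p_hat, lam_hat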
Finally, I plotted the same two figures again, adding the fitted ZIP probability mass functions as well as the cross-entropy measures.

Both visual inspection and cross-entropy values show that the ZIP model fits the observed data better than the Poisson model. This provides an objective and quantifiable reason to prefer ZIP regression over Poisson regression.
Model comparison
Let us now compare several models. I split the data into training and test sets, but it was not immediately clear which evaluation metrics would be most appropriate. For instance, should I rely on Poisson deviance, even though the data is zero-inflated? Or mean squared error, which heavily penalizes outliers? In the end, I chose to use multiple metrics to better capture model performance: mean absolute error, Poisson deviance, and correlation (a sketch of how they can be computed follows the list below). The models I evaluated are:
- A naïve model predicting the mean value of the training set,
- Linear regression (lr),
- Poisson regression (pr),
- Zero-inflated Poisson regression (zip),
- A chained Logistic–Poisson regression (hurdle model, lr_pr),
- A chained Logistic–Zero-Truncated Poisson regression (hurdle model, lr_tpr).
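For reference, here is a minimal sketch of how the three metrics can be computed with scikit-learn and NumPy, assuming y_test and y_pred are the arrays of observed and predicted counts (the clipping of the predictions is my own precaution, since the Poisson deviance requires strictly positive values):
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_poisson_deviance

mae = mean_absolute_error(y_test, y_pred)
deviance = mean_poisson_deviance(y_test, np.clip(y_pred, 1e-6, None))
correlation = np.corrcoef(y_test, y_pred)[0, 1]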
ZIP model
Let us look at the ZIP regression implementation. First, the negative log-likelihood of the observed data, noted y, is:
−log L(β) = − Σ_{i: yᵢ=0} log((1 − pᵢ) + pᵢ·e^(−λᵢ)) + Σ_{i: yᵢ>0} (−log pᵢ − yᵢ·log λᵢ + λᵢ + log yᵢ!)
(the log yᵢ! term does not depend on β and is dropped in the implementation below)
The marginal likelihood of the observed data, P(Y=y), can be expressed analytically without summing the joint distribution P(Y=y, X1=x1) over the hidden variable X1, so it can be optimized directly without resorting to the expectation-maximization (EM) algorithm [6]. The two distribution parameters p and λ are functions of the features X and of the model parameters β that will be learnt: p is defined as the sigmoid of the dot product between X and β, and λ as the exponential of the dot product between X and β. To make the model more flexible, I use two separate sets of parameters β: one for p and another for λ.
pᵢ = sigmoid(xᵢ·β_p) = 1 / (1 + e^(−xᵢ·β_p))   and   λᵢ = e^(xᵢ·β_λ)
Moreover, I have added a prior on the parameters β to regularize the model, which is especially useful for the Poisson part, for which there are few observations because of the zero inflation. I have assumed a Normal prior, hence the L2 regularization terms added to the loss function. I use two different priors, one on the β of the Bernoulli part and one on the β of the Poisson part, hence the two α hyperparameters, stored as the alpha_b and alpha_p attributes of the model. I tuned these values through hyperparameter optimization.
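The loss and Jacobian below rely on two small helper objects, sigmoid and exp, each exposing a val method (the link function applied to X·β) and a jac method (its derivative with respect to β). They are defined in the shared repo [8]; here is a minimal sketch of what they could look like, reconstructed from how they are called (this is my own reconstruction, not the repo code):
import numpy as np

class sigmoid:
    @staticmethod
    def val(beta: np.ndarray, X: np.ndarray) -> np.ndarray:
        # p = sigmoid(X @ beta), shape (n_obs,)
        return 1.0 / (1.0 + np.exp(-X @ beta))

    @staticmethod
    def jac(beta: np.ndarray, X: np.ndarray) -> np.ndarray:
        # d p / d beta = p * (1 - p) * x_i, shape (n_obs, n_feat)
        p = sigmoid.val(beta, X)
        return (p * (1.0 - p))[:, None] * X

class exp:
    @staticmethod
    def val(beta: np.ndarray, X: np.ndarray) -> np.ndarray:
        # lambda = exp(X @ beta), shape (n_obs,)
        return np.exp(X @ beta)

    @staticmethod
    def jac(beta: np.ndarray, X: np.ndarray) -> np.ndarray:
        # d lambda / d beta = lambda * x_i, shape (n_obs, n_feat)
        return exp.val(beta, X)[:, None] * X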
I created a class that inherits from scikit-learn’s BaseEstimator. The Python implementation of the loss function is shown below (implemented within the class, hence the self argument; at module level, the code relies on numpy imported as np and on xlogy from scipy.special):
def _loss(self, beta: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    n_feat = X.shape[1]
    # split beta into two parts: one for the Bernoulli p and one for the Poisson lambda
    beta_p = beta[:n_feat]
    beta_lam = beta[n_feat:]
    # get the Bernoulli p and the Poisson lambda
    p = sigmoid.val(beta_p, X)
    lam = exp.val(beta_lam, X)
    # initialize the negative log likelihood
    out = 0
    # contribution of the observations with y == 0
    y_e0_mask = np.where(y == 0)[0]
    out += np.sum(-np.log((1 - p) + p * np.exp(-lam))[y_e0_mask])
    # contribution of the observations with y > 0
    y_gt0_mask = np.where(y > 0)[0]
    out += np.sum(-np.log(p)[y_gt0_mask])
    out += np.sum(-xlogy(y, lam)[y_gt0_mask])
    out += np.sum(lam[y_gt0_mask])
    # L2 priors: the intercept columns (the last column of each block) are not regularized
    mask_b = np.ones_like(beta)
    mask_b[n_feat:] = 0
    mask_p = np.ones_like(beta)
    mask_p[:n_feat] = 0
    if self.fit_intercept:
        mask_b[n_feat - 1] = 0
        mask_p[2 * n_feat - 1] = 0
    out += 0.5 * self.alpha_b * np.sum((beta * mask_b) ** 2)
    out += 0.5 * self.alpha_p * np.sum((beta * mask_p) ** 2)
    return out
In order to optimize the loss function with a gradient-based method, I have also computed the Jacobian of the loss.
For an observation i with yᵢ = 0:
∂(−log Lᵢ)/∂β_p = (1 − e^(−λᵢ)) / ((1 − pᵢ) + pᵢ·e^(−λᵢ)) · ∂pᵢ/∂β_p
∂(−log Lᵢ)/∂β_λ = pᵢ·e^(−λᵢ) / ((1 − pᵢ) + pᵢ·e^(−λᵢ)) · ∂λᵢ/∂β_λ
For an observation i with yᵢ > 0:
∂(−log Lᵢ)/∂β_p = −(1/pᵢ) · ∂pᵢ/∂β_p
∂(−log Lᵢ)/∂β_λ = (1 − yᵢ/λᵢ) · ∂λᵢ/∂β_λ
with ∂pᵢ/∂β_p = pᵢ·(1 − pᵢ)·xᵢ and ∂λᵢ/∂β_λ = λᵢ·xᵢ. The gradients of the two L2 priors, α_b·β_p and α_p·β_λ, are added to the sum over observations.
The Python implementation is:
def _jac(self, beta: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    n_feat = X.shape[1]
    # split beta into two parts: one for the Bernoulli p and one for the Poisson lambda
    beta_p = beta[:n_feat]
    beta_lam = beta[n_feat:]
    # get the Bernoulli p and the Poisson lambda
    p = sigmoid.val(beta_p, X)
    lam = exp.val(beta_lam, X)
    # y == 0 & beta_p
    jac_e0_p = np.expand_dims(
        np.where(
            y == 0,
            (1 - np.exp(-lam)) / ((1 - p) + p * np.exp(-lam)),
            np.zeros_like(y),
        ),
        axis=1,
    ) * sigmoid.jac(beta_p, X)
    # y == 0 & beta_lam
    jac_e0_lam = np.expand_dims(
        np.where(
            y == 0,
            p * np.exp(-lam) / ((1 - p) + p * np.exp(-lam)),
            np.zeros_like(y),
        ),
        axis=1,
    ) * exp.jac(beta_lam, X)
    # y > 0 & beta_p
    jac_gt0_p = np.expand_dims(
        np.where(y > 0, -1 / p, np.zeros_like(y)), axis=1
    ) * sigmoid.jac(beta_p, X)
    # y > 0 & beta_lam
    jac_gt0_lam = np.expand_dims(
        np.where(y > 0, 1 - y / lam, np.zeros_like(y)), axis=1
    ) * exp.jac(beta_lam, X)
    # stack the four contributions into the full jacobian
    out = np.concatenate((jac_e0_p + jac_gt0_p, jac_e0_lam + jac_gt0_lam), axis=1)
    # jac of the priors (the intercept columns are excluded, as in the loss)
    mask_b = np.ones_like(beta)
    mask_b[n_feat:] = 0
    mask_p = np.ones_like(beta)
    mask_p[:n_feat] = 0
    if self.fit_intercept:
        mask_b[n_feat - 1] = 0
        mask_p[2 * n_feat - 1] = 0
    return (
        np.sum(out, axis=0)
        + self.alpha_b * beta * mask_b
        + self.alpha_p * beta * mask_p
    )
Unfortunately, the loss function is not convex, so a local minimum is not guaranteed to be a global minimum. I have chosen the limited-memory implementation of Broyden-Fletcher-Goldfarb-Shanno (L-BFGS-B), available through scipy.optimize.minimize, because it is faster than the gradient descent methods that I have tested.
res = minimize(
    self._loss,
    np.zeros(2 * n_feat),
    args=(X, y),
    jac=self._jac,
    method="L-BFGS-B",
)
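Because the Jacobian is derived by hand and the loss is not convex, it is worth checking the analytical gradient against a finite-difference approximation before trusting the optimizer. Here is a minimal sketch using scipy.optimize.check_grad, assuming model is an instance of the class and X, y are the preprocessed training arrays (the names are mine):
import numpy as np
from scipy.optimize import check_grad

rng = np.random.default_rng(0)
beta_test = rng.normal(scale=0.1, size=2 * X.shape[1])  # random point at which to check

# check_grad returns the norm of the difference between the analytical gradient
# and the finite-difference approximation; it should be close to zero
err = check_grad(model._loss, model._jac, beta_test, X, y)
print(f"gradient check error: {err:.2e}")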
The entire class is coded in this file from the shared repo [8].
After a hyperparameter tuning phase to find the best regularization strengths, I finally computed the chosen metrics on the test set. The fitting time is reported alongside the metrics.

Zero-inflated models — both ZIP and hurdle — achieve better metrics than the naïve model, linear regression, and standard Poisson regression. I initially expected a larger performance gap, given that the empirical histogram of the observed Y more closely resembles a ZIP distribution than a Poisson distribution. The improvement, however, comes at the cost of longer fitting times, particularly for the ZIP model. For this use case, hurdle models appear to offer the best compromise, delivering strong performance while keeping training time relatively low.
One possible reason for the relatively modest improvement may be that the data does not strictly follow a ZIP distribution. To investigate this, I ran another benchmark using the same models on a synthetic dataset specifically generated to follow a ZIP distribution. This dataset was designed to have approximately the same number of observations and features as the original one, but with a target variable that follows a ZIP distribution by design (a sketch of such a generator is shown below).
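Here is a minimal sketch of how such a synthetic target can be generated from a random feature matrix; the links mirror the parameterization of the ZIP model above, and the sizes and coefficients are placeholders of my own (the post only states that the synthetic dataset roughly matches the original one):
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_feat = 10_000, 50  # placeholder sizes

X_syn = rng.normal(size=(n_obs, n_feat))
beta_p = rng.normal(scale=0.2, size=n_feat)    # coefficients of the Bernoulli part
beta_lam = rng.normal(scale=0.1, size=n_feat)  # coefficients of the Poisson part

p = 1.0 / (1.0 + np.exp(-X_syn @ beta_p))  # sigmoid link
lam = np.exp(X_syn @ beta_lam)             # exponential link

# the target follows a ZIP distribution by construction: Bernoulli gate times Poisson count
y_syn = rng.binomial(1, p) * rng.poisson(lam)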

When the target truly follows a ZIP distribution, the ZIP model outperforms all the other models considered. It is also worth noting that, in this synthetic setup, the features are no longer sparse (by design), which may help explain the reduction in fitting time.
Conclusions
Before choosing a statistical model, it is crucial to carefully analyze the dataset rather than relying solely on prior assumptions about its characteristics. Examining the empirical distribution — such as through histograms — often reveals insights that guide the choice of an appropriate probability model.
This is particularly important for zero-inflated data, where standard models may struggle. A synthetic example with a zero-inflated Poisson (ZIP) distribution shows how the right model can provide a much better fit compared to alternatives, even if those alternatives are not entirely misguided.
For zero-inflated datasets, models such as the zero-inflated Poisson or hurdle models are especially useful. While both can capture excess zeros effectively, hurdle models sometimes offer comparable performance with faster training.
Further readings
When working on this topic and writing this post, I came across this Medium post [7], which I highly recommend.