Stepwise Selection Made Simple: Improve Your Regression Models in Python

To get the most out of this tutorial, you should already have a solid understanding of how linear regression works and the assumptions behind it. You should also be aware that, in practice, multicollinearity is addressed using the Variance Inflation Factor (VIF). In addition, you need to understand what prediction risk means, and be familiar with the basics of Python as well as its core functions.

At the end of this article, you will find the code for the stepwise selection procedure used here. The implementation follows two key principles: orthogonality and Don’t Repeat Yourself (DRY), ensuring clean, modular, and reusable code.

Reducing the number of variables in a regression model is not only a technical exercise; it is a strategic choice that must be guided by the objectives of the analysis. In a previous work, we demonstrated how simple tools, such as correlation analysis and the Variance Inflation Factor (VIF), can already shrink a dataset with hundreds of predictors into a far more compact model. Yet, even after this initial reduction, models often still contain too many variables to work effectively. A smaller model with fewer predictors offers several advantages: it may yield better predictions than a larger model, it is more parsimonious, hence easier to interpret, and it often generalizes better. As more variables are added, the model’s bias decreases but its variance increases. This is the essence of the bias–variance trade-off: too few variables lead to high bias (underfitting), whereas too many lead to high variance (overfitting). Good predictive performance requires a balance between the two.

This raises a fundamental question for anyone working with regression models: how do we decide which variables deserve to be included in the model? In other words, how can we reduce the dimensionality of our data without losing essential information?

The challenge depends on the purpose of the analysis. Should the model provide precise estimates of the coefficients? Should it identify which predictors are important? Or should it maximize predictive accuracy? Each of these objectives calls for different approaches to model selection, and ignoring this distinction can lead to misleading conclusions.

In this article, we address the challenge of model selection in regression. We begin by outlining the general framework of linear regression (readers already familiar with it may skip this section). We then review the main scoring criteria used to evaluate competing models, followed by a discussion of the procedures that allow us to explore subsets of the possible model space. Finally, we illustrate these methods with a Python application using the Communities and Crime dataset.

1. Framework of Linear Regression

In this section, we provide a brief overview of the linear regression model. We begin by describing the dataset, including the number of observations and the number of covariates. We then introduce the model itself and outline the assumptions made about the data.

We assume that we have a dataset with n observations and p covariates. The response variable, denoted by Y, is continuous, and the covariates are denoted by X₁, …, Xₚ. We assume that the relationship between the response variable and the covariates is linear, that is:

Yᵢ = β₀ + β₁Xᵢ₁ + ⋯ + βₚXᵢₚ + εᵢ,

for i = 1, …, n, where β₀ is the intercept, βⱼ is the coefficient of the j-th covariate, and εᵢ is the error term. We assume that the error terms are independent and identically distributed (i.i.d.) with mean zero and variance σ².
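
To make the setup concrete, here is a minimal sketch (not part of the original derivation) that simulates data satisfying these assumptions and fits the model by ordinary least squares with statsmodels; the coefficient values are arbitrary.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 200, 3

# Covariates and i.i.d. Gaussian errors with mean zero and variance sigma^2 = 1
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
eps = rng.normal(size=n)
y = 0.3 + X @ beta + eps           # beta_0 = 0.3 is the intercept

# Fit Y_i = beta_0 + sum_j beta_j X_ij + eps_i by least squares
fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.params)                  # estimated intercept and coefficients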

With the regression framework in place, the next step is to face the central challenge of model selection: how do we compare different subsets of variables?

2. Scoring Criteria for Evaluating Competing Models

In model selection, the first challenge is to assign a score to each model, where a model is defined by a particular subset of covariates. This section explains how models can be scored.

Let us first discuss the problem of scoring models. Let S ⊂ {1, …, p} and let 𝓧S = {Xⱼ : j ∈ S} denote a subset of the covariates. Let βS denote the coefficients of the corresponding set of covariates and let β̂S denote the least squares estimate of βS. Also, let XS denote the design matrix for this subset of covariates and define r̂S(x) to be the estimated regression function. The predicted values from model S are denoted by Ŷᵢ(S) = r̂S(Xᵢ). The prediction risk is defined to be

R(S) = ∑ᵢ₌₁ⁿ 𝔼[(Ŷᵢ(S) − Yᵢ*)²],

where Yᵢ* is a future observation of Yᵢ at the covariate value Xᵢ.

The goal of model selection is to find the subset S that minimizes the prediction risk R(S).

In practice, we cannot compute the prediction risk R(S) directly. Instead, we use an estimate R̂(S) computed from the available data, and this estimated prediction risk serves as our scoring criterion.

The naive estimate of the prediction risk is the training error, defined as:

R̂ₜᵣ(S) = ∑ᵢ₌₁ⁿ (Ŷᵢ(S) − Yᵢ)²,

where Yᵢ is the observed value of the response variable for the i-th observation.

However, the training error is a strongly biased estimate of the prediction risk: it is systematically smaller than R(S). In fact,

𝔼[R̂ₜᵣ(S)] = R(S) − 2 ∑ᵢ₌₁ⁿ Cov(Ŷᵢ(S), Yᵢ).

What explains this bias is that the data are used twice: once to fit the model and once to compute the training error. When we fit a complex model with many parameters, the covariances Cov(Ŷᵢ(S), Yᵢ) will be large and the bias of the training error gets worse. This is why we need a more reliable estimate of the prediction risk.

2.1 Mallows’ Cp statistic

Mallows’ Cp statistic is a popular criterion for model selection. It is defined as:

Cp(S) = R̂ₜᵣ(S) + 2|S|·σ̂²,

where |S| is the number of terms in S and σ̂² is the estimate of σ², the variance of the error term, obtained from the full model containing all k covariates. This value is the training error plus a bias correction. The first term measures how well the model fits the data, while the second term measures the model’s complexity. The more complex the model, the larger the second term, and consequently the larger Mallows’ Cp statistic will be.

Mallows’ Cp statistic thus represents a trade-off between model fit and complexity. Finding a good model therefore requires balancing these two aspects: the goal is to identify the model that minimizes Mallows’ Cp.
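
As a hedged sketch (not the exact code used later, which is listed in the appendix), Mallows’ Cp for a candidate subset can be computed by estimating σ̂² from the full model and adding the complexity correction to the subset’s training error. Here X_full is assumed to be a pandas DataFrame with all k covariates and subset_cols a list of column names.

import statsmodels.api as sm

def mallows_cp(y, X_full, subset_cols):
    """Training error of the subset model plus the 2*|S|*sigma_hat^2 correction."""
    n, k = X_full.shape
    # sigma^2 estimated from the full model containing all k covariates
    full_fit = sm.OLS(y, sm.add_constant(X_full)).fit()
    sigma2_hat = full_fit.ssr / (n - k - 1)
    # Training error (residual sum of squares) of the candidate subset model
    sub_fit = sm.OLS(y, sm.add_constant(X_full[subset_cols])).fit()
    return sub_fit.ssr + 2 * (len(subset_cols) + 1) * sigma2_hat

The "+ 1" counts the intercept, matching the convention used in the appendix code.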

Beyond Mallows’ Cp, model selection criteria can also be derived from likelihood-based estimation with penalty terms. This leads us to the next family of methods.

2.2 Likelihood and penalization

The approach below estimates the prediction risk through maximum likelihood estimation of the parameters.

Under the assumption that the error term is normally distributed, the likelihood function is given by:

L(β, σ²) = ∏ᵢ₌₁ⁿ (2πσ²)^(−1/2) exp(−(Yᵢ − β₀ − ∑ⱼ βⱼXᵢⱼ)² / (2σ²)).

Maximizing this likelihood for model S, which has |S| variables, gives the estimates of β and σ²:
β̂(S)ML = β̂(S)OLS and σ̂²(S)ML = (1/n) ∑ᵢ₌₁ⁿ (Yᵢ − Ŷᵢ(S))²,
that is, the maximum likelihood estimate of the coefficients coincides with the ordinary least squares estimate.

The log-likelihood of model S, which has |S| variables, evaluated at these estimates is then given by:

𝓁(S) = −(n/2)·log(σ̂²(S)ML) − (n/2)·log(2π) − n/2.

Choosing the model that maximizes the log-likelihood is therefore equivalent to choosing the model with the smallest residual sum of squares (RSS), that is:

RSS(S) = ∑ᵢ₌₁ⁿ (Yᵢ − Ŷᵢ(S))².
To obtain a criterion that we minimize, we work with the negative log-likelihood. The criterion is generally defined as:

−2𝓁(𝑆) + |𝑆|·𝑓(𝑛)

where 𝑓(𝑛) is a penalty function that depends on the sample size 𝑛.
This formulation allows us to define the AIC and BIC criteria, given as follows:

2.2.1 Akaike Information Criterion (AIC)

Another approach to model selection is the Akaike Information Criterion (AIC). The goal of the AIC is to identify the model that minimizes information loss. In practice, this means choosing the subset S that minimizes the AIC, defined as:

AIC(𝑆) = −2𝓁ₛ + 2|𝑆|

where 𝓁ₛ is the log-likelihood of model 𝑆 evaluated at the maximum likelihood estimates of its parameters. Here, 𝑓(𝑛) = 2.

This criterion can be viewed as a combination of goodness of fit and model complexity.
When comparing two models, the one with the lower AIC value is preferred.

2.2.2 Bayesian Information Criterion (BIC)

The Bayesian Information Criterion (BIC) is another method for model selection. It is similar to the AIC and is defined as:

BIC(𝑆) = −2𝓁ₛ + |𝑆|·log(𝑛)

where 𝓁ₛ is the log-likelihood of model 𝑆 evaluated at the maximum likelihood estimates of its parameters.

It is called the Bayesian Information Criterion because it can be derived from a Bayesian perspective. Let 𝒮 = {S₁, …, Sₘ} denote a set of candidate models. If we assign each model Sᵢ a prior probability π(Sᵢ) = 1/m, then the posterior probability of model Sᵢ given the data is proportional to its likelihood. This leads to the following approximation:

P(Sᵢ | data) ≈ exp(−BIC(Sᵢ)/2) / ∑ⱼ₌₁ᵐ exp(−BIC(Sⱼ)/2).
Thus, choosing the model that minimizes the BIC is equivalent to choosing the model with the highest posterior probability given the data.

BIC also has an interpretation in terms of minimum description length: it balances model fit against complexity. Because its penalty term is 𝑓(𝑛) = log(𝑛), the BIC applies a stronger penalty than the AIC as soon as 𝑛 > 7 (since log(𝑛) > 2 once 𝑛 exceeds e² ≈ 7.4). As a result, BIC typically selects more parsimonious models than AIC, especially as the sample size grows.

As with AIC, when comparing two models, the preferred model is the one with the lower BIC.
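
In Python, both criteria are available directly on a fitted statsmodels OLS result (.aic and .bic). The following sketch on simulated data, with placeholder variable names, illustrates the comparison; lower values are preferred.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(150, 3)), columns=["x1", "x2", "x3"])
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(size=150)   # x3 is irrelevant

def fit_subset(frame, cols):
    return sm.OLS(frame["y"], sm.add_constant(frame[cols])).fit()

m_small = fit_subset(df, ["x1", "x2"])
m_large = fit_subset(df, ["x1", "x2", "x3"])

# The model with the lower AIC (resp. BIC) is preferred; BIC penalizes the
# extra, irrelevant variable more heavily since log(150) > 2.
print("AIC:", round(m_small.aic, 2), round(m_large.aic, 2))
print("BIC:", round(m_small.bic, 2), round(m_large.bic, 2))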

2.3 Leave-One-Out Cross-Validation (LOOCV) and k-Fold Cross-Validation

Another widely used method for model selection is leave-one-out cross-validation (LOOCV). In this approach, the risk estimator is defined as:

R̂_CV(S) = ∑ᵢ₌₁ⁿ (Yᵢ − Ŷ₋ᵢ(S))²,

where Ŷ₋ᵢ(S) is the prediction of Yᵢ from model S fitted on all observations except the i-th one, and Yᵢ is the actual response for the i-th observation.

It can be shown that:

R̂_CV(S) = ∑ᵢ₌₁ⁿ ((Yᵢ − Ŷᵢ(S)) / (1 − hᵢᵢ(S)))²,

where hᵢᵢ(S) is the i-th diagonal element of the hat matrix HS = XS (XSᵀ XS)⁻¹ XSᵀ for model S.

This formulation shows that it is unnecessary to refit the model repeatedly by leaving out one observation at a time. Instead, LOOCV can be computed directly using the fitted values and the hat matrix.
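
A minimal sketch of this shortcut, assuming y is the response and X_subset a DataFrame (or array) containing the covariates in S: the hat-matrix diagonal is obtained from statsmodels’ influence diagnostics, so a single fit suffices.

import numpy as np
import statsmodels.api as sm

def loocv_risk(y, X_subset):
    """LOOCV risk computed from a single fit via the hat-matrix diagonal."""
    fit = sm.OLS(y, sm.add_constant(X_subset)).fit()
    h = fit.get_influence().hat_matrix_diag      # h_ii(S)
    resid = fit.resid                            # Y_i - Y_hat_i(S)
    return np.sum((resid / (1.0 - h)) ** 2)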

A natural extension of LOOCV is k-fold cross-validation, where the data is partitioned into k folds, and the model is trained on k − 1 folds and validated on the remaining fold. This process is repeated across all folds, and the results are averaged to estimate prediction error.

K-Fold Cross-Validation

In this approach, the data is divided into 𝑘 groups, or folds (commonly 𝑘 = 5 or 𝑘 = 10). One fold is left out, and the model is fitted on the remaining 𝑘 − 1 folds. The fitted model is then used to predict the responses in the omitted fold. The risk for that fold is estimated as:

R̂_fold(S) = ∑ᵢ∈fold (Yᵢ − Ŷᵢ(S))²,

where the sum is taken over all observations in the omitted fold. This procedure is repeated for each of the 𝑘 folds, and the overall risk estimate is obtained by averaging the 𝑘 individual risk values.
This method is particularly suitable when the primary goal of regression is prediction. In this setting, alternative performance measures [such as the Mean Absolute Error (MAE) or the Root Mean Squared Error (RMSE)] can also be used to evaluate predictive accuracy.

2.4 Other criteria

In the literature, in addition to the criteria discussed above, several other measures are commonly used for model selection. One widely used option is the adjusted coefficient of determination, defined as:

R²adj(S) = 1 − (1 − R²)·(n − 1)/(n − |S| − 1).
Another approach is to use nested model tests, such as the F-test. The F-test compares two nested models: a smaller model 𝑆₁, whose covariates form a subset of those in a larger model 𝑆₂. The null hypothesis states that the additional variables in 𝑆₂ do not significantly improve the fit relative to 𝑆₁.
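
Both quantities are easy to obtain with statsmodels: the adjusted R² is exposed on the fitted result, and anova_lm performs the F-test between two nested OLS models. A small sketch on simulated data, with placeholder variable names:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(150, 3)), columns=["x1", "x2", "x3"])
df["y"] = df["x1"] + 0.5 * df["x2"] + rng.normal(size=150)

small = smf.ols("y ~ x1", data=df).fit()            # model S1
large = smf.ols("y ~ x1 + x2 + x3", data=df).fit()  # model S2 (S1 nested in S2)

# Adjusted R^2 penalizes the number of covariates, unlike the raw R^2
print(round(small.rsquared_adj, 3), round(large.rsquared_adj, 3))

# F-test: H0 = the extra variables in S2 do not improve the fit over S1
print(anova_lm(small, large))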

Overall, the methods presented above primarily address the two central objectives of linear regression: parameter estimation and variable selection.

Having defined several ways to score a model, the remaining question is how to search the set of candidates to find the model with the best score.

3. Selection Procedure

Once models can be scored, the next step is to search either the entire space of possible models or a selected subset to identify the one with the best score. With 𝑘 covariates, there are 2ᵏ − 1 possible models, a number that quickly becomes impractical for large 𝑘 (for instance, more than one million models when 𝑘 = 20). In such cases, exhaustive search is computationally infeasible, and heuristic methods are preferred. Broadly, model selection strategies fall into two categories: exhaustive search and stepwise search.

3.1 Exhaustive Search

This approach evaluates every possible model and selects the one with the best score. It is feasible only when 𝑘 is small, as the computational burden becomes prohibitive with a large number of covariates.

3.2 Stepwise Search

Stepwise methods aim to identify a local optimum, that is, a model that performs better than its immediate neighbors. These methods are generally recommended only when exhaustive search is not feasible (e.g., when both 𝑛 and 𝑝 are large).

3.2.1 Forward Stepwise Selection

  • Choose a scoring criterion (e.g., AIC, BIC, Mallows’ 𝐶ₚ).
  • Start with an empty model.
  • At each step, add the variable that provides the greatest improvement in the criterion.
  • Continue until no variable improves the score or all variables are included in the model.

3.2.2 Backward Stepwise Selection

  • Choose a scoring criterion (e.g., AIC, BIC, Mallows’ 𝐶ₚ).
  • Start with the full model containing all variables.
  • At each step, remove the variable whose removal yields the greatest improvement in the criterion.
  • Continue until no further improvement is possible or only the essential variables remain.

3.2.3 Stepwise Selection (Mixed Method)

  • Choose a scoring criterion (e.g., AIC, BIC, Mallows’ 𝐶ₚ).
  • Start with an empty model and add variables one at a time, as in forward selection, until no variable further improves the score.
  • Then proceed as in backward selection, removing variables one at a time if doing so improves the criterion.
  • Stop when no additional improvement can be achieved or when all variables are included.

The next section shows how to apply these methods to real data.

4. Application

In practice, before applying model selection techniques, it is essential to ensure that the covariates are not highly correlated. The following procedure can be applied:

  1. Preliminary filtering: Remove covariates that are clearly irrelevant to the response variable (based on expert judgment, treatment of missing values, etc.).
  2. Correlation with the response variable: Define a threshold for the correlation between each covariate and the response variable (e.g., 0.6). Covariates below this threshold may be excluded. (Here, we will not apply this filter to retain a sufficient number of covariates for selection.)
  3. Correlation among covariates: Define a threshold for pairwise correlations between covariates (e.g., 0.7). Compute the correlation matrix; if two covariates exceed the threshold, keep the one with the strongest correlation with the response variable or the one with greater interpretability from a domain perspective.
  4. Variance Inflation Factor (VIF): Compute the VIF for all remaining covariates. If a covariate’s VIF exceeds 5 or 10, it is considered highly collinear with the others and should be removed (a sketch of steps 3 and 4 is given right after this list).
  5. Model selection: Apply the chosen model selection techniques. In this case, we will use Mallows’ 𝐶ₚ as the scoring criterion and backward stepwise selection as the variable selection method.
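
Below is a minimal sketch of steps 3 and 4, assuming X is a pandas DataFrame of candidate covariates and y the response as a Series; the thresholds match those quoted above, and the helper names are ours, not part of any library.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_correlated(X, y, threshold=0.7):
    """Step 3: among highly correlated pairs, keep the covariate closest to the response."""
    corr = X.corr().abs()
    keep = list(X.columns)
    for i, a in enumerate(X.columns):
        for b in X.columns[i + 1:]:
            if a in keep and b in keep and corr.loc[a, b] > threshold:
                # Drop whichever of the pair is less correlated with the response
                weaker = a if abs(y.corr(X[a])) < abs(y.corr(X[b])) else b
                keep.remove(weaker)
    return X[keep]

def drop_high_vif(X, max_vif=5.0):
    """Step 4: iteratively remove the covariate with the largest VIF above the cutoff."""
    X = X.copy()
    while True:
        design = sm.add_constant(X)
        vifs = pd.Series(
            [variance_inflation_factor(design.values, i + 1) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() <= max_vif:
            return X
        X = X.drop(columns=[vifs.idxmax()])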

Finally, we will implement a stepwise selection procedure that can incorporate all the criteria discussed above (AIC, BIC, Mallows’ 𝐶ₚ, etc.) under either forward or backward strategies. This unified approach will allow us to compare models and select the one that best balances goodness of fit and complexity.

To illustrate the procedure, let us now present the dataset that will be used for the analysis.

4.1 Presentation of the Dataset

We use the Communities and Crime dataset from the UCI Machine Learning Repository, which contains socio-economic and demographic information about U.S. communities. The dataset includes more than 100 variables. The response variable is the number of violent crimes per population (violentCrimesPerPop). Our goal is to apply the model selection techniques discussed above to identify the covariates most strongly associated with this response.

4.2 Handling Missing Values

For this analysis, we remove all rows containing missing values.
An alternative strategy would be to:

  • Drop variables with a high proportion of missingness (e.g., >10%), and
  • Assess whether the remaining missing values are Missing At Random (MAR), Missing Completely at Random (MCAR), or Missing Not at Random (MNAR), applying an appropriate imputation method if necessary.

Here, however, we adopt the simpler approach of discarding all incomplete rows. After this step, the dataset contains no missing values and includes 103 variables in total: the response variable violentCrimesPerPop plus the covariates.
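
For reference, a minimal loading sketch; the local file name is hypothetical, and we assume the raw UCI file (communities.data), which has no header row and codes missing entries as "?".

import pandas as pd

# Hypothetical local copy of the UCI file; adjust the path and options to your data
df = pd.read_csv("communities.data", header=None, na_values="?")

print(df.shape)      # dimensions before cleaning
df = df.dropna()     # keep only complete rows, as in this analysis
print(df.shape)      # dimensions after removing incomplete rows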

4.3 Selection of Relevant Variables Using Expert Judgment

We then apply expert judgment to assess the relevance of each variable and determine whether its correlation with the response is meaningful. This requires consultation between the statistician and domain experts to understand the context and importance of each covariate.

For this dataset, we remove:

  • communityname (a categorical variable with many levels), and
  • fold (a technical variable used only for cross-validation).

After this filtering step, we retain 101 variables: the response violentCrimesPerPop and 100 covariates.

4.4 Reducing Covariates Using a Correlation Threshold

To further reduce dimensionality, we compute the correlation matrix of the covariates and the response. When several covariates are highly correlated with each other (correlation > 0.6), we retain only the one with the strongest correlation to the response. This procedure reduces redundancy while mitigating multicollinearity.

After applying this filtering and computing the Variance Inflation Factor (VIF), we retain a final set of 19 covariates with VIF values below 5.

Table 1: Variance inflation factors of the retained covariates.

These preprocessing steps are explained in greater detail in my article Feature Selection. Now, let us apply our selection procedure to identify the most relevant variables.

4.5 Model Selection with Stepwise Selection

With 19 variables, the total number of possible models is 2¹⁹ − 1 = 524,287, which is computationally burdensome to explore exhaustively. To reduce the search space, we use a stepwise selection procedure. We implement a function, stepwise_selection, that identifies the most relevant variables based on a chosen selection criterion and method (forward, backward, or mixed). In this example, we use Mallows’ 𝐶ₚ as the selection criterion and apply both forward and backward stepwise selection methods.
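
As an illustration of how the call might look (the stepwise_selection function is listed in the appendix; df_final denotes the cleaned DataFrame with the 19 retained covariates plus the response, named as in this article):

# Backward stepwise selection scored by Mallows' Cp
selected, best_model, history = stepwise_selection(
    df_final,
    target="violentCrimesPerPop",
    strategy="backward",
    metric="Cp",
    verbose=True,
)
print(selected)                 # retained covariates
print(best_model.summary())     # fitted OLS model on the selected subset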

4.5.1 Backward Stepwise Selection Using Mallows’ 𝐶ₚ

Applying backward selection with Mallows’ 𝐶ₚ, we proceed as follows:

  • Step 1: Remove pctWFarmSelf. Its exclusion reduces the criterion to 𝐶ₚ = 41.74, lower than the full model.
  • Step 2: Remove PctWOFullPlumb. This further decreases 𝐶ₚ to 41.69895.
  • Step 3: Remove indianPerCap. The criterion is reduced again to 𝐶ₚ = 41.66073.

In total, three variables are removed, yielding the final model.

4.5.2 Forward Stepwise Selection Using Mallows’ 𝐶ₚ

Forward stepwise selection is generally recommended when the number of variables is large, since it is less computationally demanding than backward selection. Starting from an empty model, variables are added sequentially, one at a time, according to the criterion improvement.

In this example, forward selection identifies the same set of variables as backward selection. Figure 1 below illustrates the sequence of variables added to the model, along with their corresponding 𝐶ₚ values. The process begins with PctKids2Par, followed by PctWorkMom, LandArea, and continues until the final model is reached, achieving a criterion value of 𝐶ₚ = 41.66.

Figure 1. Variable selection based on Mallows’ Cp criterion. The corresponding Python implementation is provided in the appendix.
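
For reference, the figure can be reproduced with the plot_stepwise_crosses helper listed in the appendix, applied to the history returned by a run with strategy="forward" (a sketch, assuming df_final and the forward-run objects exist):

# `history` is the step-by-step record returned by a forward run with metric="Cp"
plot_stepwise_crosses(
    history,
    all_vars=list(df_final.drop(columns=["violentCrimesPerPop"]).columns),
    metric="Cp",
    title="Forward stepwise selection (Mallows' Cp)",
)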

Warning! This procedure does not address the question of which variables are causes of the response variable: stepwise selection identifies association, not causation.

Conclusion

In this article, we addressed the question of model selection. The core principle of the procedure is to assign a score to each model in order to measure its quality, and then to search through the set of possible models to identify the one with the best score. This score is defined by balancing both the quality of fit and the complexity of the model.

Among the available procedures, we presented the stepwise forward and backward methods, which we implemented in Python. We applied them using different evaluation criteria: AIC, BIC, and Mallows’ Cp.

These methods, however, have a limitation: they explore a subset of all possible models. As a result, the selected models may sometimes represent oversimplifications of reality. Nevertheless, they remain very useful when the number of variables is large and exhaustive approaches become computationally too expensive.

Finally, when dealing with regression for predictive purposes, it is essential to split the dataset into two parts: a training set and a test set. Variable selection must be carried out exclusively on the training set, never on the test set, in order to ensure an honest evaluation of the model’s predictive performance.
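
A minimal sketch of this workflow with scikit-learn’s train_test_split (stepwise_selection comes from the appendix; df_final and the response column name are the objects used above):

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Split once, up front; every selection step below uses only the training part
train_df, test_df = train_test_split(df_final, test_size=0.2, random_state=0)

# Variable selection on the training set only
selected, best_model, history = stepwise_selection(
    train_df, target="violentCrimesPerPop", strategy="forward", metric="Cp"
)

# Honest evaluation on the held-out test set
X_test = sm.add_constant(test_df[selected], has_constant="add")
predictions = best_model.predict(X_test)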

Image Credits

All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.

References

Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.

Redmond, M. (2002). Communities and Crime [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C53W3X.

Cornillon, P. A., Hengartner, N., Matzner-Løber, E., & Rouvière, L. (2023). Régression avec R (3ème édition). EDP Sciences.

Data & Licensing

The dataset used in this article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This license allows anyone to share and adapt the dataset for any purpose, including commercial use, provided that proper attribution is given to the source.

For more details, see the official license text: CC BY 4.0.

Disclaimer

Any remaining errors or inaccuracies are the author’s responsibility. Feedback and corrections are welcome.

Appendix: Code

import numpy as np
import statsmodels.api as sm

def compute_score(y, X, vars_to_test, metric, full_model_mse=None):
    """Fit an OLS model on `vars_to_test` and return its score for the chosen metric (lower is better)."""
    X_train = sm.add_constant(X[vars_to_test])
    model = sm.OLS(y, X_train).fit()
    n = len(y)
    p = len(vars_to_test) + 1  # +1 for the intercept

    if metric == 'AIC':
        return model.aic

    elif metric == 'BIC':
        return model.bic

    elif metric == 'Cp':
        if full_model_mse is None:
            raise ValueError("full_model_mse doit être fourni pour calculer Cp Mallows.")
        rss = sum(model.resid ** 2)
        return rss + 2 * p * full_model_mse

    elif metric == 'R2_adj':
        return -model.rsquared_adj  # negated so that maximizing R2_adj becomes a minimization

    else:
        raise ValueError("Unknown metric. Use 'AIC', 'BIC', 'Cp' or 'R2_adj'.")

def get_best_candidate(y, X, selected, candidates, metric, strategy, full_model_mse=None):
    """Score every candidate addition (forward) or removal (backward) and return the best one."""
    scores_with_candidates = []
    for candidate in candidates:
        vars_to_test = selected + [candidate] if strategy == 'forward' else [var for var in selected if var != candidate]
        score = compute_score(y, X, vars_to_test, metric, full_model_mse)
        scores_with_candidates.append((score, candidate, vars_to_test))

    scores_with_candidates.sort()
    print("Suppressions testées:", [(v, round(s, 2)) for s, v, _ in scores_with_candidates])
    return scores_with_candidates[0] if scores_with_candidates else (None, None, None)

def stepwise_selection(df, target, strategy='forward', metric='AIC', verbose=True):
    """Forward or backward stepwise selection of the covariates in `df` that explain `target`."""
    if df.isnull().values.any():
        raise ValueError("Des valeurs manquantes sont présentes dans le DataFrame.")

    X = df.drop(columns=[target])
    y = df[target]
    variables = list(X.columns)

    selected = [] if strategy == 'forward' else variables.copy()
    remaining = variables.copy() if strategy == 'forward' else []

    # Pre-compute the error variance of the full model, needed for Mallows' Cp
    if metric == 'Cp':
        X_full = sm.add_constant(X)
        full_model = sm.OLS(y, X_full).fit()
        full_model_mse = sum(full_model.resid ** 2) / (len(y) - len(variables) - 1)
    else:
        full_model_mse = None

    # Score of the starting model: the full model for backward selection,
    # +inf for forward selection so that the first added variable is always accepted
    current_score = compute_score(y, X, selected, metric, full_model_mse) if strategy == 'backward' else np.inf
    history = []
    step = 0

    while True:
        step += 1
        candidates = remaining if strategy == 'forward' else selected
        best_score, best_candidate, vars_to_test = get_best_candidate(y, X, selected, candidates, metric, strategy, full_model_mse)

        if best_candidate is None:
            if verbose:
                print("No candidate available.")
            break

        if verbose:
            action = "add" if strategy == 'forward' else "remove"
            print(f"\nStep {step}: best variable to {action}: {best_candidate} (score={round(best_score, 5)})")


        improvement = best_score < current_score - 1e-6

        if improvement:
            if strategy == 'forward':
                selected.append(best_candidate)
                remaining.remove(best_candidate)
            else:
                selected.remove(best_candidate)

            current_score = best_score
            history.append({
                'step': step,
                'selected': selected.copy(),
                'score': current_score,
                'modified': best_candidate
            })
        else:
            if verbose:
                print("Aucune amélioration supplémentaire du score.")
            break

    X_final = sm.add_constant(X[selected])
    best_model = sm.OLS(y, X_final).fit()

    if verbose:
        print("\nSelected variables:", selected)
        final_score = best_model.aic if metric == 'AIC' else best_model.bic
        if metric == 'Cp':
            final_score = compute_score(y, X, selected, metric, full_model_mse)
        elif metric == 'R2_adj':
            final_score = -compute_score(y, X, selected, metric)
        print(f"Final score ({metric}): {round(final_score, 5)}")

    return selected, best_model, history
import matplotlib.pyplot as plt


def plot_stepwise_crosses(history, all_vars, metric="AIC", title=None):
    """
    Affiche le graphique stepwise type heatmap à croix :
    - X : variables explicatives modifiées à au moins une étape (ordre d'apparition)
    - Y : score (AIC/BIC) à chaque étape (de l'historique)
    - Croix noire : variable modifiée à chaque étape
    - Vide ailleurs
    - Courbe du score
    """
    n_steps = len(history)
    scores = [h['score'] for h in history]
    
    # Ordered list of the variables that were actually modified at some step
    modified_vars = []
    for h in history:
        var = h['modified']
        if var not in modified_vars and var is not None:
            modified_vars.append(var)
    
    n_mod_vars = len(modified_vars)
    
    # X positions of the crosses (indexed according to modified_vars)
    mod_pos = [modified_vars.index(h['modified']) if h['modified'] in modified_vars else None for h in history]

    fig, ax = plt.subplots(figsize=(min(1.3 * n_mod_vars, 8), 6))
    # Place a black cross at the step where each variable is modified
    for i, x in enumerate(mod_pos):
        if x is not None:
            ax.scatter(x, scores[i], color='black', marker='x', s=100, zorder=3)
    # Draw the score curve across the steps
    ax.plot(range(n_steps), scores, color='gray', alpha=0.7, linewidth=2, zorder=1)
    # X axis: vertical labels, smaller font (modified variables only)
    ax.set_xticks(range(n_mod_vars))
    ax.set_xticklabels(modified_vars, rotation=90, fontsize=10)
    ax.set_xlabel("Modified variables")
    ax.set_ylabel(metric)
    ax.set_title(title or f"Stepwise ({metric}) – variable modified at each step")
    ax.grid(True, axis='y', alpha=0.2)
    ax.grid(True, axis='y', alpha=0.2)
    plt.tight_layout()
    plt.show()
