Help Your Model Learn the True Signal

Imagine you're building a model to predict loan default risk using features such as income and credit history. A few borrowers with relatively low incomes seem to repay large loans just fine, which could mislead the model. In reality, they had submitted their income in US dollars rather than your local currency, but this was missed during data entry, making them appear less creditworthy than they actually were.

Or you’re building a model to predict patient recovery times. Most patients follow expected recovery trajectories, but a few experienced very rare complications that weren’t recorded. These cases sit far from the rest in terms of the relationship between symptoms, treatments, and outcomes. They’re not necessarily “wrong,” but they are disruptive, causing the model to generalise poorly to the majority of future patients.

In both scenarios, the issue isn’t just noise or classic anomalies. The problem is more subtle:

Some observations disproportionately disrupt the model’s ability to learn the dominant signal.

These data points may:

  • Have a disproportionate influence on the learned parameters,
  • Come from unusual or unmodeled contexts (e.g., rare complications or data entry issues),
  • And most importantly, reduce the model’s ability to generalise.

A model’s trustworthiness and predictive accuracy can be significantly compromised by these data points that exert undue influence on its parameters or predictions. Understanding and effectively managing these influential observations is not merely a statistical formality, but a cornerstone of building models that are robust and reliable.

🎯 What I Seek to Achieve in This Article

In this article, I’ll walk you through a simple yet powerful technique to effectively identify and manage these disruptive data points, so the model can better capture the stable, generalizable patterns in the data. This method is algorithm-agnostic, making it directly adaptable to any algorithm or analytical framework you’ve chosen for your use case. I will also offer the full code for you to implement it easily.

Sounds good? Let’s get started.


Inspiration: Cook’s Distance, Reimagined

Cook’s Distance is a classic diagnostic tool from linear regression. It quantifies how much a single data point influences the model by:

  • Training the model on the full dataset
  • Retraining it with one observation left out
  • Measuring how much the predictions change, by summing the squared differences between the fitted values with and without that observation, using the formula below:
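$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p \, s^2}$$

where $\hat{y}_j$ is the prediction for observation $j$ from the full model, $\hat{y}_{j(i)}$ is the prediction for observation $j$ after refitting without observation $i$, $p$ is the number of model parameters, and $s^2$ is the model's mean squared error.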

A large Cook’s Distance means an observation has high influence and may be distorting the model, so it should be checked for validity.

Why Cook’s D?

The Cook’s-D influence approach is uniquely suited to identifying data points that distort a model’s learned patterns, a gap often left by other outlier detection techniques.

  • Univariate Detection: Univariate methods (like Z-scores or IQR rules) identify extreme values within individual features or the target variable alone. However, points that significantly influence a complex model’s prediction may appear perfectly ordinary when each of their features is examined in isolation. They are “outliers” not by their individual values, but by their relationship to the overall data and the model’s structure.
  • Feature-Focused Anomaly Detection: Techniques such as Isolation Forest or Local Outlier Factor (LOF) excel at detecting anomalies purely based on the distribution and density of input features (X). While valuable for identifying unusual data entries, they inherently do not consider the role of the target variable (Y) or how a model uses features to predict it. Consequently, a data point flagged as an outlier in the feature space might not necessarily have a disproportionate impact on your model’s predictive performance or overall learned pattern. Conversely, a point not flagged by these methods might still be highly influential on the model’s predictions.
  • Standard Residual-Based Methods: Residuals, the difference between actual and predicted values, highlight where the model performs poorly. While this indicates a deviation, it doesn’t distinguish between whether the point is simply noisy (e.g., unpredictable but harmless) or truly disruptive, that is, “pulling” the model’s entire predictive surface away from the general pattern established by the majority of the data. We could have points with high residuals but little influence, or points with moderate residuals that disproportionately warp the model’s predictions.

This is where a Cook’s-D-style influence metric truly shines. It goes beyond the size of the prediction error to ask:

How structurally destabilizing is a single data point to the entire model’s learned relationships?

Such an approach enables us to surgically identify and manage data points that disproportionately pull the model’s predictions away from the “general pattern” reflected in the rest of the data.

This is crucial when robustness and generalisation are paramount yet hard to guarantee — for example, in diagnostic tools where a few unusual patient records could bias predictions for the wider population, or in fraud detection modelling, where the training set contains false negatives because not every transaction or claim has been audited.

In essence, while other methods help us find “weird” data, the Cook’s-like approach helps us find data points that make our model itself “weird” in its overall behaviour.


The Algorithm-Agnostic Adaptation of Cook’s D

Powerful as it is, this classic technique has its limitations:

  • The original formula applies directly only to Ordinary Least Squares (OLS) regression, and
  • For large datasets, it becomes computationally expensive because it requires repeated model fitting.

But the underlying logic is much broader. Following Cook’s idea, one can extend this foundational concept to any machine learning algorithm.

The Metric

The Core Idea: At its heart, this approach asks:

🔬 If we remove a single data point from the training set and retrain the model, how much do the predictions for all data points change compared to when that point was included?

Extensions beyond OLS: Researchers have developed modified versions of Cook’s D for other contexts. For example:

  • Generalised Cook’s Distance for GLMs (e.g., logistic regression, Poisson regression), which redefines leverage and residuals in terms of the model’s score and information matrix.
  • Cook’s Distance for linear mixed models, which accounts for both fixed and random effects.

Algorithm-agnostic approach: Here, we aim to adapt Cook’s core principle to work with any machine learning model, with a workflow like this:

  • Train your chosen model (e.g., LightGBM, Random Forest, Neural Network, Linear Regression, etc.) on the full dataset and record predictions.
  • For each data point in the dataset:
    • LOO (Leave-One-Out): Remove the data point to create a new dataset.
    • Retrain the model on this reduced dataset.
    • Predict outcomes for all observations in the original dataset.
    • Measure the divergence between the two sets of predictions. A direct analogue to Cook’s Distance is the mean squared difference in predictions; see the sketch after this list.
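To make this concrete, below is a minimal sketch of that workflow for any sklearn-style estimator. This is an illustration of the idea, not the package’s implementation; the function name loo_influence is mine.

import numpy as np
from sklearn.base import clone

def loo_influence(estimator, X, y):
    """Cook's-D-style influence scores via leave-one-out retraining.

    Assumes X and y are NumPy arrays and `estimator` is an unfitted
    sklearn-style regressor with fit/predict methods.
    """
    base_preds = clone(estimator).fit(X, y).predict(X)
    scores = np.empty(len(X))
    for i in range(len(X)):
        mask = np.arange(len(X)) != i  # leave observation i out
        reduced = clone(estimator).fit(X[mask], y[mask])
        # Mean squared change in predictions over the *original* dataset
        scores[i] = np.mean((reduced.predict(X) - base_preds) ** 2)
    return scores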

Addressing the Computational Challenge

Another limitation of this powerful metric is its computational cost, as it requires N full model retrainings. For large datasets, this can be prohibitively expensive.

To make the method practical, we can make a strategic compromise: instead of processing every single observation, focus on a subset of data points selected for their high absolute residuals under the initial full model. This concentrates the computationally intensive step on the most likely influential candidates.

💡Pro Tip: Add a max_loo_points (integer, optional) parameter to your implementation. If specified, the LOO calculation is performed only for that many data points. This provides a balance between thoroughness and computational efficiency; a sketch of the pre-filter follows.
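Continuing the sketch above and reusing base_preds from it (variable names are illustrative, not from any library), the residual-based pre-filter could look like this:

# Illustrative pre-filter: run LOO only for the max_loo_points
# observations with the largest absolute residuals under the full model.
residuals = np.abs(y - base_preds)
candidate_idx = np.argsort(residuals)[::-1][:max_loo_points]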

Smart Detection of Influential Points

Once the influence scores have been calculated, let’s identify specific influential points that warrant further investigation and management. The detection strategy should adapt based on whether we’re working with the full dataset or a subset (when max_loo_points is set):

💡Pro Tip: Add influence_outlier_method and influence_outlier_threshold parameters to your implementation so it’s easy to specify the most appropriate detection approach for each use case.

Full Dataset Analysis:

When analysing the complete dataset, the influence scores give a comprehensive picture of each point’s impact on the model’s learned patterns. This allows us to leverage a wide variety of distribution-based detection methods (a code sketch of these rules follows the list):

  • Percentile Method (influence_outlier_method="percentile"):
    • Selects points above a percentile threshold
    • Example: threshold=95 identifies points in the top 5% of influence scores
    • Good for: Maintaining a consistent proportion of influential points
  • Z-Score Method (influence_outlier_method="zscore"):
    • Selects points beyond N standard deviations from the mean
    • Example: threshold=3 flags points more than 3 standard deviations away
    • Good for: Normal or approximately normal distributions
  • Top K Method (influence_outlier_method="top_k"):
    • Selects the K points with the highest influence scores
    • Example: threshold=50 selects the 50 most influential points
    • Good for: When you need a specific number of points to investigate
  • IQR Method (influence_outlier_method="iqr"):
    • Selects points above the Q3 + k * IQR threshold
    • Example: threshold=1.5 uses the standard boxplot outlier definition
    • Good for: Skewed distributions; robust to extreme scores
  • Mean Multiple Method (influence_outlier_method="mean_multiple"):
    • Selects points with influence scores > N times the mean score
    • Example: threshold=3 implements the recommendation from the literature (e.g., Tranmer, Murphy, Elliot, & Pampaka, 2020)
    • Good for: Following established statistical practices, especially when using linear models
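As a sketch, the five rules above could be implemented along these lines; this mirrors the behaviour described, but it is not the package’s source code:

import numpy as np

def flag_influential(scores, method="percentile", threshold=95):
    """Illustrative versions of the five detection rules above."""
    scores = np.asarray(scores)
    if method == "percentile":
        return np.where(scores > np.percentile(scores, threshold))[0]
    if method == "zscore":
        z = (scores - scores.mean()) / scores.std()
        return np.where(z > threshold)[0]
    if method == "top_k":
        return np.argsort(scores)[-int(threshold):]
    if method == "iqr":
        q1, q3 = np.percentile(scores, [25, 75])
        return np.where(scores > q3 + threshold * (q3 - q1))[0]
    if method == "mean_multiple":
        return np.where(scores > threshold * scores.mean())[0]
    raise ValueError(f"Unknown method: {method}")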

Subset Analysis:

For computational efficiency with large datasets, we can set max_loo_points to analyse only a subset of points:

  • Initial Filtering:
    • Uses absolute residuals to identify n = max_loo_points candidate points
    • Only these candidates are evaluated for their influence scores
    • Remaining points (with lower residuals) are implicitly considered non-influential
  • Available Methods:
    • Percentile: Select top percentage of points (capped at max_loo_points)
    • Top K: Select K most influential points (K ≤ max_loo_points)
    • Note: Other distribution-based methods (z-score, IQR) are not applicable here due to the pre-filtered nature of the scores.

This flexible approach allows users to choose the most appropriate detection method based on:

  • The dataset size and computational constraints
  • The distribution characteristics of the influence scores
  • Specific requirements for the number of points to investigate

Diagnostic Visuals

💡Pro Tip: The detection of influential observations should be seen as a starting point for investigation 🔍 rather than an automatic removal criterion 🗑️

Each flagged point deserves careful examination within the context of the specific use case. Some of these points may be high-leverage but valid representations of unusual phenomena — removing them could hurt performance. Others could be data errors or noise — these are the ones we’d want to filter out. To assist with decision-making on influential points, the code below provides comprehensive diagnostic visualisations to support the investigation:

  • Influence Score Distribution
    • Shows the distribution of influence scores across all points
    • Highlights the threshold used for flagging influential points
    • Helps assess if the influential points are clear outliers or part of a continuous spectrum
  • Target Distribution View
    • Shows the overall distribution of the target variable
    • Highlights influential points with distinct markers
    • Helps identify if influential points are concentrated in specific value ranges
  • Feature-Target Relationships
    • Creates scatter plots for each feature against the target
    • Automatically adapts visualisation for categorical features
    • Highlights influential points to reveal potential feature-specific patterns
    • Helps understand if influence is driven by specific feature values or combinations

These visualisations can guide several key decisions:

  • Whether to treat influential points as errors requiring removal
  • Whether to collect additional observations in similar regions so the model can learn to handle relevant influential points
  • Whether the influence patterns suggest underlying data quality issues
  • Whether the influential points represent valuable edge cases worth preserving
  • Which detection method and threshold best filter out influential points for this use case, given the influence score distribution

All in all, the visual diagnostics, combined with domain expertise, enable more informed decisions about how to handle influential observations in your specific context.


Source Code & Demo

This approach, including all the functionalities discussed above, has been implemented as the utility function calculate_cooks_d_like_influence in the stats_utils module of the MLarena Python package, with the source code available on GitHub 🔗. Now let’s see this function in action 😎.

Synthetic Data with Built-In Disruptors

I created a synthetic dataset of housing prices as a function of age, size, and number of bedrooms, then split it into train (n=800) and test (n=200) sets. In the code below, I planted 50 disruptors into the training set (the full demo code is available in this notebook in the same repo).

# Plant different types of currency errors
# (disruptive_indices: 50 randomly chosen row positions, defined earlier in the notebook)
n_disruptive = 50
for i, idx in enumerate(disruptive_indices):
    if i < n_disruptive // 2:  # Currency conversion error: prices too low
        y_with_disruptors.iloc[idx] = y_with_disruptors.iloc[idx] * 0.5
    else:  # Currency conversion error: prices too high (different scale)
        y_with_disruptors.iloc[idx] = y_with_disruptors.iloc[idx] * 1.5

The distribution of the original dataset and the new dataset with the planted disruptive points.

Calculate the Influence Score

Now, let’s calculate the influence score for all observations in the training set. As discussed above, the calculate_cooks_d_like_influence function is an algorithm-agnostic solution; it accepts any sklearn-style regressor that provides the fit and predict methods. For example, in the code below, I passed in LinearRegression as the estimator.

from sklearn.linear_model import LinearRegression

from mlarena.utils.stats_utils import calculate_cooks_d_like_influence

influence_scores, influential_indices, normal_indices = calculate_cooks_d_like_influence(
    model_class=LinearRegression,
    X=X_with_disruptors,
    y=y_with_disruptors,
    visualize=True,
    influence_outlier_method="percentile",
    influence_outlier_threshold=95,
    random_state=42,
)

In the code above, I also set the detection method for influential points to percentile. Because the training set contains 800 samples, the 95th-percentile cutoff gave us 40 influential points. As shown in the Distribution of Influence Scores plot below, most observations cluster around small influence values, but a handful stand out with much larger scores. This is expected, since we deliberately planted 50 disruptors in the dataset. Follow-up analysis, available in the linked notebook in the repo, confirms that the top 50 most influential points align exactly with our 50 planted disruptors. 🥂

The top 5% high-influence points are highlighted in the Target Distribution plot below. Consistent with how we planted these disruptors, only some of these observations can be considered univariate outliers.

The scatterplots below show the relationship between each feature and the target variable, with influential points highlighted in red. These diagnostic plots serve as powerful tools for analysing influential observations and shaping informed decisions about their treatment by facilitating discussions regarding key questions such as:

  1. Are these points rare but valid cases that should be preserved to maintain important edge cases?
  2. Do these points indicate areas where additional data collection would be beneficial to better represent the full range of scenarios?
  3. Do these points represent errors or outliers that, if removed, would help the model learn more generalizable patterns?

Focused Search and Easy Change of Algorithms

Next, let’s test the function on another algorithm, the LightGBM regressor. As shown in the code below, you can easily configure the algorithm via the model_params parameter.

In addition, by setting max_loo_points, we can optimise the computation by focusing only on the most promising candidates. For example, instead of performing leave-one-out (LOO) analysis on all 800 training points, we can configure the function to select the 200 points with the highest absolute residuals. This effectively targets the search at the ‘danger zone’ where influential points are most likely to be found.

You can also specify the method and threshold for identifying influential points that are most suitable for your use case. In the code below, I chose the top_k method to identify the 50 most influential points based on their influence scores.

import lightgbm as lgb

model_params = {"verbose": -1, "n_estimators": 50}

influence_scores, influential_indices, normal_indices = calculate_cooks_d_like_influence(
    model_class=lgb.LGBMRegressor,
    X=X_with_disruptors,
    y=y_with_disruptors,
    visualize=True,
    max_loo_points=200,  # Focus LOO on the 200 highest-residual points
    influence_outlier_method="top_k",
    influence_outlier_threshold=50,
    random_state=42,
    **model_params,
)

Retrain Using the Cleaned Data

Suppose that, after careful investigation of the influential points, you decide to remove them from your training set and retrain the model. Below is the code to obtain the cleaned training set using the normal_indices conveniently returned by the calculate_cooks_d_like_influence call above.

X_clean = X_with_disruptors.iloc[normal_indices]
y_clean = y_with_disruptors.iloc[normal_indices]

In addition, if you’d like to check the impact of the cleaning on different algorithms, you can swap algorithms easily using MLarena, as shown below.

from mlarena import MLPipeline, PreProcessor

# Train model on the training set
# (swap in X_clean, y_clean to compare against the cleaned set)
pipeline = MLPipeline(
    model=lgb.LGBMRegressor(n_estimators=100, random_state=42, verbose=-1),
    # model=LinearRegression(),  # swap algorithms easily
    # model=RandomForestRegressor(n_estimators=50, random_state=42),
    preprocessor=PreProcessor()
)
pipeline.fit(X_train, y_train)

# Evaluate on test set
results = pipeline.evaluate(
    X_test, y_test
)

Comparison Across Algorithms

We can easily loop the workflow above over the disrupted and cleaned training sets and across different algorithms (a sketch of such a loop is below). Please see the performance comparisons in the following plot.
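For reference, here is a minimal sketch of such a comparison loop, assuming the MLPipeline API from the previous section; the exact structure of the results returned by evaluate may differ:

from sklearn.ensemble import RandomForestRegressor

# Illustrative comparison loop; assumes X_test, y_test, and the
# variables defined earlier in the demo.
datasets = {
    "disrupted": (X_with_disruptors, y_with_disruptors),
    "cleaned": (X_clean, y_clean),
}
models = {
    "LightGBM": lambda: lgb.LGBMRegressor(n_estimators=100, random_state=42, verbose=-1),
    "Linear Regression": lambda: LinearRegression(),
    "Random Forest": lambda: RandomForestRegressor(n_estimators=50, random_state=42),
}
for data_name, (X_tr, y_tr) in datasets.items():
    for model_name, make_model in models.items():
        pipeline = MLPipeline(model=make_model(), preprocessor=PreProcessor())
        pipeline.fit(X_tr, y_tr)
        results = pipeline.evaluate(X_test, y_test)
        print(data_name, model_name, results)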

In our demo, Linear Regression shows a modest improvement, primarily due to the linear nature of our synthetic data. In practice, it is always worthwhile experimenting with different algorithms to find the most suitable approach for your use case. Experimenting with or migrating between algorithms doesn’t need to be disruptive; more on the algorithm-agnostic ML workflow in this article 🔗.


There you have it, the helper function calculate_cooks_d_like_influence that you can conveniently add to your ML workflow to identify influential observations. While our demonstration used synthetic data with deliberately planted disruptors, real-world applications require much more nuanced investigation. The diagnostic visualisations provided by this function are designed to facilitate careful analysis and meaningful discussions about influential points.

  • Each influential point might represent a valid edge case in your domain
  • Patterns in influential points could reveal important gaps in your training data
  • The decision to remove or retain points should be based on domain expertise and business context

🔬 Think of this function as a diagnostic tool that highlights areas for investigation, not as an automatic outlier removal mechanism. Its true value lies in helping you understand your data better so your model can learn and generalise better 🏆.


I write about data, ML, and AI for problem-solving. You can also find me on 💼LinkedIn | 😺GitHub | 🕊️Twitter | 🤗Hugging Face


Unless otherwise noted, all images are by the author.
