Fighting Fraud Fairly: Upgrade Your AI Toolkit 🧰 | by César Ortega Quintero | Jan, 2025

Understanding the vehicle insurance fraud detection dataset

For this exercise, we will work with a publicly available vehicle insurance fraud detection dataset [31], which contains 15,420 observations and 33 features. The target variable, FraudFound_P, labels whether a claim is fraudulent, with 933 observations (5.98%) identified as fraud-related. The dataset includes a range of potential predictors, such as:

  • Demographic and policy-related features: gender, age, marital status, vehicle category, policy type, policy number, driver rating.
  • Claim-related features: day of week claimed, month claimed, days policy claim, witness present.
  • Policy-related features: deductible, make, vehicle price, number of endorsements.

Among these, gender and age are considered protected attributes, which means we need to pay special attention to how they may influence the model’s predictions. Understanding the dataset’s structure and identifying potential sources of bias are essential.
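To make the setup concrete, a minimal sketch of loading the data and inspecting the class balance is shown below. The file name fraud_oracle.csv and the Sex column are assumptions about the public dataset and may need adjusting for your copy; the PA_Female and PA_Male indicator columns created here are the ones used for bias measurement later on.

import pandas as pd

# Load the vehicle insurance fraud dataset
# (the file name and the 'Sex' column are assumptions; adjust to your copy of the data)
df = pd.read_csv("fraud_oracle.csv")

# Inspect the class imbalance of the target variable (roughly 94% / 6% expected)
print(df["FraudFound_P"].value_counts(normalize=True))

# Encode the protected attribute as two indicator columns,
# matching the PA_Female / PA_Male groups used later for bias measurement
df["PA_Female"] = (df["Sex"] == "Female").astype(int)
df["PA_Male"] = (df["Sex"] == "Male").astype(int)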

The business challenge

The goal of this exercise is to build a machine learning model to identify potentially fraudulent motor insurance claims. Fraud detection can significantly improve claim handling efficiency, reduce investigation costs, and minimize losses paid out on fraudulent claims. However, the dataset presents a significant challenge due to its severe class imbalance, with only 5.98% of the claims labeled as fraudulent.

In the context of fraud detection, false negatives (i.e., missed fraudulent claims) are particularly expensive, as they result in financial losses and investigation delays. To address this, we will prioritize the recall metric for identifying the positive class (FraudFound_P = 1). Recall measures the ability of the model to capture fraudulent claims, even at the expense of precision, ensuring that as many fraudulent claims as possible are identified and handled in a timely fashion by analysts in the fraud team.

Baseline model

Here, we build the initial fraud detection model using a set of predictors that includes demographic and policy-related features, with an emphasis on the gender attribute. For the purposes of this exercise, gender has been explicitly included as a predictor to intentionally introduce bias into the model, since excluding it would yield a baseline that is not biased. In a real-world setting with a more comprehensive dataset, indirect proxies for sensitive attributes are usually present, and models commonly pick them up inadvertently, producing biased predictions even when the sensitive attributes themselves are not directly included.

In addition, we excluded age as a predictor, aligning with the individual fairness approach known as “fairness through unawareness,” where we intentionally remove any sensitive attributes that could lead to discriminatory outcomes.
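Continuing the sketch above, one hypothetical way to build the feature matrix under these choices is shown below: age-related columns are dropped, while the gender indicators are deliberately kept. The column names and the train/test split settings are assumptions, not the author's exact preprocessing.

from sklearn.model_selection import train_test_split
import pandas as pd

# Drop the target and age-related columns ("fairness through unawareness" for age);
# the gender indicators PA_Female / PA_Male remain as deliberate predictors.
# Column names are assumptions; drop any other age proxies present in your copy.
features = df.drop(columns=["FraudFound_P", "Age", "Sex"])
X = pd.get_dummies(features, drop_first=True)  # one-hot encode categorical predictors
y = df["FraudFound_P"]

# Stratified split to preserve the 5.98% fraud rate in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)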

In the following image, we present the Classification Results, Distribution of Predicted Probabilities, and Lift Chart for the baseline model using the XGBoost classifier with a custom threshold of 0.1 (y_prob >= threshold) to identify predicted positive fraudulent claims. This model will serve as a starting point for measuring and mitigating bias, which we will explore in later sections.

Baseline model results. Image compiled by the author
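A minimal sketch of how such a baseline and its thresholded predictions could be produced is given below. It assumes the X_train/X_test and y_train/y_test objects from the previous sketch and uses illustrative XGBoost settings; it is not the author's exact training code.

from xgboost import XGBClassifier
from sklearn.metrics import classification_report

# Train a baseline XGBoost classifier (illustrative settings)
baseline = XGBClassifier(eval_metric='logloss')
baseline.fit(X_train, y_train)

# Score the test set and apply the custom decision threshold of 0.1
y_prob = baseline.predict_proba(X_test)[:, 1]  # probability of the fraud class
threshold = 0.1
y_pred = (y_prob >= threshold).astype(int)

# Recall for FraudFound_P = 1 is the metric we prioritize
print(classification_report(y_test, y_pred, digits=3))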

Based on the classification results and visualizations presented above, we can see that the model reaches a recall of 86%, which is in line with our business requirements. Since our primary goal is to identify as many fraudulent claims as possible, high recall is crucial. The model correctly identifies most of the fraudulent claims, even though the precision for fraudulent claims (17%) is much lower. This trade-off is acceptable given that high recall ensures that the fraud investigation team can focus on most fraudulent claims, minimizing potential financial losses.

The distribution of predicted probabilities shows a significant concentration of predictions near zero, indicating that the model classifies most claims as non-fraudulent. This is expected given the highly imbalanced nature of the dataset (fraudulent claims represent only 5.98% of the total). Moreover, the Lift Chart highlights that focusing on the top deciles provides significant gains in identifying fraudulent claims. The model's ability to concentrate fraud detection in the higher deciles (with a lift of 3.5x in the 10th decile) supports the business objective of prioritizing the investigation of claims that are most likely to be fraudulent, increasing the efficiency of the fraud detection team's efforts.
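For intuition, the lift values in the chart can be approximated by ranking test claims by predicted probability, splitting them into deciles, and comparing each decile's fraud rate with the overall rate. The sketch below is illustrative and reuses the y_test and y_prob names from the baseline sketch.

import numpy as np
import pandas as pd

# Rank claims by predicted fraud probability and split them into deciles (1 = lowest, 10 = highest)
lift_df = pd.DataFrame({"y_true": np.asarray(y_test), "y_prob": y_prob})
lift_df["decile"] = pd.qcut(lift_df["y_prob"].rank(method="first"), 10, labels=False) + 1

# Lift = fraud rate within the decile divided by the overall fraud rate
overall_rate = lift_df["y_true"].mean()
lift_by_decile = lift_df.groupby("decile")["y_true"].mean() / overall_rate
print(lift_by_decile)  # the 10th decile should show a lift of roughly 3.5x, per the chart above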

These results align with the business goal of improving fraud detection efficiency while minimizing costs associated with investigating non-fraudulent claims. The recall value of 86% ensures that we are not missing a large portion of fraudulent claims, while the lift chart allows us to prioritize resources effectively.

Measuring bias

Based on the XGBoost classifier, we evaluate the potential bias in our fraud detection model using binary metrics from the Holistic AI library. The code snippet below illustrates this.

from holisticai.bias.metrics import classification_bias_metrics
from holisticai.bias.plots import bias_metrics_report

# Define protected attributes (group_a and group_b)
group_a_test = X_test['PA_Female'].values
group_b_test = X_test['PA_Male'].values

# Evaluate bias metrics with the custom threshold
metrics = classification_bias_metrics(group_a=group_a_test, group_b=group_b_test, y_pred=y_pred, y_true=y_test)

print("Bias Metrics with Custom Threshold:\n", metrics)
bias_metrics_report(model_type='binary_classification', table_metrics=metrics)

Given the nature of the dataset and the business challenge, we focus on Equality of Opportunity metrics to ensure that individuals from both groups have equal chances of being correctly classified based on their true characteristics. Specifically, we aim to ensure that errors in prediction, such as false positives or false negatives, are distributed evenly across groups. This way, no group experiences disproportionately more errors than others, which is essential for achieving fairness in decision-making. For this exercise, we focus on the gender attribute (male and female), which is intentionally included as a predictor in the model to assess its impact on fairness.

Bias metrics report of baseline model. Image compiled by the author

The Equality of Opportunity bias metrics generated using a custom threshold of 0.1 for classification are presented below.

  • Equality of Opportunity Difference: -0.126
    This metric directly evaluates whether the true positive rate is equal across the groups. A negative value suggests that females are slightly less likely to be correctly classified as fraudulent compared to males, indicating a potential bias favoring males in correctly identifying fraud.
  • False Positive Rate Difference: -0.076
    The False Positive Rate difference is within the fair interval [-0.1, 0.1], indicating no significant disparity in the false positive rates between groups.
  • Average Odds Difference: -0.101
    Average odds difference measures the balance of true positive and false positive rates across groups. A negative value here suggests that the model might be slightly less accurate in identifying fraudulent claims for females than for males.
  • Accuracy Difference: 0.063
    The Accuracy difference is within the fair interval [-0.1, 0.1], indicating minimal bias in overall accuracy between groups.

There are small but notable disparities in the Equality of Opportunity and Average Odds differences, with females being slightly less likely to be correctly classified as fraudulent. This points to a potential area for improvement, where further steps could be taken to reduce these biases and enhance fairness for both groups.
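For intuition, these group-level differences can be cross-checked by hand from per-group confusion matrices. The sketch below is a simplified illustration, not the library's internal implementation; it reuses group_a_test, group_b_test, y_test, and y_pred from the earlier snippet and assumes the differences are reported as group_a (female) minus group_b (male).

import numpy as np
from sklearn.metrics import confusion_matrix

def tpr_fpr(y_true, y_pred):
    # True positive rate and false positive rate from a 2x2 confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), fp / (fp + tn)

mask_a = group_a_test.astype(bool)  # female group
mask_b = group_b_test.astype(bool)  # male group

tpr_a, fpr_a = tpr_fpr(np.asarray(y_test)[mask_a], y_pred[mask_a])
tpr_b, fpr_b = tpr_fpr(np.asarray(y_test)[mask_b], y_pred[mask_b])

print("Equality of Opportunity Difference:", tpr_a - tpr_b)                 # TPR gap
print("False Positive Rate Difference:", fpr_a - fpr_b)                     # FPR gap
print("Average Odds Difference:", ((tpr_a - tpr_b) + (fpr_a - fpr_b)) / 2)  # average of both gaps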

As we proceed in the next sections, we’ll explore techniques for mitigating this bias and improving fairness, while striving to maintain model performance.

Mitigating bias

To mitigate the bias observed in the baseline model, the binary mitigation algorithms included in the Holistic AI library were tested. These algorithms fall into three categories (a simplified comparison sketch for the pre-processing group follows the list):

  • Pre-processing methods aim to modify the input data such that any model trained on it no longer exhibits biases. These methods adjust the data distribution to ensure fairness before training begins. The algorithms evaluated were Correlation Remover, Disparate Impact Remover, Learning Fair Representations, and Reweighing.
  • In-processing methods alter the learning process itself, directly influencing the model during training to ensure fairer predictions. These methods aim to achieve fairness during the optimization process. The algorithms evaluated were Adversarial Debiasing, Exponentiated Gradient, Grid Search Reduction, Meta Fair Classifier, and Prejudice Remover.
  • Post-processing methods adjust the model's predictions after it has been trained, ensuring that the final predictions satisfy some statistical measure of fairness. The algorithms evaluated were Calibrated Equalized Odds, Equalized Odds, LP Debiaser, ML Debiaser, and Reject Option.
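As referenced above, a simplified sketch of how such a comparison could be assembled is shown below. For brevity it sweeps a single pre-processing mitigator (the Disparate Impact Remover at two repair levels) rather than the full list of algorithms, and it reuses the holisticai Pipeline pattern and variable names that appear in the snippet later in this section; it is not the author's full benchmarking code.

from holisticai.bias.metrics import classification_bias_metrics
from holisticai.bias.mitigation import DisparateImpactRemover
from holisticai.pipeline import Pipeline
from sklearn.metrics import recall_score, precision_score
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

results = []
# Only one pre-processing mitigator is swept here (at two repair levels);
# the same loop structure can be reused for the other algorithms in the comparison.
for name, mitigator in [("DIR (repair=0.5)", DisparateImpactRemover(repair_level=0.5)),
                        ("DIR (repair=1.0)", DisparateImpactRemover(repair_level=1.0))]:
    pipeline = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('bm_preprocessing', mitigator),
        ('estimator', XGBClassifier(eval_metric='logloss'))
    ])
    pipeline.fit(X_train_processed, y_train,
                 bm__group_a=group_a_train, bm__group_b=group_b_train)
    y_prob = pipeline.predict_proba(X_test_processed,
                                    bm__group_a=group_a_test, bm__group_b=group_b_test)[:, 1]
    y_pred = (y_prob >= 0.1).astype(int)

    bias = classification_bias_metrics(group_a=group_a_test, group_b=group_b_test,
                                       y_pred=y_pred, y_true=y_test)
    results.append((name, recall_score(y_test, y_pred), precision_score(y_test, y_pred), bias))

# Collect performance and fairness metrics side by side for the comparison table
for name, rec, prec, bias in results:
    print(f"{name}: recall={rec:.3f}, precision={prec:.3f}")
    print(bias)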

The results from applying the various mitigation algorithms, focusing on key performance and fairness metrics, are presented in the accompanying table.

Comparison table of mitigation algorithm results. Table compiled by the author

While none of the algorithms tested outperformed the baseline model, the Disparate Impact Remover (a pre-processing method) and Equalized Odds (a post-processing method) showed promising results. Both improved the fairness metrics significantly, but neither came as close to the baseline model's performance as expected. However, I found that adjusting the probability threshold for the Disparate Impact Remover and Equalized Odds made it possible to match baseline performance while keeping the equality of opportunity bias metrics within the fair interval.

Following academic recommendations stating that post-processing methods can be substantially sub-optimal (Woodworth et al., 2017)[32], in that they act on the model only after it has been learned and can lead to greater performance degradation than other methods (Ding et al., 2021)[33], I decided to prioritize the Disparate Impact Remover pre-processing algorithm over the post-processing Equalized Odds method. The code snippet below illustrates this process.

from holisticai.bias.mitigation import (AdversarialDebiasing, ExponentiatedGradientReduction, GridSearchReduction, MetaFairClassifier,
    PrejudiceRemover, CorrelationRemover, DisparateImpactRemover, LearningFairRepresentation, Reweighing,
    CalibratedEqualizedOdds, EqualizedOdds, LPDebiaserBinary, MLDebiaser, RejectOptionClassification)
from holisticai.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Step 1: Define the Disparate Impact Remover (Pre-processing)
mitigator = DisparateImpactRemover(repair_level=1.0)  # Repair level: 0.0 (no change) to 1.0 (full repair)

# Step 2: Define the XGBoost model
model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# Step 3: Create a pipeline with Disparate Impact Remover and XGBoost
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),     # Standardize the data
    ('bm_preprocessing', mitigator),  # Apply bias mitigation
    ('estimator', model)              # Train the XGBoost model
])

# Step 4: Fit the pipeline
pipeline.fit(
    X_train_processed, y_train,
    bm__group_a=group_a_train, bm__group_b=group_b_train  # Pass sensitive groups
)

# Step 5: Make predictions with the pipeline
y_prob = pipeline.predict_proba(
    X_test_processed,
    bm__group_a=group_a_test, bm__group_b=group_b_test
)[:, 1]  # Probability for the positive class

# Step 6: Apply a custom threshold
threshold = 0.03
y_pred = (y_prob >= threshold).astype(int)

We further customized the Disparate impact remover algorithm by lowering the probability threshold, aiming to improve model fairness while maintaining key performance metrics. This adjustment was made to explore the potential impact on both model performance and bias mitigation.
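One hypothetical way to explore this adjustment is to sweep candidate thresholds and inspect recall, precision, and the equality-of-opportunity gap at each value. The sketch below reuses y_prob, y_test, and the group arrays from the previous snippets and computes the gap as the female-minus-male difference in true positive rates, which is an assumption about the reporting convention.

import numpy as np
from sklearn.metrics import recall_score, precision_score

mask_a = group_a_test.astype(bool)  # female group
mask_b = group_b_test.astype(bool)  # male group
y_true = np.asarray(y_test)

# Sweep candidate thresholds and inspect the recall / precision / fairness trade-off
for t in [0.10, 0.07, 0.05, 0.03, 0.01]:
    y_pred_t = (y_prob >= t).astype(int)
    eo_diff = (recall_score(y_true[mask_a], y_pred_t[mask_a])
               - recall_score(y_true[mask_b], y_pred_t[mask_b]))
    print(f"threshold={t:.2f}  recall={recall_score(y_true, y_pred_t):.3f}  "
          f"precision={precision_score(y_true, y_pred_t):.3f}  EO diff={eo_diff:.3f}")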

Results of de-biased model (Disparate Impact Remover). Image compiled by the author

The results show that by adjusting the threshold from 0.1 to 0.03, we significantly improved recall for fraudulent claims (from 0.528 to 0.863), but at the cost of precision (which dropped from 0.225 to 0.172). This aligns with the business objective of minimizing undetected fraudulent claims, despite an increase in false positives. The trade-off is acceptable: lowering the threshold increases the model's sensitivity (higher recall) but produces more false positives (lower precision). However, the overall accuracy of the model is only slightly affected (dropping from 0.725 to 0.716), reflecting the broader trade-off between recall and precision that often accompanies threshold adjustments on imbalanced datasets such as fraud detection.

Bias metrics report for de-biased model (Disparate Impact Remover). Image compiled by the author

The equality of opportunity bias metrics show minimal impact after adjusting the threshold to 0.03. The Equality of opportunity difference remains within the fair interval at -0.070, indicating that the model still provides equal chances of being correctly classified for both groups. The False positive rate difference of -0.041 and the Average odds difference of -0.056 both stay within the acceptable range, suggesting no significant bias favoring one group over the other. The Accuracy difference of 0.032 also remains small, confirming that the model’s overall accuracy is not disproportionately affected by the threshold adjustment. These results demonstrate that the fairness of the model, in terms of equality of opportunity, is well-maintained even with the threshold change.

Moreover, adjusting the probability threshold is necessary when working with imbalanced datasets such as fraud detection. The distribution of predicted probabilities will change with each mitigation strategy applied, and thresholds should be reviewed and adapted accordingly to balance both performance and fairness, as well as other dimensions not considered in this article (e.g., explainability or privacy). The choice of threshold can significantly influence the model’s behavior, and final decisions should be carefully adjusted based on business needs.

In conclusion, the Disparate impact remover with a threshold of 0.03 offers a reasonable compromise, improving recall for fraudulent claims while maintaining fairness in equality of opportunity metrics. This strategy aligns with both business objectives and fairness considerations, making it a viable approach for mitigating bias in fraud detection models.
