algorithms assume you’re working with completely unlabeled data.
But if you’ve actually worked on these problems, you know the reality is often different. In practice, anomaly detection tasks often come with at least a few labeled examples, perhaps from past investigations, or from a subject matter expert who flagged a couple of anomalies to help define the problem more clearly.
In these situations, ignoring those valuable labeled examples and sticking with purely unsupervised methods means leaving money on the table.
So the question is, how can we actually make use of those few labeled anomalies?
If you search the academic literature, you will find it is full of clever solutions, especially with all the new deep learning methods coming out. But let’s be real, most of those solutions require adopting entirely new frameworks with steep learning curves. They usually involve a painful amount of unintuitive hyperparameter tuning, and still might not perform well on your specific dataset.
In this post, I want to share three practical strategies that you can start using right away to boost your anomaly detection performance. No fancy frameworks required. I’ll also walk through a concrete example on fraud detection data so you can see how one of these approaches plays out in practice.
By the end, you’ll have several actionable methods for making better use of your limited labeled data, plus a real-world implementation you can adapt to your own use cases.
1. Threshold Tuning
Let’s start with the lowest-hanging fruit.
Most unsupervised models output a continuous anomaly score. It’s entirely up to you to decide where to draw the line to distinguish the “normal” and “abnormal” classes.
This is an important step for a practical anomaly detection solution, as selecting the wrong threshold can result in either missing critical anomalies or overwhelming operators with false alarms. Luckily, those few labeled abnormal examples can provide some guidance in properly setting this threshold.
The key insight is that you can use those labeled anomalies as a validation set to quantify detection performance under different threshold choices.
Here’s how this works in practice:
Step (1): Proceed with your usual model training & thresholding on the dataset excluding those labeled anomalies. If you have curated a pure normal dataset, you might want to set the threshold as the maximum anomaly score observed in the normal data. If you are working with unlabeled data, you can set the threshold by choosing a percentile (e.g., 95th or 99th percentile) that corresponds to your tolerated false positive rate.
Step (2): Now bring the labeled anomalies back in as a validation set: score them with the trained model and calculate concrete detection metrics under your chosen threshold. These include recall (what percentage of known anomalies would be caught), precision, and recall@k (useful when you can only investigate the top k alerts). These metrics give you a quantitative measure of whether your current threshold yields acceptable detection performance.
💡Pro Tip: If the number of labeled anomalies is small, the estimated metrics (e.g., recall) will have high variance. A more robust approach is to report their uncertainty via bootstrapping. Essentially, you create many “pseudo-datasets” by randomly sampling the known anomalies with replacement, re-compute the metrics for every replicate, and derive a confidence interval from the resulting distribution (e.g., take the 2.5th and 97.5th percentiles to get a 95% confidence interval). These uncertainty estimates tell you how much to trust the computed metrics.
Step (3): If you are not satisfied with the current detection performance, you can now actively tune the threshold based on these metrics. If your recall is too low (meaning that you’re missing too many known anomalies), you can lower the threshold. If you’re catching most anomalies but the false positive rate is higher than acceptable, you can raise the threshold and measure the trade-off. The bottom line is that you can now find the optimal balance between false positives and false negatives for your specific use case, based on real performance data.
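To make this concrete, here is a minimal sketch of the whole workflow. It assumes a fitted PyOD-style detector named model (higher scores mean more anomalous), an unlabeled training matrix X_train_unlabeled, and an array X_labeled_anomalies holding your known anomalies; these names are placeholders for illustration:

import numpy as np

# Step 1: set an initial threshold from the unlabeled data,
# e.g., the 99th percentile of the training scores
train_scores = model.decision_function(X_train_unlabeled)
threshold = np.percentile(train_scores, 99)

# Step 2: score the held-out known anomalies and check recall at this threshold
anomaly_scores = model.decision_function(X_labeled_anomalies)
recall = (anomaly_scores > threshold).mean()
print(f"Recall on known anomalies: {recall:.2f}")

# Pro tip: bootstrap the recall estimate to quantify its uncertainty
rng = np.random.default_rng(42)
boot_recalls = [
    (rng.choice(anomaly_scores, size=len(anomaly_scores), replace=True) > threshold).mean()
    for _ in range(1000)
]
ci_low, ci_high = np.percentile(boot_recalls, [2.5, 97.5])
print(f"95% bootstrap CI for recall: [{ci_low:.2f}, {ci_high:.2f}]")

# Step 3: if recall is too low, relax the threshold (e.g., to the 95th
# percentile of train_scores) and re-check the resulting alert volume

If the bootstrapped interval turns out to be wide, treat the recall estimate as a rough signal rather than a precise number.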
✨ Takeaway
The strength of this approach lies in its simplicity. You’re not changing your anomaly detection algorithm at all – you’re just using your labeled examples to intelligently tune a threshold you would have had to set anyway. With a handful of labeled anomalies, you can turn threshold selection from guesswork into an optimization problem with measurable outcomes.
2. Model Selection
Besides tuning the threshold, the labeled anomalies can also guide you toward better model choices and configurations.
Model selection is a common pain point every practitioner faces: with so many anomaly detection algorithms out there, each with their own hyperparameters, how do you know which combination will actually work well for your specific problem?
To effectively answer this question, we need a concrete way to measure how well different models and configurations perform on the dataset we are investigating.
This is exactly where those labeled anomalies become invaluable. Here’s the workflow:
Step (1): Train your candidate model (with a specific set of configurations) on the dataset, excluding those labeled anomalies, just as we did for threshold tuning.
Step (2): Score the entire dataset and calculate the average anomaly score percentile of your known anomalies. Specifically, for each of the labeled anomalies, you calculate what percentile it falls into of the distribution of the scores (e.g., if the score of a known anomaly is higher than 95% of all data points, it’s at the 95th percentile). Then, you average these percentiles across all your labeled anomalies. This way, you obtain a single metric that captures how well the model pushes known anomalies toward the top of the ranking. The higher this metric is, the better the model performs.
Step (3): You can apply this approach to identify the most promising hyperparameter configurations for a specific model type you have in mind (e.g., Local Outlier Factor, Gaussian Mixture Models, Autoencoder, etc.), or to select the model type that best aligns with your anomaly patterns.
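Here is a rough sketch of how this average-percentile metric could be computed and used to compare candidates. It assumes PyOD-style detectors (higher score = more anomalous) and two placeholder arrays, X_unlabeled and X_known_anomalies:

import numpy as np
from scipy.stats import percentileofscore
from pyod.models.iforest import IForest
from pyod.models.hbos import HBOS

def average_anomaly_percentile(model, X_unlabeled, X_known_anomalies):
    """Average score percentile of the known anomalies under a candidate model."""
    model.fit(X_unlabeled)  # train without the labeled anomalies
    X_all = np.vstack([X_unlabeled, X_known_anomalies])  # score the entire dataset
    all_scores = model.decision_function(X_all)
    anomaly_scores = model.decision_function(X_known_anomalies)
    return np.mean([percentileofscore(all_scores, s) for s in anomaly_scores])

# Compare a few candidate models/configurations on the same data
candidates = {
    "IForest_100": IForest(n_estimators=100, random_state=42),
    "IForest_300": IForest(n_estimators=300, random_state=42),
    "HBOS": HBOS(),
}
for name, candidate in candidates.items():
    avg_pct = average_anomaly_percentile(candidate, X_unlabeled, X_known_anomalies)
    print(f"{name}: average anomaly percentile = {avg_pct:.1f}")

One nice property of this ranking-based metric is that it sidesteps threshold selection entirely, which is exactly what you want at the model comparison stage.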
💡Pro Tip: Ensemble learning is increasingly common in production anomaly detection systems. Instead of relying on a single detection model, multiple detectors, possibly with different model types and configurations, run simultaneously to catch different types of anomalies. In this case, those labeled abnormal samples can help you gauge which candidate model instances actually deserve a spot in your final ensemble.
✨ Takeaway
Compared to the previous threshold tuning strategy, model selection moves from “tuning what you have” to “choosing what to use.”
Concretely, by using the average percentile ranking of your known anomalies as a performance metric, you can objectively compare different algorithms and configurations in terms of how well they identify the types of anomalies you actually encounter. As a result, your model selection is no longer a trial-and-error process, but a data-driven decision-making process.
3. Supervised Ensembling
So far, we’ve been discussing strategies where the labeled anomalies are primarily used as a validation tool, either for tuning the threshold or selecting promising models. We can, of course, put them to work more directly in the detection process itself.
This is where the idea of supervised ensembling comes in.
To better understand this approach, let’s first build some intuition for why it works.
We know that different anomaly detection methods often disagree about what looks suspicious. One algorithm might flag a data point as an anomaly while another might say it’s totally normal. But here’s the thing: these disagreements are quite informative, as they tell us a lot about that data point’s anomaly signature.
Consider the following scenario: suppose we have two data points, A and B. Data point A triggers alarms in a density-based method (e.g., Gaussian Mixture Models) but passes through an isolation-based one (e.g., Isolation Forest). Data point B, however, sets off alarms in both detectors. We would generally believe those two points carry completely different signatures, right?
Now the question is how to capture these signatures in a systematic way.
Luckily, we can resort to supervised learning. Here is how:
Step (1): Start by training multiple base anomaly detectors on your unlabeled data (excluding your precious labeled examples, of course).
Step (2): For each data point, collect the anomaly scores from all these detectors. This becomes your feature vector, which essentially encodes the “anomaly signature” we aim to mine. To give a concrete example, let’s say you used three base detectors (e.g., Isolation Forest, GMM, and PCA); then the feature vector for a single data point i would look like this:

X_i = [iForest_score, GMM_score, PCA_score]

The label for each data point is straightforward: 1 for the known anomalies and 0 for the rest of the samples.
Step (3): Train a standard supervised classifier using these newly composed feature vectors as inputs and the labels as targets. Although any off-the-shelf classification algorithm could in principle work, a common recommendation is to use gradient-boosted tree models, such as XGBoost, as they are adept at learning complex, non-linear patterns in the features and are robust to noisy labels (keep in mind that not all the unlabeled samples are necessarily normal).
Once trained, this supervised “meta-model” is your final anomaly detector. At inference time, you run new data through all base detectors and feed their outputs to your trained meta-model for the final decision, i.e., normal or abnormal.
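As a minimal sketch of how this could be wired up by hand (X_train, y_labels, and X_new are placeholder arrays, with y_labels marking the known anomalies as 1 and everything else as 0; the GMM wrapper is assumed to be available in your PyOD version):

import numpy as np
from pyod.models.iforest import IForest
from pyod.models.gmm import GMM
from pyod.models.pca import PCA
from xgboost import XGBClassifier

# Step 1: train the base detectors on the unlabeled portion only
X_unlabeled = X_train[y_labels == 0]
base_detectors = [IForest(random_state=42), GMM(n_components=4), PCA()]
for det in base_detectors:
    det.fit(X_unlabeled)

# Step 2: the detectors' anomaly scores become the feature vector
def score_features(X):
    return np.column_stack([det.decision_function(X) for det in base_detectors])

X_meta = score_features(X_train)

# Step 3: train the supervised meta-model on the score features
meta_model = XGBClassifier(n_estimators=200, learning_rate=0.1, eval_metric='aucpr')
meta_model.fit(X_meta, y_labels)

# Inference: run new data through the base detectors, then the meta-model
fraud_probability = meta_model.predict_proba(score_features(X_new))[:, 1]

The XGBOD method used in the case study below follows a similar recipe, except that it augments the original features with the base detectors’ scores before training the XGBoost meta-model.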
✨ Takeaway
With the supervised ensembling strategy, we are shifting the paradigm from using the labeled anomalies as passive validation tools to making them active participants in the detection process. The meta-classifier model we built learns how different detectors respond to anomalies. This not only improves detection accuracy, but more importantly, gives us a principled way to combine the strengths of multiple algorithms, making the anomaly detection system more robust and reliable.
If you’re thinking of implementing this strategy, the good news is that the PyOD library already provides this functionality. Let’s take a look at it next.
4. Case Study: Fraud Detection
In this section, let’s go through a concrete case study to see the supervised ensemble strategy in action. Here, we consider a method called XGBOD (Extreme Gradient Boosting Outlier Detection), which is implemented in the PyOD library.
For the case study, we consider a credit card fraud detection dataset (Database Contents License) from Kaggle. This dataset contains transactions made by credit cards in September 2013 by European cardholders. In total, there are 284,807 transactions, 492 of which are frauds. Note that due to confidentiality issues, the features in the dataset are not the original ones but the result of a PCA transformation. The feature ‘Class’ is the response variable; it takes the value 1 in case of fraud and 0 otherwise.
In this case study, we consider three learning paradigms, i.e., unsupervised learning, XGBOD, and fully supervised learning, for performing anomaly detection. We will vary the “supervision ratio” (percentage of anomalies that are available during training) for both XGBOD and the supervised learning approach to see the effect of leveraging labeled anomalies on the detection performance.
4.1 Import Libraries
For unsupervised anomaly detection, we consider 4 algorithms: Principal Component Analysis (PCA), Isolation Forest, Cluster-based Local Outlier Factor (CBLOF), and Histogram-based Outlier Detection (HBOS), which is an efficient detection method that assumes feature independence and calculates the degree of outlyingness by building histograms. All algorithms are implemented in the PyOD library.
For the supervised learning approach, we use an XGBoost classifier.
import pandas as pd
import numpy as np
# PyOD imports
# !pip install pyod
from pyod.models.xgbod import XGBOD
from pyod.models.pca import PCA
from pyod.models.iforest import IForest
from pyod.models.cblof import CBLOF
from pyod.models.hbos import HBOS
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (precision_recall_curve, average_precision_score,
                             roc_auc_score)
# !pip install xgboost
from xgboost import XGBClassifier
4.2 Data Preparation
Remember to download the dataset from Kaggle and store it locally under the name “creditcard.csv”.
# Load data
df = pd.read_csv('creditcard.csv')
X, y = df.drop(columns='Class').values, df['Class'].values
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)
print(f"Dataset shape: {X.shape}")
print(f"Fraud rate (%): {y.mean()*100:.4f}")
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
Here, we create a helper function to generate labeled data for XGBOD/XGBoost learning.
def create_supervised_labels(y_train, supervision_ratio=0.01):
    """
    Create supervised labels based on supervision ratio.
    """
    fraud_indices = np.where(y_train == 1)[0]
    n_labeled_fraud = int(len(fraud_indices) * supervision_ratio)

    # Randomly select labeled samples
    labeled_fraud_idx = np.random.choice(fraud_indices,
                                         n_labeled_fraud,
                                         replace=False)

    # Create labels
    y_labels = np.zeros_like(y_train)
    y_labels[labeled_fraud_idx] = 1

    # Calculate how many true frauds are in the "unlabeled" set
    unlabeled_fraud_count = len(fraud_indices) - n_labeled_fraud

    return y_labels, labeled_fraud_idx, unlabeled_fraud_count
Note that this function mimics the realistic scenario where we have a few known anomalies (labeled as 1), while all other unlabeled samples are treated as normal (labeled as 0). This means our labels are effectively noisy, since some true fraud cases are hidden among the unlabeled data but still receive a label of 0.
Before we start our analysis, let’s define a helper function for evaluating model performance:
def evaluate_model(model, X_test, y_test, model_name):
    """
    Evaluate a single model and return metrics.
    """
    # Get anomaly scores
    scores = model.decision_function(X_test)

    # Calculate metrics
    auc_pr = average_precision_score(y_test, scores)

    return {
        'model': model_name,
        'auc_pr': auc_pr,
        'scores': scores
    }
In the PyOD framework, every trained model instance exposes a decision_function() method. Calling it on the inference samples returns the corresponding anomaly scores.
For comparing performance, we use AUCPR, i.e., the area under the precision-recall curve. As we are dealing with a highly imbalanced dataset, AUCPR is generally preferred over AUC-ROC. Additionally, using AUCPR eliminates the need to fix an explicit threshold when measuring model performance, since the metric already aggregates performance across all possible threshold choices.
4.3 Unsupervised Anomaly Detection
models = {
    'IsolationForest': IForest(random_state=42),
    'CBLOF': CBLOF(),
    'HBOS': HBOS(),
    'PCA': PCA(),
}

for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train)
    result = evaluate_model(model, X_test, y_test, name)
    print(f"{name:20} - AUC-PR: {result['auc_pr']:.4f}")
The results we obtained are as follows:
- IsolationForest: AUC-PR = 0.1497
- CBLOF: AUC-PR = 0.1527
- HBOS: AUC-PR = 0.2488
- PCA: AUC-PR = 0.1411
With zero hyperparameter tuning, none of the algorithms delivered very promising results, as their AUCPR values (~0.15–0.25) may fall short of the very high precision/recall often required in fraud-detection settings.
However, we should note that, unlike AUC-ROC, which has a baseline value of 0.5, the baseline AUCPR depends on the prevalence of the positive class. For our current dataset, since only 0.17% of the samples are fraud, a naive classifier that guesses randomly would have an AUCPR ≈ 0.0017. In that sense, all detectors already outperform random guessing by a wide margin.
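As a quick sanity check, this baseline can be computed directly from the test labels, since the expected AUCPR of a random scorer equals the positive-class prevalence:

# Expected AUC-PR of a random scorer equals the fraction of positive samples
baseline_aucpr = y_test.mean()
print(f"Baseline AUC-PR (random guessing): {baseline_aucpr:.4f}")  # ~0.0017 here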
4.4 XGBOD Approach
Now we move to the XGBOD approach, where we will leverage a few labeled anomalies to inform our anomaly detection.
supervision_ratios = [0.01, 0.02, 0.05, 0.1, 0.15, 0.2]
for ratio in supervision_ratios:
    # Create supervised labels
    y_labels, labeled_fraud_idx, unlabeled_fraud_count = create_supervised_labels(y_train, ratio)

    total_fraud = sum(y_train)
    labeled_fraud = sum(y_labels)

    print(f"Known frauds (labeled as 1): {labeled_fraud}")
    print(f"Hidden frauds in 'normal' data: {unlabeled_fraud_count}")
    print(f"Total samples treated as normal: {len(y_train) - labeled_fraud}")
    print(f"Fraud contamination in 'normal' set: {unlabeled_fraud_count/(len(y_train) - labeled_fraud)*100:.3f}%")

    # Train XGBOD models
    xgbod = XGBOD(estimator_list=[PCA(), CBLOF(), IForest(), HBOS()],
                  random_state=42,
                  n_estimators=200, learning_rate=0.1,
                  eval_metric='aucpr')
    xgbod.fit(X_train, y_labels)

    result = evaluate_model(xgbod, X_test, y_test, f"XGBOD_ratio_{ratio:.3f}")
    print(f"xgbod - AUC-PR: {result['auc_pr']:.4f}")
The obtained results are shown in the figure below, together with the performance of the best unsupervised detector (HBOS) as the reference.
We can see that with only 1% labeled anomalies, the XGBOD method already beats the best unsupervised detector, achieving an AUCPR score of 0.4. With more labeled anomalies becoming available for training, XGBOD’s performance continues to improve.
4.5 Supervised Learning
Finally, we consider the scenario where we directly train a binary classifier on the dataset with the labeled anomalies.
for ratio in supervision_ratios:
    # Create supervised labels
    y_label, labeled_fraud_idx, unlabeled_fraud_count = create_supervised_labels(y_train, ratio)

    clf = XGBClassifier(n_estimators=200, random_state=42,
                        learning_rate=0.1, eval_metric='aucpr')
    clf.fit(X_train, y_label)

    y_pred_proba = clf.predict_proba(X_test)[:, 1]
    auc_pr = average_precision_score(y_test, y_pred_proba)
    print(f"XGBoost - AUC-PR: {auc_pr:.4f}")
The results are shown in the figure below, together with the XGBOD’s performance obtained from the previous section:

In general, we see that with only limited labeled data, the standard supervised classifier (XGBoost in this case) struggles to distinguish between normal and anomalous samples effectively. This is particularly evident when the supervision ratio is extremely low (i.e., 1%). While XGBoost’s performance improves as more labeled examples become available, we see that it remains consistently inferior to the XGBOD approach across the examined range of supervision ratios.
5. Conclusion
In this post, we discussed three practical strategies to leverage the few labeled anomalies to boost the performance of your anomaly detector:
- Threshold tuning: Use labeled anomalies to turn threshold setting from guesswork into a data-driven optimization problem.
- Model selection: Objectively compare different algorithms and hyperparameter settings to find what truly works well for your specific problems.
- Supervised ensembling: Train a meta-model to systematically extract the anomaly signatures revealed by multiple unsupervised detectors.
Furthermore, we went through a concrete case study on fraud detection and showed how the supervised ensembling method (XGBOD) dramatically outperformed both purely unsupervised models and standard supervised classifiers, especially when labeled data was scarce.
The key takeaway: a few labels go a long way in anomaly detection. Time to put those labels to work.