When you present your anomaly detection results to your stakeholders, the immediate next question is always “why?”.
In practice, simply flagging an anomaly is rarely enough. Understanding what went wrong is crucial to determining the best next action.
Yet most machine learning-based anomaly detection methods stop at producing an anomaly score. They are black boxes by nature, which makes it painful to make sense of their outputs: why does this sample have a higher anomaly score than its neighbors?
To tackle this explainability challenge, you have likely already resorted to popular eXplainable AI (XAI) techniques. Perhaps you are calculating feature importance to identify which variables drive the abnormality, or running counterfactual analysis to see how close a case was to normal.
These are useful, but what if you could do more? What if you could derive a set of interpretable IF-THEN rules that characterize the identified anomalies?
This is exactly what the RuleFit algorithm [1] promises.
In this post, we’ll build an intuition for how the RuleFit algorithm works, see how it can be applied to explain detected anomalies, and walk through a concrete case study.
1. How Does It Work?
Before diving into the technical details, let’s first clarify what we aim to obtain after applying the algorithm: a set of IF-THEN rules that quantitatively characterize the abnormal samples, together with the importance of those rules.
To get there, we need to answer two questions:
(1) How do we generate meaningful IF-THEN conditions from the data?
(2) How do we calculate the rule importance score to determine which ones actually matter?
The RuleFit algorithm addresses these questions by splitting the work into two complementary parts, the “Rule” and the “Fit”.
1.1 The “Rule” in RuleFit
In RuleFit, a rule looks like this:
IF x1 < 10 AND x2 > 5 THEN 1 ELSE 0
Would this structure look a bit more familiar if we visualized it like this:
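          [x1 < 10?]
           /      \
         yes       no
         /           \
    [x2 > 5?]         0
     /     \
   yes      no
   /          \
  1            0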
Yes, it is a decision tree! The rule here is just traversing one specific path through the tree, from the root node to the leaf node.
In RuleFit, the rule generation process heavily relies on building decision trees, which predict the target outcome given the input features. Once the tree is built, any path from the root to a node in a tree can be converted to a decision rule, as we have just seen in the example above.
To ensure the rules are diverse, RuleFit doesn’t just fit one decision tree. Instead, it leverages tree-ensemble algorithms (e.g., random forests or gradient-boosted trees) to generate many different decision trees.
Also, the trees in the ensemble generally have different depths. This yields rules of variable lengths, further enhancing their diversity.
Here, we should note that although the ensemble trees are built with predicting the target outcome in mind, the RuleFit algorithm does not really care about the end prediction results. It merely uses this tree-building exercise as the vehicle to extract meaningful, quantitative rules.
Effectively, this means that we will discard the predicted value in each node and only keep the conditions that lead us to a node. Those conditions produce the rules we care about.
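To make this concrete, here is a minimal sketch, assuming a fitted scikit-learn decision tree, of how every root-to-node path can be read off as a rule. The extract_rules helper below is purely illustrative, not the actual RuleFit internals:
def extract_rules(fitted_tree, feature_names):
    t = fitted_tree.tree_  # low-level tree structure of a fitted sklearn tree
    rules = []
    def recurse(node, conditions):
        if conditions:  # every root-to-node path yields one candidate rule
            rules.append(" AND ".join(conditions))
        if t.children_left[node] == -1:  # leaf node: stop descending
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        recurse(t.children_left[node], conditions + [f"{name} <= {thr:.2f}"])
        recurse(t.children_right[node], conditions + [f"{name} > {thr:.2f}"])
    recurse(0, [])
    return rules
Each returned string is one IF-condition; the predicted values stored at the nodes are simply ignored, exactly as described above.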
OK, we can now wrap up the first processing step of the RuleFit algorithm: rule building. The outcome of this step is a pool of candidate rules that could potentially explain the observed data behavior.
But out of all those rules, which ones actually deserve our attention?
Well, this is where the second step of RuleFit comes in. We “fit” to rank.
1.2 The “Fit” in RuleFit
Essentially, RuleFit uncovers the most important rules via feature selection.
First, RuleFit treats each rule as a new binary feature: if the rule is satisfied for a specific sample, the feature takes the value 1; otherwise, it takes 0.
Then, RuleFit performs sparse linear regression with Lasso by using all the “raw” features from the original dataset, as well as the newly engineered binary features derived from the rules, to predict the target outcome. This way, each feature (raw features + binary rule features) gets a coefficient.
One key characteristic of Lasso is that its L1 penalty forces the coefficients of unimportant features to be exactly zero, which effectively removes those features from the model.
As a result, by simply examining which binary rule features survived the Lasso analysis, we would immediately know which rules are important in terms of getting accurate predictions of the target outcome. In addition, by looking at the coefficient magnitudes associated with the rule features, we would be able to rank the importance of the rules.
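Here is a minimal sketch of the “Fit” step, assuming each rule is a string that pandas can evaluate against the feature DataFrame (e.g., "petal_length > 5.45 and petal_width > 2.0"); the fit_and_rank helper is hypothetical, not the imodels implementation:
import pandas as pd
from sklearn.linear_model import Lasso

def fit_and_rank(X, y, rules, alpha=0.01):
    # Turn each rule into a 0/1 feature: 1 where the rule fires for a sample
    rule_features = pd.DataFrame({r: X.eval(r).astype(int) for r in rules})
    design = pd.concat([X, rule_features], axis=1)  # raw + rule features
    lasso = Lasso(alpha=alpha).fit(design, y)
    coefs = pd.Series(lasso.coef_, index=design.columns)
    # Only terms with non-zero coefficients survive the L1 penalty
    return coefs[coefs != 0].sort_values(key=abs, ascending=False)
The rules that survive with the largest absolute coefficients are the ones reported as important.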
1.3 Recap
We have just covered the essential theory behind the RuleFit algorithm. To summarize, we can view this approach as a two-step solution for providing explainability:
(1) It first extracts the rules by training an ensemble of decision trees. That’s the “Rule” part.
(2) It then cleverly converts those rules into binary features and performs standard feature selection by using sparse linear regression (Lasso). That’s the “Fit” part.
Finally, the surviving rules with non-zero coefficients are the important ones worth our attention.
At this point, you may have noticed that “predicting the target outcome” pops up in both the “Rule” and the “Fit” steps. If we are dealing with a regression or classification problem, the “target outcome” is simply the numerical value or the label we want to predict, and the rules can be interpreted as patterns that drive the prediction.
But what about anomaly detection, which is largely an unsupervised task? How can we apply RuleFit there?
2. Anomaly Explanation with RuleFit
2.1 Application Pattern
To begin with, we need to transform the unsupervised explainability problem into a supervised one. Here’s how.
Once we have our anomaly detection results (doesn’t matter which algorithm we used), we can create binary labels, i.e., 1 for an identified anomaly and 0 for a normal data point, as our “target outcome.” This way, we have exactly what RuleFit needs: the raw features, and the target outcome to predict.
Then RuleFit can work its magic: it generates a pool of candidate rules and fits a sparse linear model to retain only the important ones. The coefficients of the resulting model indicate how much each rule contributes to the log-odds of an instance being classified as an anomaly. Put another way, they tell us which rule combinations most strongly push a sample toward being labeled as anomalous.
Note that you can, in theory, also use the anomaly score (produced by the primary anomaly detection model) as the “target outcome”. This will change the application of RuleFit from a classification setting to a regression setting.
Both approaches are valid, but they answer slightly different questions: with the binary-label classification setting, RuleFit uncovers “What makes something an anomaly?”; with the anomaly-score regression setting, RuleFit uncovers “What drives the severity of an anomaly?”.
In practice, the rules generated by both approaches will probably be very similar. Nevertheless, using binary anomaly labels as the target is the more common choice for explaining detected anomalies: it is straightforward to interpret, and the resulting rules translate directly into business rules for flagging future anomalies.
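Put together, the application pattern fits in a few lines. The sketch below assumes a fitted pyOD-style detector and the imodels RuleFitClassifier that we will use in the case study; `detector` and `X` are placeholders:
from imodels import RuleFitClassifier

# `detector` is any fitted pyOD model, X is the feature DataFrame (placeholders)
pseudo_labels = detector.predict(X)  # pyOD convention: 0 = normal, 1 = anomaly
explainer = RuleFitClassifier()
explainer.fit(X, pseudo_labels)  # learn rules that mimic the detector's decisions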
2.2 Case Study
Let’s walk through a concrete example to see how RuleFit works in action. Here, we’ll create an anomaly detection scenario using the Iris dataset [2] (licensed CC BY 4.0), where each sample consists of 4 features (sepal_length, sepal_width, petal_length, petal_width) and is labeled as one of the following three categories: Setosa, Versicolor, and Virginica.
Step 1: Data Setup
First, we’ll use all Setosa samples (50) and all Versicolor samples (50) as the “normal” samples. For the “abnormal” samples, we’ll use a subset of Virginica samples (10).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
np.random.seed(42)
# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y_true = iris.target
# Get normal samples (Setosa + Versicolor)
normal_mask = (y_true == 0) | (y_true == 1)
X_normal_all = X[normal_mask].copy()
# Get Virginica samples
virginica_mask = (y_true == 2)
X_virginica = X[virginica_mask].copy()
# Randomly select 10
anomaly_indices = np.random.choice(len(X_virginica), size=10, replace=False)
X_anomalies = X_virginica.iloc[anomaly_indices].copy()
To make the scenario more realistic, we create separate training and test sets. The training set contains only “normal” samples, while the test set consists of the remaining 20 “normal” samples and the 10 “abnormal” samples.
train_indices = np.random.choice(len(X_normal_all), size=80, replace=False)
test_indices = np.setdiff1d(np.arange(len(X_normal_all)), train_indices)
X_train = X_normal_all.iloc[train_indices].copy()
X_normal_test = X_normal_all.iloc[test_indices].copy()
# Create test set (20 normal + 10 anomalous)
X_test = pd.concat([X_normal_test, X_anomalies], ignore_index=True)
y_test_true = np.concatenate([
np.zeros(len(X_normal_test)),
np.ones(len(X_anomalies))
])
Step 2: Anomaly Detection
Next, we perform anomaly detection. Here, we pretend we don’t know the actual labels. In this case study, we apply the Local Outlier Factor (LOF) as the anomaly detection algorithm; it locates anomalies by measuring how isolated a data point is relative to the density of its local neighborhood. Of course, you can also try other anomaly detection algorithms, such as Gaussian Mixture Models (GMM), K-Nearest Neighbors (KNN), and Autoencoders, among others. However, keep in mind that the intention here is only to obtain detection results; our main focus is the anomaly explanation in Step 3.
Specifically, we’ll use the pyOD library to train the model and make inferences:
# Install the pyOD library
#!pip install pyod
from pyod.models.lof import LOF
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Local Outlier Factor
lof = LOF(n_neighbors=3)
lof.fit(X_train_scaled)
train_scores = lof.decision_function(X_train_scaled)
test_scores = lof.decision_function(X_test_scaled)
threshold = np.percentile(train_scores, 99)  # 99th percentile of training scores
y_pred = (test_scores > threshold).astype(int)
Notice that we have used the 99th percentile of the anomaly scores obtained on the training set as the threshold. For each test sample, if its anomaly score is higher than this threshold, the sample is labeled as “anomaly”; otherwise, it is considered “normal”.
At this stage, we can quickly check the detection performance with:
print(classification_report(y_test_true, y_pred, target_names=['Normal', 'Anomaly']))
[Image: classification report for the LOF detector]
Not great results: out of 10 true anomalies, only 5 are caught. The good news, however, is that LOF didn’t produce any false positives. You can further improve the performance by tuning the LOF hyperparameters, adjusting the threshold, or even considering ensemble learning strategies. But keep in mind: our goal here is not the best detection accuracy. Instead, we want to see whether RuleFit can properly generate rules to explain the anomalies detected by the LOF model.
Step 3: Anomaly Explanation
Now we are getting to the core topic. To apply RuleFit, let’s first install the imodels library, an sklearn-compatible package for interpretable machine learning, aimed at concise, transparent, and accurate predictive modeling:
pip install imodels
In this case, we will consider a binary label classification setting, where the abnormal samples (in the test set) flagged by the LOF model are labeled as 1, and other un-flagged normal samples (also in the test set) are labeled as 0. Note that we are labeling based on LOF’s detection results, not the actual ground truth, which we pretend we don’t know.
To initiate the RuleFit model:
from imodels import RuleFitClassifier
rf = RuleFitClassifier(
    max_rules=30,
    lin_standardise=True,
    include_linear=True,
    random_state=42
)
We can then proceed with fitting the RuleFit model:
rf.fit(
    X_test,
    y_pred,
    feature_names=X_test.columns
)
As a quick sanity check, it is good practice to evaluate how well the RuleFit model’s predictions align with the anomaly labels determined by the LOF algorithm:
from sklearn.metrics import accuracy_score, roc_auc_score
y_label = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_pred, y_label))
print("roc-auc:", roc_auc_score (y_pred, y_prob))
For our case, we see that both printouts are 1. This confirms that the RuleFit model has successfully learned the patterns that LOF used to identify anomalies. For your own problems, if you observe values much lower than 1, you would need to fine-tune your RuleFit hyperparameters.
Now let’s examine the rules:
# Extract all terms (raw features + rules) learned by the model
rules = rf._get_rules()
rules = rules[rules.coef != 0]  # keep only terms that survived the L1 penalty
rules = rules[~rules.type.str.contains('linear')]  # drop raw-feature terms, keep rules
rules['abs_coef'] = rules['coef'].abs()
rules = rules.sort_values('importance', ascending=False)
The RuleFit algorithm returns a total of 24 rules. A snapshot is shown below:
[Image: snapshot of the extracted rules]
Let’s first clarify the meaning of the results columns:
- The “rule” column and the “abs_coef” column are self-explanatory.
- The “type” column has two unique values: “linear” and “rule”. The “linear” denotes the original input features, while “rule” denotes the “IF-THEN” conditions generated from decision trees.
- The “coef” column represents the coefficients produced by the Lasso regression analysis. A positive value indicates that if the rule applies, the log-odds of being classified as the abnormal class increases. A larger magnitude indicates a stronger influence of that rule on the prediction.
- The “support” column records the fraction of data samples where the rule applies.
- The “importance” column is calculated as the absolute value of the coefficient multiplied by the standard deviation of the binary (0 or 1) values that the rule takes on. Why this calculation? As we have just discussed, a larger absolute coefficient means a stronger direct impact on the log-odds. The standard deviation term, in turn, measures the “discriminative power” of the rule: if a rule is almost always TRUE (very small standard deviation), it doesn’t split the data effectively, and the same holds if it is almost always FALSE. In other words, such a rule cannot explain much of the variation in the target variable. The importance score therefore combines the strength of the rule’s impact (coefficient magnitude) with how well it discriminates between samples (standard deviation), as the short sketch below illustrates.
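Since a rule feature only takes the values 0 and 1, its standard deviation equals sqrt(s * (1 - s)), where s is the rule’s support. A small illustrative helper (not the imodels internals):
import numpy as np

def rule_importance(coef, support):
    # std of a 0/1 feature with support s is sqrt(s * (1 - s))
    return abs(coef) * np.sqrt(support * (1 - support))

print(rule_importance(4.448999, 0.1667))  # Rule #24 (see below): ~1.66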
For our specific case, we see only one high-impact rule (Rule #24):
If a flower’s petal is longer than 5.45 cm and wider than 2 cm, the odds that LOF classifies it as “anomalous” increase 85-fold. (Note that exp(4.448999) ~= 85)
Rules #26 and #27 are nested inside Rule #24. This is common in practice: RuleFit often produces “families” of similar rules because they come from neighboring tree splits. Therefore, the only rule that truly matters for characterizing the LOF-identified anomalies is Rule #24.
Also, we see that the support for Rule #24 is 0.1667 (5/30). This effectively means that all 5 LOF-identified anomalies can be explained by this rule. We can see that more clearly in the figure below:
[Image: the five LOF-identified anomalies, all covered by Rule #24]
There you have it: the rule to describe the identified anomalies!
3. Conclusion
In this blog post, we explored the RuleFit algorithm as a powerful solution for explainable anomaly detection. We discussed:
- How it works: A two-step approach where decision trees are first fitted to derive meaningful rules, followed by a sparse linear regression to rank the rule importance.
- How to apply it to anomaly explanation: Treat the detection results as pseudo labels and use them as the “target outcome” for the RuleFit model.
With RuleFit in your modeling toolkit, the next time stakeholders ask “Why is this anomaly?”, you’ll have concrete IF-THEN rules that they can understand and act upon.
Reference
[1] Jerome H. Friedman and Bogdan E. Popescu, “Predictive learning via rule ensembles,” The Annals of Applied Statistics, 2(3):916-954, 2008.
[2] R. A. Fisher, Iris [Data set], UCI Machine Learning Repository, 1936.