A Practical Starters’ Guide to Causal Structure Learning with Bayesian Methods in Python

Uncovering causal relationships across variables can be a challenging but important step toward strategic action. I will summarize the concepts of causal models in terms of Bayesian probabilistic models, followed by a hands-on tutorial to detect causal relationships using Bayesian structure learning and parameter learning, and to further examine the model using inferences. I will use the sprinkler data set to conceptually explain how structures are learned with the Python library bnlearn. After reading this blog, you will be able to create causal networks and make inferences on your own data set.



Background.

The use of machine learning techniques has become a standard toolkit to obtain useful insights and make predictions in many areas, such as disease prediction, recommendation systems, and natural language processing. Although good performance can be achieved, it is not straightforward to extract causal relationships with, for example, the target variable. In other words: which variables have a direct causal effect on the target variable? Such insights are important to determine the driving factors behind an outcome, so that strategic actions can be taken. A branch of machine learning is Bayesian probabilistic graphical models, also named Bayesian networks (BN), which can be used to determine such causal factors. Note that many aliases exist for Bayesian graphical models, such as: Bayesian networks, Bayesian belief networks, Bayes Net, causal probabilistic networks, and influence diagrams.

Let’s rehash some terminology before we jump into the technical details of causal models. It is common to use the terms “correlation” and “association” interchangeably. But we all know that correlation or association is not causation. In other words, an observed relationship between two variables does not necessarily mean that one causes the other. Technically, correlation refers to a linear relationship between two variables, whereas association refers to any relationship between two (or more) variables. Causation, on the other hand, means that one variable (often called the predictor or independent variable) causes the other (often called the outcome or dependent variable) [1]. In the next two sections, I will briefly describe correlation and association by example.


Correlation.

Pearson’s correlation is the most commonly used correlation coefficient. It is so common that it is often used synonymously with correlation. Its strength is denoted by r, which measures the strength of a linear relationship in a sample on a standardized scale from -1 to 1. There are three possible results when using correlation (a minimal computation example follows the list):

  • Positive correlation: a relationship between two variables in which both variables move in the same direction
  • Negative correlation: a relationship between two variables in which an increase in one variable is associated with a decrease in the other, and
  • No correlation: when there is no relationship between two variables.
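
To make the scale of r concrete, here is a minimal sketch that computes Pearson’s r with scipy. The numbers below are made up purely for illustration and only loosely echo the chocolate/Nobel example in Figure 1.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical example: chocolate consumption (kg/capita) and Nobel laureates
# per 10 million inhabitants for a handful of fictitious countries.
chocolate = np.array([1.8, 3.5, 4.5, 6.3, 8.5, 10.2, 11.9])
laureates = np.array([0.1, 1.7, 3.3, 11.4, 18.9, 25.3, 31.9])

# Pearson's r on a standardized scale from -1 to 1, plus a two-sided p-value
r, p_value = pearsonr(chocolate, laureates)
print(f'r = {r:.2f}, p-value = {p_value:.4f}')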

An example of positive correlation is demonstrated in Figure 1, where the relationship is seen between chocolate consumption and the number of Nobel Laureates per country [2].

Figure 1: correlation between Chocolate consumption vs. Nobel Laureates

The figure shows that chocolate consumption could imply an increase in Nobel Laureates. Or the other way around, an increase in Nobel laureates could likewise underlie an increase in chocolate consumption. Despite the strong correlation, it is more plausible that unobserved variables such as socioeconomic status or quality of the education system might cause an increase in both chocolate consumption and Nobel Laureates. Or in other words, it is still unknown whether the relationship is causal [2]. This does not mean that correlation by itself is useless; it simply has a different purpose [3]. Correlation by itself does not imply causation because statistical relations do not uniquely constrain causal relations. In the next section, we will dive into associations. Keep on reading!


Association.

When we talk about association, we mean that certain values of one variable tend to co-occur with certain values of the other variable. From a statistical point of view, there are many measures of association, such as the chi-square test, Fisher’s exact test, hypergeometric test, etc. Association measures are used when one or both variables are categorical, that is, either nominal or ordinal. It should be noted that correlation is a technical term, whereas the term association is not, and therefore, there is not always consensus about the meaning in statistics. This means that it’s always a good practice to state the meaning of the terms you’re using. More information about associations can be found at this GitHub repo: Hnet [5].

To demonstrate the use of associations, I will use the Hypergeometric test and quantify whether two variables are associated in the predictive maintenance data set [9] (CC BY 4.0 licence). The predictive maintenance data set is a so-called mixed-type data set containing a combination of continuous, categorical, and binary variables. It captures operational data from machines, including both sensor readings and failure events. The data set also records whether specific types of failures occurred, such as tool wear failure or heat dissipation failure, represented as binary variables. See the table below with details about the variables.

The table provides an overview of the variables in the predictive maintenance data set. There are different types of variables: identifiers, sensor readings, and target variables (failure indicators). Each variable is characterized by its role, data type, and a brief description.

Two of the most important variables are machine failure and power failure (PWF). We would expect a strong association between these two variables. Let me demonstrate how to compute the association between the two. First, we need to install the bnlearn library and load the data set.

# Install the Python bnlearn package (from the command line)
# pip install bnlearn

import bnlearn
import pandas as pd
from scipy.stats import hypergeom

# Load the predictive maintenance data set
df = bnlearn.import_example(data='predictive_maintenance')

# print dataframe
print(df)
+-------+------------+------+------------------+----+-----+-----+-----+-----+
|  UDI | Product ID  | Type | Air temperature  | .. | HDF | PWF | OSF | RNF |
+-------+------------+------+------------------+----+-----+-----+-----+-----+
|    1 | M14860      |   M  | 298.1            | .. |   0 |   0 |   0 |   0 |
|    2 | L47181      |   L  | 298.2            | .. |   0 |   0 |   0 |   0 |
|    3 | L47182      |   L  | 298.1            | .. |   0 |   0 |   0 |   0 |
|    4 | L47183      |   L  | 298.2            | .. |   0 |   0 |   0 |   0 |
|    5 | L47184      |   L  | 298.2            | .. |   0 |   0 |   0 |   0 |
| ...  | ...         | ...  | ...              | .. | ... | ... | ... | ... |
| 9996 | M24855      |   M  | 298.8            | .. |   0 |   0 |   0 |   0 |
| 9997 | H39410      |   H  | 298.9            | .. |   0 |   0 |   0 |   0 |
| 9998 | M24857      |   M  | 299.0            | .. |   0 |   0 |   0 |   0 |
| 9999 | H39412      |   H  | 299.0            | .. |   0 |   0 |   0 |   0 |
|10000 | M24859      |   M  | 299.0            | .. |   0 |   0 |   0 |   0 |
+-------+-------------+------+------------------+----+-----+-----+-----+-----+
[10000 rows x 14 columns]

Null hypothesis: There is no association between machine failure and power failure (PWF).

print(df[['Machine failure','PWF']])

| Index | Machine failure | PWF |
|-------|------------------|-----|
| 0     | 0                | 0   |
| 1     | 0                | 0   |
| 2     | 0                | 0   |
| 3     | 0                | 0   |
| 4     | 0                | 0   |
| ...   | ...              | ... |
| 9995  | 0                | 0   |
| 9996  | 0                | 0   |
| 9997  | 0                | 0   |
| 9998  | 0                | 0   |
| 9999  | 0                | 0   |
|-------|------------------|-----|

# Total number of samples (population size)
N = df.shape[0]

# Number of successes in the population (machine failures)
K = sum(df['Machine failure']==1)

# Sample size / number of draws (power failures)
n = sum(df['PWF']==1)

# Overlap between power failure and machine failure
x = sum((df['PWF']==1) & (df['Machine failure']==1))

print(x, N, n, K)
# 95 10000 95 339

# Compute the p-value P(X >= x): the probability of observing an overlap of
# at least x under the null hypothesis of no association
P = hypergeom.sf(x-1, N, n, K)
# Equivalent to: hypergeom.sf(94, 10000, 95, 339)

print(P)
# 1.669e-146

The hypergeometric test uses the hypergeometric distribution to measure the statistical significance of the observed overlap. In this example, N is the population size (10000), K is the number of successes in the population, i.e., machine failures (339), n is the sample size/number of draws, i.e., power failures (95), and x is the observed overlap between the two (95). The p-value is the probability of observing an overlap of at least x under the null hypothesis.

P(X ≥ x) = Σ_{k=x}^{min(K,n)} [ C(K,k) · C(N−K, n−k) ] / C(N,n)

Equation 1: Test the association between machine failure and power failure using the hypergeometric test.

We can reject the null hypothesis under alpha=0.05, and therefore, we can speak about a statistically significant association between machine failure and power failure. Importantly, association by itself does not imply causation. Strictly speaking, this statistic also does not tell us the direction of impact. We need to distinguish between marginal associations and conditional associations. The latter is the key building block of causal inference. Now that we have learned about associations, we can continue to causation in the next section!


Causation.

Causation means that one (independent) variable causes the other (dependent) variable and is formulated by Reichenbach (1956) as follows:

If two random variables X and Y are statistically dependent, then either (a) X causes Y, (b) Y causes X, or (c) there exists a third variable Z that causes both X and Y. In the latter case, X and Y become independent given Z, i.e., X⊥Y∣Z.

This definition is incorporated in Bayesian graphical models. To explain this more thoroughly, let’s start with the graph and visualize the statistical dependencies between the three variables described by Reichenbach (X, Y, Z) as shown in Figure 2. Nodes correspond to variables (X, Y, Z), and the directed edges (arrows) indicate dependency relationships or conditional distributions.

Figure 2: DAGs encode conditional independencies. (a, b, c) are equivalence classes. (a, b) Cascade, (c) Common parent, and (d) is a special class with a V-structure.

Four graphs can be created: (a) and (b) are cascade, (c) common parent, and (d) the V-structure. These four graphs form the basis for every Bayesian network.

1. How can we tell what causes what?

The conceptual idea to determine the direction of causality, thus which node influences which node, is by holding one node constant and then observing the effect. As an example, let’s take DAG (a) in Figure 2, which describes that Z is caused by X, and Y is caused by Z. If we now keep Z constant, there should not be a change in Y if this model is true. Every Bayesian network can be described by these four graphs, and with probability theory (see the section below) we can glue the parts together.
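
To make this concrete, the sketch below simulates the common-parent structure of Figure 2c (Z causes both X and Y, with no direct edge between X and Y) and shows that the marginal dependence between X and Y disappears once we condition on (hold constant) Z. The probabilities are chosen purely for illustration.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Common parent Z causes both X and Y (Figure 2c); there is no direct X-Y edge
Z = rng.binomial(1, 0.5, n)
X = rng.binomial(1, np.where(Z == 1, 0.8, 0.2))
Y = rng.binomial(1, np.where(Z == 1, 0.7, 0.3))
df_sim = pd.DataFrame({'X': X, 'Y': Y, 'Z': Z})

# Marginally, X and Y look associated (correlation around 0.24)...
print(df_sim['X'].corr(df_sim['Y']))

# ...but conditioning on Z (holding it constant) removes the association (~0)
print(df_sim[df_sim['Z'] == 0]['X'].corr(df_sim[df_sim['Z'] == 0]['Y']))
print(df_sim[df_sim['Z'] == 1]['X'].corr(df_sim[df_sim['Z'] == 1]['Y']))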

It should be noted that a Bayesian network is a Directed Acyclic Graph (DAG), and DAGs are causal. This means that the edges in the graph are directed and there are no (feedback) loops (acyclic).
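
As a quick sanity check, you can verify that a candidate graph is acyclic with networkx. The edge list below is simply the sprinkler structure that is used later in this article.

import networkx as nx

# Candidate edges (source -> target) for the sprinkler example
edges = [('Cloudy', 'Sprinkler'), ('Cloudy', 'Rain'),
         ('Sprinkler', 'Wet_Grass'), ('Rain', 'Wet_Grass')]

G = nx.DiGraph(edges)
print(nx.is_directed_acyclic_graph(G))   # True: directed and no feedback loops

# Adding a feedback loop breaks the DAG property
G.add_edge('Wet_Grass', 'Cloudy')
print(nx.is_directed_acyclic_graph(G))   # False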

2. Probability theory.

Probability theory, or more specifically Bayes’ theorem (Bayes’ rule), forms the foundation of Bayesian networks. Bayes’ rule is used to update model information, and is stated mathematically as the following equation:

P(Z|X) = P(X|Z) · P(Z) / P(X)

Equation 2: Bayes’ rule.

The equation consists of four parts (a small worked example follows the list):

  • The posterior probability is the probability that Z occurs given X.
  • The conditional probability or likelihood is the probability of the evidence given that the hypothesis is true. This can be derived from the data.
  • Our prior belief is the probability of the hypothesis before observing the evidence. This can also be derived from the data or domain knowledge.
  • The marginal probability describes the probability of the new evidence under all possible hypotheses, which needs to be computed.
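
As a small worked example with made-up numbers purely for illustration: suppose the prior probability of rain is P(Rain=1) = 0.2, the likelihood of wet grass given rain is P(Wet=1|Rain=1) = 0.9, and the marginal probability of wet grass is P(Wet=1) = 0.36. Bayes’ rule then gives the posterior:

P(Rain=1|Wet=1) = (0.9 · 0.2) / 0.36 = 0.5

Observing wet grass thus updates the probability of rain from 20% (prior) to 50% (posterior).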

If you want to read more about the (factorized) probability distribution or more details about the joint distribution for a Bayesian network, try this blog [6].

3. Bayesian Structure Learning to estimate the DAG.

With structure learning, we want to determine the structure of the graph that best captures the causal dependencies between the variables in the data set.

A naïve manner to find the best DAG is by simply creating all possible combinations of the graph, i.e., by making tens, hundreds, or even thousands of different DAGs until all combinations are exhausted. Each DAG can then be scored on its fit to the data, and the best-scoring DAG is returned. In the case of variables X, Y, Z, one can make the graphs shown in Figure 2 and a few more, because it is not only X>Z>Y (Figure 2a); it can also be Z>X>Y, etc. The variables X, Y, Z can be boolean (True or False), but can also have multiple states. The search space of possible DAGs grows super-exponentially with the number of variables, which means that an exhaustive search becomes practically infeasible for larger numbers of nodes (see the sketch below). Therefore, various greedy strategies have been proposed to browse the DAG space. With optimization-based search approaches, it is possible to browse a larger DAG space. Such approaches require a scoring function and a search strategy. A common scoring function is the posterior probability of the structure given the training data, such as BIC or BDeu.
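
To get a feel for how quickly the search space explodes, the sketch below counts the number of possible (labeled) DAGs for a given number of nodes using Robinson's recurrence. This is not part of bnlearn; it is only meant to illustrate why exhaustive search breaks down so quickly.

from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def num_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

for i in range(1, 8):
    print(i, num_dags(i))
# 1 1
# 2 3
# 3 25
# 4 543
# 5 29281
# 6 3781503
# 7 1138779265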

Before we jump into the examples, it is always good to understand when to use which technique. There are two broad approaches to search throughout the DAG space and find the best-fitting graph for the data.

  • Score-based structure learning
  • Constraint-based structure learning

Note that a local search strategy makes incremental changes aimed at improving the score of the structure. A global search algorithm like Markov chain Monte Carlo can avoid getting trapped in local optima, but I will not discuss that here.

4. Score-based Structure Learning.

Score-based approaches have two main components:

  1. The search algorithm to optimize throughout the search space of all possible DAGs, such as ExhaustiveSearch, Hillclimbsearch, or Chow-Liu.
  2. The scoring function indicates how well the Bayesian network fits the data. Commonly used scoring functions are Bayesian Dirichlet scores such as BDeu or K2 and the Bayesian Information Criterion (BIC, also called MDL).

Four common score-based methods are depicted below, but more details about the Bayesian scoring methods can be found here [11].

  • ExhaustiveSearch, as the name implies, scores every possible DAG and returns the best-scoring one. This search approach is only attractive for very small networks; for larger networks, identifying the optimal structure quickly becomes intractable, and even efficient local optimization algorithms cannot guarantee finding it. Nevertheless, heuristic search strategies often yield good results if only a few nodes are involved (read: fewer than 5 or so).
  • Hillclimbsearch is a heuristic search approach that can be used if more nodes are used. HillClimbSearch implements a greedy local search that starts from the DAG “start” (default: disconnected DAG) and proceeds by iteratively performing single-edge manipulations that maximally increase the score. The search terminates once a local maximum is found.
  • Chow-Liu algorithm is a specific type of tree-based approach. The Chow-Liu algorithm finds the maximum-likelihood tree structure where each node has at most one parent. The complexity can be limited by restricting to tree structures.
  • Tree-augmented Naive Bayes (TAN) algorithm is also a tree-based approach that can be used to model huge data sets involving lots of uncertainties among its various interdependent feature sets [6].

5. Constraint-based Structure Learning

  • Chi-square test. A different but quite straightforward approach to construct a DAG is to identify independencies in the data set using hypothesis tests, such as the chi-square test statistic. This approach relies on statistical tests and conditional hypotheses to learn independence among the variables in the model. The p-value of the chi-square test is the probability of observing a test statistic at least as extreme as the computed one, given the null hypothesis that X and Y are independent given Z. This can be used to make independence judgments at a given level of significance (see the sketch after this list). An example of a constraint-based approach is the PC algorithm, which starts with a complete, fully connected graph and removes edges based on the results of the tests if the nodes are independent, until a stopping criterion is achieved.
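
The sketch below illustrates this kind of conditional independence test on the sprinkler data used later in this article: it tests whether Sprinkler and Rain are independent given Cloudy by running a chi-square test within each stratum of Cloudy. It uses scipy directly rather than bnlearn's internal test machinery, purely to show the idea.

import bnlearn as bn
import pandas as pd
from scipy.stats import chi2_contingency

df = bn.import_example('sprinkler')

# Test X (Sprinkler) against Y (Rain) within each stratum of Z (Cloudy).
# Under the null hypothesis, Sprinkler and Rain are independent given Cloudy.
for z in [0, 1]:
    subset = df[df['Cloudy'] == z]
    table = pd.crosstab(subset['Sprinkler'], subset['Rain'])
    chi2, p, dof, _ = chi2_contingency(table)
    print(f'Cloudy={z}: chi2={chi2:.2f}, p-value={p:.3f}')

# Large p-values support (Sprinkler ⟂ Rain | Cloudy), which is exactly the
# kind of independence statement a constraint-based method uses to remove edges.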

The bnlearn library

A few words about the bnlearn library that is used for all the analyses in this article. bnlearn is a Python package for causal discovery by learning the graphical structure of Bayesian networks, parameter learning, inference, and sampling methods. Because probabilistic graphical models can be difficult to use, bnlearn for Python contains the most-wanted pipelines. The key pipelines are:

  • Structure learning: Given the data, estimate a DAG that captures the dependencies between the variables.
  • Parameter learning: Given the data and DAG, estimate the (conditional) probability distributions of the individual variables.
  • Inference: Given the learned model, determine the exact probability values for your queries.
  • Synthetic Data: Generation of synthetic data.
  • Discretize Data: Discretize continuous data sets.

In this article, I don’t cover synthetic data, but if you want to learn more about data generation, read the blog referenced in [12].

What benefits does bnlearn offer over other Bayesian analysis implementations?

  • Contains the most-wanted Bayesian pipelines.
  • Simple and intuitive in usage.
  • Open-source with MIT Licence.
  • Documentation page and blogs.

Structure Learning.

To learn the fundamentals of causal structure learning, we will start with a small and intuitive example. Suppose you have a sprinkler system in your backyard and, for the last 1000 days, you measured four variables, each with two states: Rain (yes or no), Cloudy (yes or no), Sprinkler system (on or off), and Wet grass (true or false). Based on these four variables and your conception of the real world, you may have an intuition of what the graph should look like, right? If not, it is good that you are reading this article, because with structure learning you will find out!

In the example below, we will import the bnlearn library for Python and load the sprinkler data set. Then we can determine which DAG fits the data best. Note that the sprinkler data set is already cleaned, without missing values, and all values have the state 1 or 0.

# Import bnlearn package
import bnlearn as bn

# Load sprinkler data set
df = bn.import_example('sprinkler')

# Print to screen for illustration
print(df)
'''
+----+----------+-------------+--------+-------------+
|    |   Cloudy |   Sprinkler |   Rain |   Wet_Grass |
+====+==========+=============+========+=============+
|  0 |        0 |           0 |      0 |           0 |
+----+----------+-------------+--------+-------------+
|  1 |        1 |           0 |      1 |           1 |
+----+----------+-------------+--------+-------------+
|  2 |        0 |           1 |      0 |           1 |
+----+----------+-------------+--------+-------------+
| .. |        1 |           1 |      1 |           1 |
+----+----------+-------------+--------+-------------+
|999 |        1 |           1 |      1 |           1 |
+----+----------+-------------+--------+-------------+
'''

# Learn the DAG in data using Bayesian structure learning:
DAG = bn.structure_learning.fit(df)

# print adjacency matrix
print(DAG['adjmat'])
# target     Cloudy  Sprinkler   Rain  Wet_Grass
# source                                        
# Cloudy      False      False   True      False
# Sprinkler    True      False  False       True
# Rain        False      False  False       True
# Wet_Grass   False      False  False      False

# Plot in Python
G = bn.plot(DAG)

# Make interactive plot in HTML
G = bn.plot(DAG, interactive=True)

# Make graphviz plot (can be exported, e.g., to PDF)
bn.plot_graphviz(DAG)
Figure 3: Example of the best DAG for the Sprinkler system. It encodes the following logic: the probability that the grass is wet is dependent on Sprinkler and Rain. The probability that the sprinkler is on is dependent on Cloudy. The probability that it rains is dependent on Cloudy.

That’s it! We have learned the structure, as shown in Figure 3. The detected DAG consists of four nodes that are connected through edges, and each edge indicates a causal relation. The state of Wet grass depends on two nodes, Rain and Sprinkler. The state of Rain is conditioned by Cloudy, and separately, the state of Sprinkler is also conditioned by Cloudy. This DAG represents the (factorized) probability distribution P(S,R,W,C) = P(C) · P(S|C) · P(R|C) · P(W|S,R), where S is the random variable for the sprinkler, R for the rain, W for the wet grass, and C for cloudy.

By examining the graph, you quickly see that the only independent variable in the model is C. The other variables are conditioned on the probability of cloudy, rain, and/or the sprinkler. In general, the joint distribution for a Bayesian network is the product of the conditional probabilities for every node given its parents:

P(X1, X2, …, Xn) = Π_i P(Xi | parents(Xi))

The default setting in bnlearn for structure learning is the hillclimbsearch method and BIC scoring. Notably, different methods and scoring types can be specified. See the examples in the code block below of the various structure learning methods and scoring types in bnlearn:

# 'hc' or 'hillclimbsearch'
model_hc_bic  = bn.structure_learning.fit(df, methodtype='hc', scoretype='bic')
model_hc_k2   = bn.structure_learning.fit(df, methodtype='hc', scoretype='k2')
model_hc_bdeu = bn.structure_learning.fit(df, methodtype='hc', scoretype='bdeu')

# 'ex' or 'exhaustivesearch'
model_ex_bic  = bn.structure_learning.fit(df, methodtype='ex', scoretype='bic')
model_ex_k2   = bn.structure_learning.fit(df, methodtype='ex', scoretype='k2')
model_ex_bdeu = bn.structure_learning.fit(df, methodtype='ex', scoretype='bdeu')

# 'cs' or 'constraintsearch'
model_cs_k2   = bn.structure_learning.fit(df, methodtype='cs', scoretype='k2')
model_cs_bdeu = bn.structure_learning.fit(df, methodtype='cs', scoretype='bdeu')
model_cs_bic  = bn.structure_learning.fit(df, methodtype='cs', scoretype='bic')

# 'cl' or 'chow-liu' (requires setting root_node parameter)
model_cl      = bn.structure_learning.fit(df, methodtype='cl', root_node='Wet_Grass')

Although the detected DAG for the sprinkler data set is insightful and shows the causal dependencies for the variables in the data set, it does not allow you to ask all kinds of questions, such as:

How probable is it to have wet grass given the sprinkler is off?

How probable is it to have a rainy day given the sprinkler is off and it is cloudy?

In the sprinkler data set, it may be evident what the outcome is because of your knowledge about the world and logical thinking. But once you have larger, more complex graphs, it may not be so evident anymore. With so-called inferences, we can answer “what-if-we-did-x” type questions that would normally require controlled experiments and explicit interventions to answer.

To make inferences, we need two ingredients: the DAG and Conditional Probabilistic Tables (CPTs). At this point, we have the data stored in the data frame (df), and we have readily computed the DAG. The CPTs can be computed using Parameter learning, and will describe the statistical relationship between each node and its parents. Keep on reading in the next section about parameter learning, and after that, we can start making inferences.


Parameter learning.

Parameter learning is the task of estimating the values of the Conditional Probability Tables (CPTs). The bnlearn library supports Parameter learning for discrete and continuous nodes:

  • Maximum Likelihood Estimation is a natural estimate by using the relative frequencies with which the variable states have occurred. When estimating parameters for Bayesian networks, lack of data is a frequent problem and the ML estimator has the problem of overfitting to the data. In other words, if the observed data is not representative (or too small) for the underlying distribution, ML estimations can be extremely far off. As an example, if a variable has 3 parents that can each take 10 states, then state counts will be done separately for 10³ = 1000 parent configurations. This can make MLE very fragile for learning Bayesian Network parameters. A way to mitigate MLE’s overfitting is Bayesian Parameter Estimation.
  • Bayesian Estimation starts with readily existing prior CPTs, which express our beliefs about the variables before the data was observed. Those “priors” are then updated using the state counts from the observed data. One can think of the priors as consisting of pseudo-state counts, which are added to the actual counts before normalization. A very simple prior is the so-called K2 prior, which simply adds “1” to the count of every single state. A somewhat more sensible choice of prior is BDeu (Bayesian Dirichlet equivalent uniform prior). A small numeric sketch of the pseudo-count idea follows below.
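
Here is a minimal sketch of the pseudo-count idea, using the Rain/Cloudy counts from the sprinkler data set that we will also compute manually later in this section. Note that the exact prior and its strength used by bnlearn's 'bayes' method depend on its defaults, so the CPTs shown further below will differ from these numbers.

# Counts from the sprinkler data for P(Rain=1 | Cloudy=0)
rain1_and_cloudy0 = 94     # co-occurrences of Rain=1 and Cloudy=0
cloudy0 = 488              # total number of Cloudy=0 observations
n_rain_states = 2

# Plain MLE: relative frequency
mle = rain1_and_cloudy0 / cloudy0
print(round(mle, 4))       # 0.1926

# K2-style prior: add a pseudo-count of 1 to every state before normalizing
k2 = (rain1_and_cloudy0 + 1) / (cloudy0 + n_rain_states)
print(round(k2, 4))        # 0.1939

# A stronger prior (larger pseudo-counts) pulls the estimate toward uniform
alpha = 100
strong = (rain1_and_cloudy0 + alpha) / (cloudy0 + n_rain_states * alpha)
print(round(strong, 4))    # 0.282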

Parameter Learning on the Sprinkler Data set.

We will use the sprinkler data set to learn its parameters. The output of parameter learning is the Conditional Probability Tables (CPTs). To learn parameters, we need a Directed Acyclic Graph (DAG) and a data set with the same variables. The idea is to connect the data set with the DAG. In the previous example, we already computed the DAG (Figure 3). You can use it in this example, or alternatively, you can create your own DAG based on your knowledge of the world! In the example, I will demonstrate how to create your own DAG, which can be based on expert/domain knowledge.

import bnlearn as bn

# Load sprinkler data set
df = bn.import_example('sprinkler')

# The edges can be created using the available variables.
print(df.columns)
# ['Cloudy', 'Sprinkler', 'Rain', 'Wet_Grass']

# Define the causal dependencies based on your expert/domain knowledge.
# Left is the source, and right is the target node.
edges = [('Cloudy', 'Sprinkler'),
         ('Cloudy', 'Rain'),
         ('Sprinkler', 'Wet_Grass'),
         ('Rain', 'Wet_Grass')]

# Create the DAG. If no CPTs are provided, bnlearn will auto-generate placeholder CPTs.
DAG = bn.make_DAG(edges)

# Plot the DAG. This is identical as shown in Figure 3
bn.plot(DAG)

# Parameter learning on the user-defined DAG and input data using maximumlikelihood
model = bn.parameter_learning.fit(DAG, df, methodtype='ml')

# Print the learned CPDs
bn.print_CPD(model)

"""
[bnlearn] >[Conditional Probability Table (CPT)] >[Node Sprinkler]:
+--------------+--------------------+------------+
| Cloudy       | Cloudy(0)          | Cloudy(1)  |
+--------------+--------------------+------------+
| Sprinkler(0) | 0.4610655737704918 | 0.91015625 |
+--------------+--------------------+------------+
| Sprinkler(1) | 0.5389344262295082 | 0.08984375 |
+--------------+--------------------+------------+

[bnlearn] >[Conditional Probability Table (CPT)] >[Node Rain]:
+---------+---------------------+-------------+
| Cloudy  | Cloudy(0)           | Cloudy(1)   |
+---------+---------------------+-------------+
| Rain(0) | 0.8073770491803278  | 0.177734375 |
+---------+---------------------+-------------+
| Rain(1) | 0.19262295081967212 | 0.822265625 |
+---------+---------------------+-------------+

[bnlearn] >[Conditional Probability Table (CPT)] >[Node Wet_Grass]:
+--------------+--------------+-----+----------------------+
| Rain         | Rain(0)      | ... | Rain(1)              |
+--------------+--------------+-----+----------------------+
| Sprinkler    | Sprinkler(0) | ... | Sprinkler(1)         |
+--------------+--------------+-----+----------------------+
| Wet_Grass(0) | 1.0          | ... | 0.023529411764705882 |
+--------------+--------------+-----+----------------------+
| Wet_Grass(1) | 0.0          | ... | 0.9764705882352941   |
+--------------+--------------+-----+----------------------+

[bnlearn] >[Conditional Probability Table (CPT)] >[Node Cloudy]:
+-----------+-------+
| Cloudy(0) | 0.488 |
+-----------+-------+
| Cloudy(1) | 0.512 |
+-----------+-------+

[bnlearn] >Independencies:
(Rain ⟂ Sprinkler | Cloudy)
(Sprinkler ⟂ Rain | Cloudy)
(Wet_Grass ⟂ Cloudy | Rain, Sprinkler)
(Cloudy ⟂ Wet_Grass | Rain, Sprinkler)
[bnlearn] >Nodes: ['Cloudy', 'Sprinkler', 'Rain', 'Wet_Grass']
[bnlearn] >Edges: [('Cloudy', 'Sprinkler'), ('Cloudy', 'Rain'), ('Sprinkler', 'Wet_Grass'), ('Rain', 'Wet_Grass')]
"""

If you reached this point, you have computed the CPTs based on the DAG and the input data set df using Maximum Likelihood Estimation (MLE) (Figure 4). Note that the CPTs are included in Figure 4 for clarity purposes.

Figure 4: CPTs are derived with Parameter learning using Maximum Likelihood Estimation.

Computing the CPTs manually using MLE is straightforward; let me demonstrate this by example by computing the CPTs manually for the nodes Cloudy and Rain.

# Examples to illustrate how to manually compute MLE for the node Cloudy and Rain:

# Compute CPT for the Cloudy Node:
# This node has no conditional dependencies and can easily be computed as following:

# P(Cloudy=0)
sum(df['Cloudy']==0) / df.shape[0] # 0.488

# P(Cloudy=1)
sum(df['Cloudy']==1) / df.shape[0] # 0.512

# Compute CPT for the Rain Node:
# This node has a conditional dependency from Cloudy and can be computed as following:

# P(Rain=0 | Cloudy=0)
sum( (df['Cloudy']==0) & (df['Rain']==0) ) / sum(df['Cloudy']==0) # 394/488 = 0.807377049

# P(Rain=1 | Cloudy=0)
sum( (df['Cloudy']==0) & (df['Rain']==1) ) / sum(df['Cloudy']==0) # 94/488  = 0.192622950

# P(Rain=0 | Cloudy=1)
sum( (df['Cloudy']==1) & (df['Rain']==0) ) / sum(df['Cloudy']==1) # 91/512  = 0.177734375

# P(Rain=1 | Cloudy=1)
sum( (df['Cloudy']==1) & (df['Rain']==1) ) / sum(df['Cloudy']==1) # 421/512 = 0.822265625

Note that conditional probabilities can be based on limited data points. As an example, P(Rain=0|Cloudy=1) is based on 91 observations. If Rain had more than two states and/or more dependencies, this number would be even lower. Is more data the solution? Maybe. Maybe not. Just keep in mind that even if the total sample size is very large, the fact that state counts are made separately for each parent configuration can cause fragmentation. Now compare the CPTs learned with the Bayesian approach below against those from the MLE approach.

# Parameter learning on the user-defined DAG and input data using Bayes
model_bayes = bn.parameter_learning.fit(DAG, df, methodtype='bayes')

# Print the learned CPDs
bn.print_CPD(model_bayes)

"""
[bnlearn] >Compute structure scores for model comparison (higher is better).
[bnlearn] >[Conditional Probability Table (CPT)] >[Node Sprinkler]:
+--------------+--------------------+--------------------+
| Cloudy       | Cloudy(0)          | Cloudy(1)          |
+--------------+--------------------+--------------------+
| Sprinkler(0) | 0.4807692307692308 | 0.7075098814229249 |
+--------------+--------------------+--------------------+
| Sprinkler(1) | 0.5192307692307693 | 0.2924901185770751 |
+--------------+--------------------+--------------------+

[bnlearn] >[Conditional Probability Table (CPT)] >[Node Rain]:
+---------+--------------------+---------------------+
| Cloudy  | Cloudy(0)          | Cloudy(1)           |
+---------+--------------------+---------------------+
| Rain(0) | 0.6518218623481782 | 0.33695652173913043 |
+---------+--------------------+---------------------+
| Rain(1) | 0.3481781376518219 | 0.6630434782608695  |
+---------+--------------------+---------------------+

[bnlearn] >[Conditional Probability Table (CPT)] >[Node Wet_Grass]:
+--------------+--------------------+-----+---------------------+
| Rain         | Rain(0)            | ... | Rain(1)             |
+--------------+--------------------+-----+---------------------+
| Sprinkler    | Sprinkler(0)       | ... | Sprinkler(1)        |
+--------------+--------------------+-----+---------------------+
| Wet_Grass(0) | 0.7553816046966731 | ... | 0.37910447761194027 |
+--------------+--------------------+-----+---------------------+
| Wet_Grass(1) | 0.2446183953033268 | ... | 0.6208955223880597  |
+--------------+--------------------+-----+---------------------+

[bnlearn] >[Conditional Probability Table (CPT)] >[Node Cloudy]:
+-----------+-------+
| Cloudy(0) | 0.494 |
+-----------+-------+
| Cloudy(1) | 0.506 |
+-----------+-------+

[bnlearn] >Independencies:
(Rain ⟂ Sprinkler | Cloudy)
(Sprinkler ⟂ Rain | Cloudy)
(Wet_Grass ⟂ Cloudy | Rain, Sprinkler)
(Cloudy ⟂ Wet_Grass | Rain, Sprinkler)
[bnlearn] >Nodes: ['Cloudy', 'Sprinkler', 'Rain', 'Wet_Grass']
[bnlearn] >Edges: [('Cloudy', 'Sprinkler'), ('Cloudy', 'Rain'), ('Sprinkler', 'Wet_Grass'), ('Rain', 'Wet_Grass')]
"""

Inferences.

Making inferences requires the Bayesian network to have two main components: a Directed Acyclic Graph (DAG) that describes the structure of the data, and Conditional Probability Tables (CPTs) that describe the statistical relationship between each node and its parents. At this point, you have the data set, you computed the DAG using structure learning, and estimated the CPTs using parameter learning. You can now make inferences! For more details about inferences, I recommend reading the blog referenced in [11].

With inference, we marginalize variables in a procedure called variable elimination. Variable elimination is an exact inference algorithm. It can also be used to figure out the state of the network that has maximum probability by simply replacing the sums with max functions. Its downside is that, for large BNs, it might be computationally intractable. Approximate inference algorithms such as Gibbs sampling or rejection sampling can be used in those cases [7]. See the code block below to make inferences and answer the questions posed earlier:

import bnlearn as bn

# Load sprinkler data set
df = bn.import_example('sprinkler')

# Define the causal dependencies based on your expert/domain knowledge.
# Left is the source, and right is the target node.
edges = [('Cloudy', 'Sprinkler'),
         ('Cloudy', 'Rain'),
         ('Sprinkler', 'Wet_Grass'),
         ('Rain', 'Wet_Grass')]

# Create the DAG
DAG = bn.make_DAG(edges)

# Parameter learning on the user-defined DAG and input data using Bayes to estimate the CPTs
model = bn.parameter_learning.fit(DAG, df, methodtype='bayes')
bn.print_CPD(model)

q1 = bn.inference.fit(model, variables=['Wet_Grass'], evidence={'Sprinkler':0})
[bnlearn] >Variable Elimination.
+----+-------------+----------+
|    |   Wet_Grass |        p |
+====+=============+==========+
|  0 |           0 | 0.486917 |
+----+-------------+----------+
|  1 |           1 | 0.513083 |
+----+-------------+----------+

Summary for variables: ['Wet_Grass']
Given evidence: Sprinkler=0

Wet_Grass outcomes:
- Wet_Grass: 0 (48.7%)
- Wet_Grass: 1 (51.3%)

The answer to the question is: P(Wet_grass=1 | Sprinkler=0) = 0.51. Let’s try another one:


q2 = bn.inference.fit(model, variables=['Rain'], evidence={'Sprinkler':0, 'Cloudy':1})
[bnlearn] >Variable Elimination.
+----+--------+----------+
|    |   Rain |        p |
+====+========+==========+
|  0 |      0 | 0.336957 |
+----+--------+----------+
|  1 |      1 | 0.663043 |
+----+--------+----------+

Summary for variables: ['Rain']
Given evidence: Sprinkler=0, Cloudy=1

Rain outcomes:
- Rain: 0 (33.7%)
- Rain: 1 (66.3%)

The answer to the question is: P(Rain=1 | Sprinkler=0, Cloudy=1) = 0.663. Inferences can also be made for multiple variables; see the code block below.

# Inferences with two or more variables can also be made such as:
q3 = bn.inference.fit(model, variables=['Wet_Grass','Rain'], evidence={'Sprinkler':1})
[bnlearn] >Variable Elimination.
+----+-------------+--------+----------+
|    |   Wet_Grass |   Rain |        p |
+====+=============+========+==========+
|  0 |           0 |      0 | 0.181137 |
+----+-------------+--------+----------+
|  1 |           0 |      1 | 0.17567  |
+----+-------------+--------+----------+
|  2 |           1 |      0 | 0.355481 |
+----+-------------+--------+----------+
|  3 |           1 |      1 | 0.287712 |
+----+-------------+--------+----------+

Summary for variables: ['Wet_Grass', 'Rain']
Given evidence: Sprinkler=1

Wet_Grass outcomes:
- Wet_Grass: 0 (35.7%)
- Wet_Grass: 1 (64.3%)

Rain outcomes:
- Rain: 0 (53.7%)
- Rain: 1 (46.3%)

The answer to the question is: P(Rain=1, Wet_grass=1 | Sprinkler=1) = 0.287712.
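
As a quick sanity check, the marginal summaries printed above follow directly from the joint table by summing out the other variable (pure arithmetic on the numbers shown):

P(Wet_Grass=1 | Sprinkler=1) = 0.355481 + 0.287712 ≈ 0.643 (the 64.3% in the summary)
P(Rain=0 | Sprinkler=1) = 0.181137 + 0.355481 ≈ 0.537 (the 53.7% in the summary)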


How do I know my causal model is right?

If you solely used data to compute the causal diagram, it is hard to fully verify the validity and completeness of your causal diagram. Causal models are also models, and different approaches (such as scoring and search methods) will therefore result in different output variations. However, some solutions can help to build more trust in the causal network. For example, it may be possible to empirically test certain conditional independence or dependence relationships implied by the model between sets of variables. If these implied independencies do not hold in the data, it is an indication that the causal model is incorrect [8]. Alternatively, prior expert knowledge can be added, such as a DAG or CPTs, to get more trust in the model when making inferences.


Discussion

In this article, I touched on the concepts of why correlation or association is not causation and how to go from data toward a causal model using structure learning. A summary of the advantages of Bayesian techniques:

  1. The outcome of posterior probability distributions, or the graph, allows the user to make a judgment on the model predictions instead of having a single value as an outcome.
  2. The possibility to incorporate domain/expert knowledge in the DAG and reason with incomplete information and missing data. This is possible because Bayes’ theorem is built on updating the prior term with evidence.
  3. It has a notion of modularity.
  4. A complex system is built by combining simpler parts.
  5. Graph theory provides an intuitive way to represent highly interacting sets of variables.
  6. Probability theory provides the glue to combine the parts.

A weakness of Bayesian networks, on the other hand, is that finding the optimal DAG is computationally expensive, since an exhaustive search over all possible structures must be performed. The node limit for exhaustive search can already be around 15 nodes, depending also on the number of states per variable. If you have a large data set with many nodes, you may want to consider alternative methods and choose the scoring function and search algorithm accordingly. For very large data sets, with hundreds or maybe even thousands of variables, tree-based or constraint-based approaches are often necessary, together with black/whitelisting of variables (see the sketch below). Some of these approaches first determine a node ordering and then find the optimal BN structure for that ordering. Determining causality can be a challenging task, but the bnlearn library is designed to tackle some of these challenges! We have come to the end, and I hope you enjoyed and learned a lot reading this article!
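
For completeness, here is a hedged sketch of how black/whitelisting of variables can look in bnlearn. The parameter names (white_list, black_list, bw_list_method) follow the bnlearn documentation at the time of writing and are an assumption on my part; check the documentation of your installed version.

import bnlearn as bn

df = bn.import_example('sprinkler')

# Restrict the search space to a whitelist of nodes (assumed parameter names)
model_wl = bn.structure_learning.fit(df, methodtype='hc',
                                     white_list=['Sprinkler', 'Rain', 'Wet_Grass'],
                                     bw_list_method='nodes')

# Or exclude specific nodes with a blacklist (assumed parameter names)
model_bl = bn.structure_learning.fit(df, methodtype='hc',
                                     black_list=['Cloudy'],
                                     bw_list_method='nodes')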

Be safe. Stay frosty.

Cheers, E.




References

  1. McLeod, S. A, Correlation definitions, examples & interpretation. Simply Psychology, 2018, January 14
  2. F. Dablander, An Introduction to Causal Inference, Department of Psychological Methods, University of Amsterdam, https://psyarxiv.com/b3fkw
  3. Brittany Davis, When Correlation is Better than Causation, Medium, 2021
  4. Paul Gingrich, Measures of association. Page 766–795
  5. Taskesen E, Association ruled based networks using graphical Hypergeometric Networks. [Software]
  6. Branislav Holländer, Introduction to Probabilistic Graphical Models, Medium, 2020
  7. Harini Padmanaban, Comparative Analysis of Naive Bayes and Tree Augmented Naive Bayes Models, San Jose State University, 2014
  8. Huszar. F, ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus
  9. AI4I 2020 Predictive Maintenance Data set. (2020). UCI Machine Learning Repository. Licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0).
  10. E. Perrier et al, Finding Optimal Bayesian Network Given a Super-Structure, Journal of Machine Learning Research 9 (2008) 2251–2286.
  11. Taskesen E, Prescriptive Modeling Unpacked: A Complete Guide to Intervention With Bayesian Modeling. June. 2025, Towards Data Science (TDS)
  12. Taskesen E, How to Generate Synthetic Data: A Comprehensive Guide Using Bayesian Sampling and Univariate Distributions. May. 2025, Towards Data Science (TDS)
