Why you should read this article
Too often, data scientists whip up a Jupyter Notebook, play around in some cells, and then maintain entire data processing and model training pipelines in that same notebook.
The code is tested once, when the notebook is first written, and then neglected for some undetermined amount of time – days, weeks, months, even years – until:
- The notebook needs to be rerun to re-generate outputs that were lost.
- The notebook needs to be rerun with different parameters to retrain a model.
- Something changes upstream, and the notebook needs to be rerun to refresh downstream datasets.
Many of you will have felt shivers down your spine reading this…
Why?
Because you instinctively know that this notebook is never going to run.
You know it in your bones: the code in that notebook will need to be debugged for hours at best, re-written from scratch at worst.
In both cases, it will take you a long time to get what you need.
Why does this happen?
Is there any way of avoiding this?
Is there a better way of writing and maintaining code?
These are the questions we will be answering in this article.
The Solution: Automated Testing
What is it?
As the name suggests, automated testing is the process of running a predefined set of tests on your code to ensure that it is working as expected.
These tests verify that your code behaves as expected, especially after changes or additions, and alert you when something breaks. Automated testing removes the need for a human to manually test your code, and there is no need to run it on actual data.
Convenient, isn’t it?
Types of Automated Testing
There are so many different types of testing, and covering all of them is beyond the scope of this article.
Let’s just focus on the two main types most relevant to a data scientist:
- Unit Tests
- Integration Tests
Unit Tests
Tests the smallest parts of code in isolation (e.g., a function).
The function should do one thing only to make it easy to test. Give it a known input, and check that the output is as expected.
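Here is a minimal sketch of what that might look like, using a hypothetical add_total_price helper (the function and values are purely illustrative, not part of the pipeline we will see later):

import pandas as pd

def add_total_price(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add a total_price column from quantity * price."""
    df = df.copy()
    df["total_price"] = df["quantity"] * df["price"]
    return df

def test_add_total_price():
    # Known input
    df = pd.DataFrame({"quantity": [2, 3], "price": [10.0, 1.5]})
    # Run the function under test
    result = add_total_price(df)
    # Check the output is as expected
    assert result["total_price"].tolist() == [20.0, 4.5]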
Integration Tests

Tests how multiple components work together.
For us data scientists, it means checking whether data loading, merging, and preprocessing steps produce the expected final dataset, given a known input dataset.
A practical example
Enough with the theory, let’s see how it works in practice.
We will go through a simple example where a data scientist has written some code in a Jupyter notebook (or script), one that many data scientists will have seen in their jobs.
We will pick up on why the code is bad. Then, we’ll try and make it better.
By better, we mean:
- Easy to test
- Easy to read
which ultimately means easy to maintain, because in the long run, good code is code that works, keeps working, and is easy to maintain.
We will then design some unit tests for our improved code, highlighting why the changes are beneficial for testing. To prevent this article from becoming too long, I will defer examples of integration testing to a future article.
Then, we will go through some rules of thumb for what code to test.
Finally, we will cover how to run tests and how to structure projects.

Example Pipeline
We will use the following pipeline as an example:
# bad_pipeline.py
import pandas as pd
# Load data
df1 = pd.read_csv("data/users.csv")
df2 = pd.read_parquet("data/transactions.parquet")
df3 = pd.read_parquet("data/products.parquet")
# Preprocessing
# Merge user and transaction data
df = df2.merge(df1, how='left', on='user_id')
# Merge with product data
df = df.merge(df3, how='left', on='product_id')
# Filter for recent transactions
df = df[df['transaction_date'] > '2023-01-01']
# Calculate total price
df['total_price'] = df['quantity'] * df['price']
# Create customer segment
df['segment'] = df['total_price'].apply(lambda x: 'high' if x > 100 else 'low')
# Drop unnecessary columns
df = df.drop(['user_email', 'product_description', 'price'], axis=1)
# Group by user and segment to get total amount spent
df = df.groupby(['user_id', 'segment']).agg({'total_price': 'sum'}).reset_index()
# Save output
df.to_parquet("data/final_output.parquet")
In real life, we would see hundreds of lines of code crammed into a single notebook. But this script illustrates all the things that typically need fixing in data science notebooks.
This code is doing the following:
- Loads user, transaction, and product data.
- Merges them into a unified dataset.
- Filters recent transactions.
- Adds calculated fields (total_price, segment).
- Drops irrelevant columns.
- Aggregates total spending per user and segment.
- Saves the result as a Parquet file.
Why is this pipeline bad?
Oh, there are so many reasons coding in this manner is bad, depending on which lens you look at it through. It's not the content that is the problem, but how it is structured.
While there are many angles from which we could discuss the disadvantages of writing code this way, for this article we will focus on testability.
1. Tightly coupled logic (in other words, no modularity)
All operations are crammed into a single script and run at once. It's unclear what each part does unless you read every line. Even for a script this simple, that is difficult to do. In real-life scripts, which can reach hundreds of lines, it only gets worse.
This makes it impossible to test.
The only way to do so would be to run the entire thing all at once from start to finish, probably on actual data that you’re going to use.
If your dataset is small, then perhaps you can get away with this. But in most cases, data scientists are working with a truck-load of data, so it’s infeasible to run any form of a test or sanity check quickly.
We need to be able to break the code up into manageable chunks that do one thing only, and do it well. Then, we can control what goes in, and confirm that what we expect comes out of it.
2. No Parameterization
Hardcoded file paths and values like 2023-01-01 make the code brittle and inflexible. Again, it is hard to test with anything but the live/production data.
There’s no flexibility in how we can run the code, everything is fixed.
What’s worse, as soon as you change something, you have no assurance that nothing’s broken further down the script.
For example, how many times have you made a change that you thought was benign, only to run the code and find a completely unexpected part of it break?
How to improve?
Now, let’s see step-by-step how we can improve this code.
Please note, we will assume that we are using the pytest module for our tests going forwards.
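If you don't already have pytest installed, it can be added to your environment with pip:

pip install pytest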
1. A clear, configurable entry point
def run_pipeline(
    user_path: str,
    transaction_path: str,
    product_path: str,
    output_path: str,
    cutoff_date: str = '2023-01-01'
):
    # Load data
    ...
    # Process data
    ...
    # Save result
    ...
We start off by creating a single function that we can run from anywhere, with clear arguments that can be changed.
What does this achieve?
This allows us to run the pipeline in specific test conditions.
# GIVEN SOME TEST DATA
test_args = dict(
    user_path="/fake_users.csv",
    transaction_path="/fake_transaction.parquet",
    product_path="/fake_products.parquet",
    output_path="/fake_output.parquet",
    cutoff_date="",
)

# RUN THE PIPELINE THAT'S TO BE TESTED
run_pipeline(**test_args)

# TEST THE OUTPUT IS AS EXPECTED
output = ...           # e.g. read back the file written to output_path
expected_output = ...  # the output we expect the pipeline to produce
assert output == expected_output
Immediately, you can start passing in different inputs, different parameters, depending on the edge case that you want to test for.
It gives you flexibility to run the code in different settings by making it easier to control the inputs and outputs of your code.
Writing your pipeline in this way paves the way for integration testing your pipeline. More on this in a later article.
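As a quick preview (a sketch only: it assumes the fully implemented run_pipeline from later in this article is importable, that pytest's built-in tmp_path fixture is used, and that a parquet engine such as pyarrow is installed), an integration-style test of the whole pipeline could look roughly like this:

import pandas as pd

from src.pipelines.data_processing import run_pipeline  # hypothetical import path; adjust to wherever run_pipeline lives

def test_run_pipeline_end_to_end(tmp_path):
    # GIVEN tiny, hand-made input files written to a temporary directory
    user_path = tmp_path / "users.csv"
    transaction_path = tmp_path / "transactions.parquet"
    product_path = tmp_path / "products.parquet"
    output_path = tmp_path / "output.parquet"

    pd.DataFrame({"user_id": [1], "name": ["John"]}).to_csv(user_path, index=False)
    pd.DataFrame({
        "user_id": [1],
        "product_id": [1],
        "transaction_date": ["2023-06-01"],
        "quantity": [2],
    }).to_parquet(transaction_path)
    pd.DataFrame({"product_id": [1], "price": [60.0]}).to_parquet(product_path)

    # RUN the whole pipeline on the fake data
    run_pipeline(
        str(user_path),
        str(transaction_path),
        str(product_path),
        str(output_path),
        cutoff_date="2023-01-01",
    )

    # TEST the saved output contains what we expect
    output = pd.read_parquet(output_path)
    assert output.loc[0, "total_price"] == 120.0  # 2 * 60.0
    assert output.loc[0, "segment"] == "high"     # 120 > 100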
2. Group code into meaningful chunks that do one thing, and do it well
Now, this is where a bit of art comes in – different people will organise code differently depending on which parts they find important.
There is no right or wrong answer, but the common-sense rule is to make sure a function does one thing and does it well. Do this, and it becomes easy to test.
One way we could group our code is like below:
def load_data(user_path: str, transaction_path: str, product_path: str):
    """Load data from specified paths"""
    df1 = pd.read_csv(user_path)
    df2 = pd.read_parquet(transaction_path)
    df3 = pd.read_parquet(product_path)
    return df1, df2, df3

def create_user_product_transaction_dataset(
    user_df: pd.DataFrame,
    transaction_df: pd.DataFrame,
    product_df: pd.DataFrame
):
    """Merge user, transaction, and product data into a single dataset.

    The dataset identifies which user bought what product at what time and price.

    Args:
        user_df (pd.DataFrame):
            A dataframe containing user information. Must have column
            'user_id' that uniquely identifies each user.
        transaction_df (pd.DataFrame):
            A dataframe containing transaction information. Must have
            columns 'user_id' and 'product_id' that are foreign keys
            to the user and product dataframes, respectively.
        product_df (pd.DataFrame):
            A dataframe containing product information. Must have
            column 'product_id' that uniquely identifies each product.

    Returns:
        A dataframe that merges the user, transaction, and product data
        into a single dataset.
    """
    df = transaction_df.merge(user_df, how='left', on='user_id')
    df = df.merge(product_df, how='left', on='product_id')
    return df

def drop_unnecessary_date_period(df: pd.DataFrame, cutoff_date: str):
    """Drop transactions that happened before the cutoff date.

    Note:
        Anything before the cutoff date can be dropped because of .

    Args:
        df (pd.DataFrame): A dataframe with a column `transaction_date`
        cutoff_date (str): A date in the format 'yyyy-MM-dd'

    Returns:
        A dataframe with the transactions that happened after the cutoff date
    """
    df = df[df['transaction_date'] > cutoff_date]
    return df

def compute_secondary_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute secondary features.

    Args:
        df (pd.DataFrame): A dataframe with columns `quantity` and `price`

    Returns:
        A dataframe with columns `total_price` and `segment` added to it.
    """
    df['total_price'] = df['quantity'] * df['price']
    df['segment'] = df['total_price'].apply(lambda x: 'high' if x > 100 else 'low')
    return df
What does the grouping achieve?
Better documentation
Well, first of all, you end up with some natural real estate in your code to add docstrings. Why is this important? Well, have you tried reading your own code a month after writing it?
People forget details very quickly, and even code *you’ve* written can become undecipherable within just a few days.
It’s essential to document what the code is doing, what it expects to take as input, and what it returns, at the very least.
Including docstrings in your code provides context and sets expectations for how a function should behave, making it easier to understand and debug failing tests in the future.
Better Readability
By ‘encapsulating’ the complexity of your code into smaller functions, you can make it easier to read and understand the overall flow of a pipeline without having to read every single line of code.
def run_pipeline(
    user_path: str,
    transaction_path: str,
    product_path: str,
    output_path: str,
    cutoff_date: str
):
    user_df, transaction_df, product_df = load_data(
        user_path,
        transaction_path,
        product_path
    )
    df = create_user_product_transaction_dataset(
        user_df,
        transaction_df,
        product_df
    )
    df = drop_unnecessary_date_period(df, cutoff_date)
    df = compute_secondary_features(df)
    df.to_parquet(output_path)
You've provided the reader with a hierarchy of information: a step-by-step breakdown of what's happening in the run_pipeline function, conveyed through meaningful function names.
The reader then has the choice of looking at the function definition and the complexity within, depending on their needs.
The act of combining code into 'meaningful' chunks like this demonstrates the concepts of 'encapsulation' and 'abstraction'.
For more details on encapsulation, you can read my article on this here.
Smaller packets of code to test
Next, we have a very specific, well-defined set of functions that do one thing. This makes it easier to test and debug, since we only have one thing to worry about.
See below on how we construct a test.
Constructing a Unit Test
1. Follow the AAA Pattern
def test_create_user_product_transaction_dataset():
    # GIVEN
    # RUN
    # TEST
    ...
Firstly, we define a test function, appropriately prefixed with test_.
Then, we divide it into three sections:
- GIVEN: the inputs to the function, and the expected output. Set up everything required to run the function we want to test.
- RUN: run the function given the inputs.
- TEST: compare the output of the function to the expected output.
This is a generic pattern that unit tests should follow. The standard name for this design pattern is the 'AAA pattern', which stands for Arrange, Act, Assert.
I don't find this naming intuitive, which is why I use GIVEN, RUN, TEST.
2. Arrange: set up the test
# GIVEN
user_df = pd.DataFrame({
    'user_id': [1, 2, 3], 'name': ["John", "Jane", "Bob"]
})
transaction_df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'product_id': [1, 1, 2],
    'extra-column1-str': ['1', '2', '3'],
    'extra-column2-int': [4, 5, 6],
    'extra-column3-float': [1.1, 2.2, 3.3],
})
product_df = pd.DataFrame({
    'product_id': [1, 2], 'product_name': ["apple", "banana"]
})
expected_df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'product_id': [1, 1, 2],
    'extra-column1-str': ['1', '2', '3'],
    'extra-column2-int': [4, 5, 6],
    'extra-column3-float': [1.1, 2.2, 3.3],
    'name': ["John", "Jane", "Bob"],
    'product_name': ["apple", "apple", "banana"],
})
Secondly, we define the inputs to the function, and the expected output. This is where we bake in our expectations about what the inputs will look like, and what the output should look like.
As you can see, we don't need to define every single column that the real data will contain, only the ones that matter for the test.
For example, transaction_df defines the user_id and product_id columns properly, whilst also adding three columns of different types (str, int, float) to simulate the fact that there will be other columns.
The same goes for product_df and user_df, though these are expected to be dimension tables, so just defining the name and product_name columns will suffice.
3. Act: Run the function to test
# RUN
output_df = create_user_product_transaction_dataset(
    user_df, transaction_df, product_df
)
Thirdly, we run the function with the inputs we defined, and collect the output.
4. Assert: Test the outcome is as expected
# TEST
pd.testing.assert_frame_equal(
    output_df,
    expected_df
)
And finally, we check whether the output matches the expected output.
Note, we use the pandas testing module since we're comparing pandas dataframes. For non-pandas outputs, you can use a plain assert statement instead.
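For example, if a step returned a plain dictionary of summary statistics instead of a dataframe (a hypothetical illustration), a bare assert is enough:

# Hypothetical non-dataframe output
output = {"n_rows": 3, "n_users": 3}
expected_output = {"n_rows": 3, "n_users": 3}
assert output == expected_output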
The full testing code will look like this:
import pandas as pd

def test_create_user_product_transaction_dataset():
    # GIVEN
    user_df = pd.DataFrame({
        'user_id': [1, 2, 3], 'name': ["John", "Jane", "Bob"]
    })
    transaction_df = pd.DataFrame({
        'user_id': [1, 2, 3],
        'product_id': [1, 1, 2],
        'transaction_date': ["2021-01-01", "2021-01-01", "2021-01-01"],
        'extra-column1': [1, 2, 3],
        'extra-column2': [4, 5, 6],
    })
    product_df = pd.DataFrame({
        'product_id': [1, 2], 'product_name': ["apple", "banana"]
    })
    expected_df = pd.DataFrame({
        'user_id': [1, 2, 3],
        'product_id': [1, 1, 2],
        'transaction_date': ["2021-01-01", "2021-01-01", "2021-01-01"],
        'extra-column1': [1, 2, 3],
        'extra-column2': [4, 5, 6],
        'name': ["John", "Jane", "Bob"],
        'product_name': ["apple", "apple", "banana"],
    })

    # RUN
    output_df = create_user_product_transaction_dataset(
        user_df, transaction_df, product_df
    )

    # TEST
    pd.testing.assert_frame_equal(
        output_df,
        expected_df
    )
To organise your tests better and make them cleaner, you can start using a combination of classes, fixtures, and parametrisation.
It's beyond the scope of this article to delve into each of these concepts in detail, so for those who are interested, I provide the pytest How-To guide as a reference for these concepts.
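As a small taste, here is a sketch of a fixture and a parametrised test built around the compute_secondary_features function defined earlier (the fixture name and values are illustrative, and it assumes the function is importable from your src package):

import pandas as pd
import pytest

@pytest.fixture
def small_transactions() -> pd.DataFrame:
    """A tiny, reusable input dataframe shared by the tests below."""
    return pd.DataFrame({"quantity": [1, 2], "price": [30.0, 80.0]})

def test_total_price_is_added(small_transactions):
    # The fixture is injected by pytest via the argument name
    result = compute_secondary_features(small_transactions)
    assert result["total_price"].tolist() == [30.0, 160.0]

@pytest.mark.parametrize(
    "price, expected_segment",
    [(50.0, "low"), (150.0, "high")],  # one case either side of the 100 threshold
)
def test_segment_threshold(price, expected_segment):
    # The same test body runs once per parameter pair
    df = pd.DataFrame({"quantity": [1], "price": [price]})
    result = compute_secondary_features(df)
    assert result.loc[0, "segment"] == expected_segment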

What to Test?
Now that we've created a unit test for one function, we turn our attention to the remaining functions. Astute readers will now be thinking:
“Wow, do I have to write a test for everything? That’s a lot of work!”
Yes, it’s true. It’s extra code that you need to write and maintain.
The good news is that it's not necessary to test absolutely everything; you just need to know what's important in the context of what your work is doing.
Below, I’ll give you a few rules of thumb and considerations that I make when deciding what to test, and why.
1. Is the code critical for the outcome of the project?
There are critical junctures that are pivotal to the success of a data science project, many of which come at the data-preparation and model evaluation/explanation stages.
The test we saw above on the create_user_product_transaction_dataset function is a good example.
This dataset will form the basis of all downstream modelling activity.
If the user -> product join is incorrect in any way, then it will impact everything we do downstream.
Thus, it's worth taking the time to ensure this code works correctly.
At a bare minimum, the test we’ve established makes sure the function is behaving in exactly the same way as it used to after every code change.
Example
Suppose the join needs to be rewritten to improve memory efficiency.
After making the change, the unit test ensures the output remains the same.
If something was inadvertently altered such that the output started to look different (missing rows, columns, different datatypes), the test would immediately flag the issue.
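As an illustration, here is one possible sketch of such a rewrite, swapping the two merge calls for joins against indexed dimension tables. Whether this actually saves memory depends on your data; the point is that the existing unit test, rather than manual inspection, confirms the output is unchanged:

import pandas as pd

def create_user_product_transaction_dataset(
    user_df: pd.DataFrame,
    transaction_df: pd.DataFrame,
    product_df: pd.DataFrame,
) -> pd.DataFrame:
    """Same contract as before, rewritten to join against indexed dimension tables."""
    df = transaction_df.join(user_df.set_index("user_id"), on="user_id")
    df = df.join(product_df.set_index("product_id"), on="product_id")
    return df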
2. Is the code mainly using third-party libraries?
Take the load_data function, for example:
def load_data(user_path: str, transaction_path: str, product_path: str):
    """Load data from specified paths"""
    df1 = pd.read_csv(user_path)
    df2 = pd.read_parquet(transaction_path)
    df3 = pd.read_parquet(product_path)
    return df1, df2, df3
This function encapsulates the process of reading data from different files. Under the hood, all it does is call three pandas load functions.
The main value of this code is the encapsulation.
Meanwhile, it doesn’t have any business logic, and in my opinion, the function scope is so specific that you wouldn’t expect any logic to be added in the future.
If any logic is added, then the function name should be changed, as it would be doing more than just loading data.
Therefore, this function does not require a unit test.
A unit test for this function would just be testing that pandas works properly, and we should be able to trust that pandas has tested their own code.
3. Is the code likely to change over time?
This point has already been implied in 1 & 2. For maintainability, perhaps this is the most important consideration.
You should be thinking:
- How complex is the code? Are there many ways to achieve the same output?
- What could cause someone to alter this code? Is the data source susceptible to changes in the future?
- Is the code clear? Are there behaviours that could be easily overlooked during a refactor?
Take create_user_product_transaction_dataset for example.
- The input data may have changes to their schema in the future.
- Perhaps the dataset becomes larger, and we need to break up the merge into multiple steps for performance reasons.
- Perhaps a dirty hack needs to go in temporarily to handle nulls due to an issue with the data source.
In each case, a change to the underlying code may be necessary, and each time we need to ensure the output doesn’t change.
In contrast, load_data does nothing but load data from files.
I don’t see this changing much in the future, other than perhaps a change in file format. So I’d defer writing a test for this until a significant change to the upstream data source occurs (something like this would most likely require changing a lot of the pipeline).
Where to Put Tests and How to Run Them
So far, we’ve covered how to write testable code and how to create the tests themselves.
Now, let’s look at how to structure your project to include tests — and how to run them effectively.
Project Structure
Generally, a data science project can follow the below structure:
|-- data # where data is stored
|-- conf # where config files for your pipelines are stored
|-- src # all the code to replicate your project is stored here
|-- notebooks # all the code for one-off experiments, explorations, etc. are stored here
|-- tests # all the tests are stored here
|-- pyproject.toml
|-- README.md
|-- requirements.txt
The src folder should contain all the code that is critical for the delivery of your project.
General rule of thumb
If it's code you anticipate running multiple times (with different inputs or parameters), it should go in the src folder.
Examples include:
- data processing
- feature engineering
- model training
- model evaluation
Meanwhile, anything that is a one-off piece of analysis can live in Jupyter notebooks, stored in the notebooks folder.
This primarily includes:
- EDA
- ad-hoc model experimentation
- analysis of local model explanations
Why?
Because Jupyter notebooks are notoriously flaky, difficult to manage, and hard to test. We don’t want to be rerunning critical code via notebooks.
The Test Folder Structure
Let's say your src folder looks like this:
src
|-- pipelines
    |-- data_processing.py
    |-- feature_engineering.py
    |-- model_training.py
    |-- __init__.py
Each file contains functions and pipelines, similar to the example we saw above.
The test folder should then look like this:
tests
|-- pipelines
    |-- test_data_processing.py
    |-- test_feature_engineering.py
    |-- test_model_training.py
where the test directory mirrors the structure of the src directory and each file name starts with the test_ prefix.
The reason for this is simple:
- It's easy to find the tests for a given file, since the test folder structure mirrors the src folder.
- It keeps test code nicely separated from source code.
Running Tests
Once you have your tests set up like above, you can run them in a variety of ways:
1. Through the terminal
pytest -v
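A few other commonly used invocations:

pytest tests/pipelines/test_data_processing.py -v   # run a single test file
pytest -k "create_user_product"                      # run tests whose names match an expression
pytest -x                                            # stop at the first failure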
2. Through a code editor
I use this for all my projects.
Visual Studio Code is my editor of choice; it auto-discovers the tests for me, and it's super easy to debug.
After having a read of the docs, I don't think there's any point in me reiterating their contents since they are quite self-explanatory, so here's the link:
Similarly, most code editors will also have similar capabilities, so there’s no excuse for not writing tests.
It really is simple, read the docs and get started.
3. Through a CI pipeline (e.g. GitHub Actions, Gitlab, etc.)
It’s easy to set up tests to run automatically on pull requests via GitHub.
The idea is whenever you make a PR, it will automatically find and run the tests for you.
This means that even if you forget to run the tests locally via 1 or 2, they will always be run for you whenever you want to merge your changes.
Again, there's no point in me reiterating the docs; here's the link.
The End-Goal We Want To Achieve
Rather than regurgitating instructions you can find in the links above, I think it's a better use of both of our time to highlight some important points about what we want to achieve through automated testing.
First and foremost, automated tests are written to establish trust in your code and to minimise human error.
This is for the benefit of:
- Yourself
- Your team
- and the business as a whole.
Therefore, to truly get the most out of the tests you’ve written, you must get round to setting up a CI pipeline.
It makes a world of difference to be able to forget to run the tests locally and still have the assurance that they will be run whenever you create a PR or push changes.
You don’t want to be the person responsible for a bug that creates a production incident because you forgot to run the tests, or to be the one to have missed a bug during a PR review.
So please, if you write some tests, invest some time into setting up a CI pipeline. Read the GitHub docs, I implore you. It is trivial to set up, and it will do you wonders.
Final Remarks
After reading this article, I hope it’s impressed upon you
- The importance of writing tests, specifically within the context of data science
- How easy it is to write and run them
But there is one last reason why you should know how to write automated tests.
That reason is this:
Data science is changing.
Data science used to be largely proof-of-concept, building models in Jupyter notebooks, and sending models to engineers for deployment. Meanwhile, data scientists built up a notoriety for creating terrible code.
But now, the industry has matured.
It’s becoming easier to quickly build and deploy models as ML-Ops and ML-engineering mature.
Thus,
- model building
- deployment
- retraining
- maintenance
are becoming the tasks of machine learning engineers.
At the same time, the data wrangling that we used to do is becoming so complex that it is now being handed over to dedicated data engineering teams.
As a result, data science sits in a very narrow space between these two disciplines, and quite soon the lines between data scientist and data analyst will blur.
The trajectory is that data scientists will no longer be building cutting-edge models, but will become more business and product focused, generating insights and MI reports instead.
If you want to stay closer to the model building, it doesn’t suffice to just code anymore.
You need to learn how to code properly, and how to maintain your code well. Machine learning is no longer a novelty, it is no longer just PoCs, it is becoming software engineering.
If You Want To Learn More
If you want to learn more about software engineering skills applied to Data Science, here are some related articles:
You can also become a Team Member on Patreon here!
We have dedicated discussion threads for all articles; ask me questions about automated testing, discuss the topic in more detail, and share experiences with other data scientists. The learning doesn't need to stop here.
You can find the dedicated discussion thread for this article here.