Data cleaning is one of the most important steps you can perform in your machine-learning pipeline. Without clean data, your model and algorithm improvements likely won't matter. After all, 'garbage in, garbage out' is not just a saying, but an inherent truth of machine learning: without high-quality data, you will struggle to create a high-quality machine learning model.
In this article, I discuss how you can effectively apply data cleaning to your own dataset to improve the quality of your fine-tuned machine-learning models. I will go through why you need data cleaning and which cleaning techniques you can use. Lastly, I will also cover important points to keep in mind, such as keeping a short experimental loop.
You can also read articles on OpenAI Whisper for Transcription, Attending NVIDIA GTC Paris 2025, and Creating Powerful Embeddings for Machine Learning.
Motivation
My motivation for this article is that data is one of the most important aspects of working as a data scientist or ML engineer. This is why companies such as Tesla, DeepMind, and OpenAI invest so heavily in data annotation. Tesla, for example, had around 1,500 employees working on data annotation for its Full Self-Driving program.
However, if you have a low-quality dataset, you will struggle to build high-performing models. This is why cleaning your data after annotation is so important: it is a foundational block of every machine-learning pipeline that involves training a model.
Definition
To be explicit, I define data cleaning as a step you perform after your data annotation process. So you already have a set of samples and corresponding labels, and you now aim to clean those labels to ensure correctness.
Furthermore, the words annotation and labeling are often used interchangeably. They mean the same thing, but for consistency, I'll use annotation only. By data annotation, I mean the process of assigning a label to a data sample. For example, if you have an image of a cat, annotating the image means attaching the label cat to that image.
Data cleaning techniques
It’s important to mention that in cases with smaller datasets, you can choose to go over all samples and annotations a second time. However, in a lot of scenarios, this is not an option, as data annotation takes too much time. This is why I’m listing several techniques below to perform data cleaning more effectively.
Clustering
Clustering is a common unsupervised technique in machine learning. With clustering, you group data samples by similarity, without needing an existing set of annotations.
However, clustering is also a fantastic data cleaning technique. This is the process I use to perform data cleaning with clustering:
- Embed all of your data samples. This can be done with textual embeddings from a BERT model, visual embeddings from SqueezeNet, or multimodal embeddings such as OpenAI's CLIP. The point is that you need a numerical representation of your data samples to perform the clustering.
- Apply a clustering technique. I prefer K-means, as it assigns a cluster to every data sample, unlike DBSCAN, which also marks outliers. (Outliers are useful in a lot of scenarios, but for data cleaning they are suboptimal.) If you are using K-means, you should experiment with different values for the parameter K.
- You now have a list of data samples and their assigned clusters. I then iterate through each cluster and check whether there are differing labels within it. (See the sketch below for how steps 1 and 2 can look in code.)
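A minimal sketch of steps 1 and 2, assuming scikit-learn for the clustering; the random vectors here are stand-ins for the real embeddings you would compute with BERT, SqueezeNet, or CLIP:

import numpy as np
from sklearn.cluster import KMeans

# stand-in embeddings: in practice, compute these with BERT, SqueezeNet, or CLIP
embeddings = np.random.rand(7, 512)

# experiment with different values of K for your dataset
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
cluster_assignments = kmeans.fit_predict(embeddings)
print(cluster_assignments)  # e.g. [0 0 0 1 1 1 1]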
I now want to elaborate on step 3 using an example. I will use a simple binary classification task of assigning images to the labels cat and dog.
As a small example, I will use seven data samples with two cluster assignments. In a table, the data samples look like this:

| Image index | Cluster | Annotation |
| --- | --- | --- |
| 0 | A | Cat |
| 1 | A | Cat |
| 2 | A | Cat |
| 3 | B | Cat |
| 4 | B | Cat |
| 5 | B | Dog |
| 6 | B | Dog |
I then use a for loop to go through each cluster and decide which samples I want to look at further (see the Python code for this further down):
- Cluster A: In this cluster, all data samples have the same annotation (cat). The annotations are thus more likely to be correct, and I do not need a secondary review of these samples.
- Cluster B: We definitely want to look more closely at the samples in this cluster. Here we have images whose embeddings are located close together in the embedding space, yet with differing labels. This is suspicious, as we expect similar embeddings to have the same label. I will look closely at these four samples.
You can see how you only had to go through four of the seven data samples.
This is how you save time. You only find the data samples that are the most likely to be incorrect. You can expand this technique to thousands of samples along with more clusters, and you will save an enormous amount of time.
I will now also provide code for this example to highlight how I do the clustering with Python.
First, let’s define the mock data:
# mock data: each sample has an image index, an assigned cluster, and an annotation
sample_data = [
    {"image-idx": 0, "cluster": "A", "label": "Cat"},
    {"image-idx": 1, "cluster": "A", "label": "Cat"},
    {"image-idx": 2, "cluster": "A", "label": "Cat"},
    {"image-idx": 3, "cluster": "B", "label": "Cat"},
    {"image-idx": 4, "cluster": "B", "label": "Cat"},
    {"image-idx": 5, "cluster": "B", "label": "Dog"},
    {"image-idx": 6, "cluster": "B", "label": "Dog"},
]
Now let’s iterate over all clusters and find the samples we need to look at:
from collections import Counter

# first retrieve all unique clusters
unique_clusters = list(set(item["cluster"] for item in sample_data))

images_to_look_at = []

# iterate over all clusters
for cluster in unique_clusters:
    # fetch all items in the cluster
    cluster_items = [item for item in sample_data if item["cluster"] == cluster]

    # count how many of each label are in this cluster
    label_counts = Counter(item["label"] for item in cluster_items)

    if len(label_counts) > 1:
        # differing labels within one cluster: flag all its samples for review
        print(f"Cluster {cluster} has multiple labels: {label_counts}.")
        images_to_look_at.extend(cluster_items)
    else:
        print(f"Cluster {cluster} has a single label: {label_counts}")

print(images_to_look_at)
With this, you now only have to review the samples in the images_to_look_at variable.
Cleanlab
Cleanlab is another effective tool for cleaning your data. Cleanlab is a company offering a product to detect errors within your machine-learning application, but they have also open-sourced a tool on GitHub for performing data cleaning yourself, which is what I'll be discussing here.
Essentially, Cleanlab takes your data and analyzes your input embeddings (for example, those you made with BERT, SqueezeNet, or CLIP) as well as the output logits from your model. It then performs a statistical analysis on your data to detect the samples with the highest likelihood of having incorrect labels.
Cleanlab is a simple tool to set up, as it essentially only requires you to provide your input and output data; it handles the complicated statistical analysis for you. I have used Cleanlab myself and seen its strong ability to detect samples with potential annotation errors.
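To give an impression of the workflow, here is a minimal sketch using Cleanlab's find_label_issues function. The labels and predicted probabilities are mock values for illustration; in practice, the probabilities should be out-of-sample predictions (for example, from cross-validation):

import numpy as np
from cleanlab.filter import find_label_issues

# mock data: integer-encoded annotations and the model's predicted
# probability per class (should be out-of-sample predictions in practice)
labels = np.array([0, 0, 1, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.8],
    [0.9, 0.1],  # annotated as class 1, but the model strongly disagrees
])

# indices of likely mislabeled samples, most suspicious first
issue_indices = find_label_issues(
    labels,
    pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issue_indices)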
Considering that they have a good README available on GitHub, I will leave the full Cleanlab implementation up to the reader.
Predicting and comparing with annotations
The last data cleaning technique I'll be going through is using your fine-tuned machine-learning model to predict on samples and comparing the predictions with your annotations. You can use a technique like k-fold cross-validation, where you divide your dataset into several folds with different train and test splits, which lets you predict on the entire dataset without leaking test data into your training set.
After you have predicted on your data, you can compare the predictions with the annotation on each sample. If the prediction corresponds with the annotation, there is a lower likelihood of the sample having an incorrect annotation, and you do not need to review it.
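A minimal sketch of this approach with scikit-learn's cross_val_predict; the random features and the RandomForestClassifier are stand-ins for your own embeddings and fine-tuned model:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# stand-ins: X would be your embeddings, y your (possibly noisy) annotations
X = np.random.rand(100, 16)
y = np.random.randint(0, 2, size=100)

# every sample is predicted by a model that never saw it during training
predictions = cross_val_predict(RandomForestClassifier(), X, y, cv=5)

# only review the samples where prediction and annotation disagree
to_review = np.where(predictions != y)[0]
print(f"Review {len(to_review)} of {len(y)} samples")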
Summary of techniques
I have presented three different techniques here:
- Clustering
- Cleanlab
- Predicting and comparing
The main point in each of these techniques is to filter out samples that have a high likelihood of being incorrect and only review those samples. With this, you only need to review a subset of your data samples, saving you immense amounts of time spent reviewing data. Different techniques will fit better in different scenarios.
You can of course also combine techniques, using either the union or the intersection of their results (see the snippet after this list):
- Use the union of the samples flagged by different techniques to find more samples that are likely to be incorrect
- Use the intersection of the samples flagged by different techniques to be more confident that the flagged samples are actually incorrectly annotated
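For example, if each technique produces a set of flagged sample indices (the indices here are hypothetical), combining them is straightforward with Python sets:

# hypothetical indices of suspect samples flagged by each technique
flagged_by_clustering = {3, 4, 5, 6}
flagged_by_cleanlab = {5, 6, 12}

union = flagged_by_clustering | flagged_by_cleanlab         # more candidates, fewer misses
intersection = flagged_by_clustering & flagged_by_cleanlab  # fewer candidates, higher precision
print(union, intersection)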
Important to keep in mind
I also want to include a short section on important points to keep in mind when performing data cleaning:
- Quality > quantity
- Short experimental loop
- The effort required to improve accuracy increases exponentially
I will now elaborate on each point.
Quality > quantity
When it comes to data, it is much more important to have a dataset of correctly annotated samples than a larger dataset containing some incorrectly annotated samples. The reason is that when you train a model, it blindly trusts the annotations you have assigned and adapts its weights to this ground truth.
Imagine, for example, that you have ten images of dogs and cats. Nine of the images are correctly annotated, but one image of a dog is annotated as a cat. You are now telling the model that it should update its weights so that when it sees a dog, it should predict cat instead. This naturally degrades the performance of the model, and you should avoid it at all costs.
Short experimental loop
When working on machine learning projects, it’s important to have a short experimental loop. This is because you often have to try out different configurations of hyperparameters or other similar settings.
For example, when applying the third technique I described above, predicting with your model and comparing the output against your own annotations, I recommend retraining the model often on your cleaned data. This will improve your model's performance and allow you to detect incorrect annotations even better.
The effort required to improve accuracy increases exponentially
It's important to establish the requirements before you start a machine-learning project. Do you need a model with 99% accuracy, or is 90% enough? If 90% is enough, you can likely save yourself a lot of time, as illustrated in the graph below.
The graph is an example I made and does not use any real data. However, it highlights an important observation from my work on machine learning models: you can often quickly reach 90% accuracy (or what I define as a relatively good model; the exact threshold will, of course, depend on your project). Pushing that accuracy to 95% or even 99%, however, will require exponentially more work.

[Example graph: model accuracy versus effort invested, illustrating diminishing returns]
For example, when you first start cleaning data, retraining, and retesting your model, you will see rapid improvements. However, as you do more and more data cleaning, you will most likely see diminishing returns. Keep this in mind when working on projects and prioritizing where to spend your time.
Conclusion
In this article, I have discussed the importance of data annotation and data cleaning. I have introduced three techniques to apply effective data cleaning:
- Clustering
- Cleanlab
- Predicting and comparing
Each of these techniques can help you detect data samples that are likely to be incorrectly annotated. Depending on your dataset, different techniques will differ in effectiveness, and you will typically have to try them out to see what works best for you and the problem you are working on.
Furthermore, I have discussed important points to keep in mind when performing data cleaning. Remember that it's more important to have high-quality annotations than to increase the quantity of annotations. If you keep that in mind and maintain a short experimental loop, where you clean some data, retrain your model, and test again, you will see rapid improvements in your machine learning model's performance.