Data cleaning is one of the most important steps you can perform in your machine-learning pipeline. Without clean data, your model and algorithm improvements likely won't matter. After all, 'garbage in, garbage out' is not just a saying, but an inherent truth of machine learning: without high-quality data, you will struggle to create a high-quality machine learning model.
In this article, I discuss how you can effectively apply data cleaning to your own dataset to improve the quality of your fine-tuned machine-learning models. I will go through why you need data cleaning and which cleaning techniques you can use. Lastly, I will also cover important points to keep in mind, such as keeping a short experimental loop.
You can also read articles on OpenAI Whisper for Transcription, Attending NVIDIA GTC Paris 2025, and Creating Powerful Embeddings for Machine Learning.
Motivation
My motivation for this article is that data is one of the most important aspects of working as a data scientist or ML engineer. This is why companies such as Tesla, DeepMind, and OpenAI invest so heavily in data annotation. Tesla, for example, had around 1,500 employees working on data annotation for its Full Self-Driving program.
However, if you have a low-quality dataset, you will struggle to build high-performing models. This is why cleaning your data after annotation is so important: it is a foundational block of every machine-learning pipeline that involves training a model.
Definition
To be explicit, I define data cleaning as a step you perform after your data annotation process. So you already have a set of samples and corresponding labels, and you now aim to clean those labels to ensure correctness.
Furthermore, the words annotation and labeling are often used interchangeably. They mean the same thing, but for consistency, I'll use annotation only. By data annotation, I mean the process of assigning a label to a data sample. For example, if you have an image of a cat, annotating the image means attaching the label cat to that image.
Data cleaning techniques
It’s important to mention that in cases with smaller datasets, you can choose to go over all samples and annotations a second time. However, in a lot of scenarios, this is not an option, as data annotation takes too much time. This is why I’m listing several techniques below to perform data cleaning more effectively.
Clustering
Clustering is a common unsupervised technique in machine learning. With clustering, you group data samples by similarity, without needing an existing set of annotations.
However, clustering is also a fantastic data cleaning technique. This is the process I use to perform data cleaning with clustering:
- Embed all of your data samples. This can be done with textual embeddings from a BERT model, visual embeddings from SqueezeNet, or multimodal embeddings such as OpenAI's CLIP. The point is that you need a numerical representation of your data samples to perform the clustering.
- Apply a clustering technique. I prefer K-means, as it assigns a cluster to every data sample, unlike DBSCAN, which also marks outliers. (Outliers are useful in a lot of scenarios, but for data cleaning they are suboptimal.) If you are using K-means, you should experiment with different values for the parameter K.
- You now have a list of data samples and their assigned clusters. I then iterate through each cluster and check whether there are differing labels within it. (See the sketch below for how steps 1 and 2 can look in code.)
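A minimal sketch of steps 1 and 2, assuming scikit-learn for the clustering; the random vectors here are stand-ins for the real embeddings you would compute with BERT, SqueezeNet, or CLIP:

import numpy as np
from sklearn.cluster import KMeans

# stand-in embeddings: in practice, compute these with BERT, SqueezeNet, or CLIP
embeddings = np.random.rand(7, 512)

# experiment with different values of K for your dataset
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
cluster_assignments = kmeans.fit_predict(embeddings)
print(cluster_assignments)  # e.g. [0 0 0 1 1 1 1]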
I now want to elaborate on step 3 using an example. I will use a simple binary classification task of assigning images to the labels cat and dog.
As a small example, I will use seven data samples with two cluster assignments. In a table, the data samples look like this:

| Image index | Cluster | Annotation |
| --- | --- | --- |
| 0 | A | Cat |
| 1 | A | Cat |
| 2 | A | Cat |
| 3 | B | Cat |
| 4 | B | Cat |
| 5 | B | Dog |
| 6 | B | Dog |
I then use a for loop to go through each cluster and decide which samples I want to look at further (see the Python code for this further down):
- Cluster A: In this cluster, all data samples have the same annotation (cat). The annotations are thus more likely to be correct, and I do not need a secondary review of these samples.
- Cluster B: We definitely want to look more closely at the samples in this cluster. Here we have images whose embeddings are located close together in the embedding space, yet with differing labels. This is suspicious, as we expect similar embeddings to have the same label. I will look closely at these four samples.
You can see how you only had to go through four of the seven data samples.
This is how you save time. You only find the data samples that are the most likely to be incorrect. You can expand this technique to thousands of samples along with more clusters, and you will save an enormous amount of time.
I will now also provide code for this example to highlight how I do the clustering with Python.
First, let’s define the mock data:
# mock data: each sample has an image index, an assigned cluster, and an annotation
sample_data = [
    {"image-idx": 0, "cluster": "A", "label": "Cat"},
    {"image-idx": 1, "cluster": "A", "label": "Cat"},
    {"image-idx": 2, "cluster": "A", "label": "Cat"},
    {"image-idx": 3, "cluster": "B", "label": "Cat"},
    {"image-idx": 4, "cluster": "B", "label": "Cat"},
    {"image-idx": 5, "cluster": "B", "label": "Dog"},
    {"image-idx": 6, "cluster": "B", "label": "Dog"},
]
Now let’s iterate over all clusters and find the samples we need to look at:
from collections import Counter

# first retrieve all unique clusters
unique_clusters = list(set(item["cluster"] for item in sample_data))

images_to_look_at = []

# iterate over all clusters
for cluster in unique_clusters:
    # fetch all items in the cluster
    cluster_items = [item for item in sample_data if item["cluster"] == cluster]

    # count how many of each label are in this cluster
    label_counts = Counter(item["label"] for item in cluster_items)

    if len(label_counts) > 1:
        # differing labels within one cluster: flag all its samples for review
        print(f"Cluster {cluster} has multiple labels: {label_counts}.")
        images_to_look_at.extend(cluster_items)
    else:
        print(f"Cluster {cluster} has a single label: {label_counts}")

print(images_to_look_at)
With this, you now only have to review the samples in the images_to_look_at variable.
Cleanlab
Cleanlab is another effective tool for cleaning your data. Cleanlab is a company offering a product to detect errors within your machine-learning application, but they have also open-sourced a tool on GitHub for performing data cleaning yourself, which is what I'll be discussing here.
Essentially, Cleanlab takes your data and analyzes your input embeddings (for example, those you made with BERT, SqueezeNet, or CLIP) as well as the output logits from your model. It then performs a statistical analysis on your data to detect the samples with the highest likelihood of having incorrect labels.
Cleanlab is a simple tool to set up, as it essentially only requires you to provide your input and output data; it handles the complicated statistical analysis for you. I have used Cleanlab myself and seen its strong ability to detect samples with potential annotation errors.
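To give an impression of the workflow, here is a minimal sketch using Cleanlab's find_label_issues function. The labels and predicted probabilities are mock values for illustration; in practice, the probabilities should be out-of-sample predictions (for example, from cross-validation):

import numpy as np
from cleanlab.filter import find_label_issues

# mock data: integer-encoded annotations and the model's predicted
# probability per class (should be out-of-sample predictions in practice)
labels = np.array([0, 0, 1, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.8],
    [0.9, 0.1],  # annotated as class 1, but the model strongly disagrees
])

# indices of likely mislabeled samples, most suspicious first
issue_indices = find_label_issues(
    labels,
    pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issue_indices)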
Considering that they have a good README available on GitHub, I will leave the full Cleanlab implementation up to the reader.
Predicting and comparing with annotations
The last data cleaning technique I'll be going through is using your fine-tuned machine-learning model to predict on samples and comparing the predictions with your annotations. You can use a technique like k-fold cross-validation, where you divide your dataset into several folds with different train and test splits, which lets you predict on the entire dataset without leaking test data into your training set.
After you have predicted on your data, you can compare the predictions with the annotation on each sample. If the prediction corresponds with the annotation, there is a lower likelihood of the sample having an incorrect annotation, and you do not need to review it.
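A minimal sketch of this approach with scikit-learn's cross_val_predict; the random features and the RandomForestClassifier are stand-ins for your own embeddings and fine-tuned model:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# stand-ins: X would be your embeddings, y your (possibly noisy) annotations
X = np.random.rand(100, 16)
y = np.random.randint(0, 2, size=100)

# every sample is predicted by a model that never saw it during training
predictions = cross_val_predict(RandomForestClassifier(), X, y, cv=5)

# only review the samples where prediction and annotation disagree
to_review = np.where(predictions != y)[0]
print(f"Review {len(to_review)} of {len(y)} samples")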
Summary of techniques
I have presented three different techniques here:
- Clustering
- Cleanlab
- Predicting and comparing
The main point in each of these techniques is to filter out samples that have a high likelihood of being incorrect and only review those samples. With this, you only need to review a subset of your data samples, saving you immense amounts of time spent reviewing data. Different techniques will fit better in different scenarios.
You can of course also combine techniques, using either the union or the intersection of their results (see the snippet after this list):
- Use the union of the samples flagged by different techniques to find more samples that are likely to be incorrect
- Use the intersection of the samples flagged by different techniques to be more confident that the flagged samples are actually incorrectly annotated
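For example, if each technique produces a set of flagged sample indices (the indices here are hypothetical), combining them is straightforward with Python sets:

# hypothetical indices of suspect samples flagged by each technique
flagged_by_clustering = {3, 4, 5, 6}
flagged_by_cleanlab = {5, 6, 12}

union = flagged_by_clustering | flagged_by_cleanlab         # more candidates, fewer misses
intersection = flagged_by_clustering & flagged_by_cleanlab  # fewer candidates, higher precision
print(union, intersection)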
Important to keep in mind
I also want to include a short section on important points to keep in mind when performing data cleaning:
- Quality > quantity
- Short experimental loop
- The effort required to improve accuracy increases exponentially
I will now elaborate on each point.
Quality > quantity
When it comes to data, it is much more important to have a dataset of correctly annotated samples than a larger dataset containing some incorrectly annotated samples. The reason is that when you train a model, it blindly trusts the annotations you have assigned and adapts its weights to this ground truth.
Imagine, for example, that you have ten images of dogs and cats. Nine of the images are correctly annotated, but one image of a dog is annotated as a cat. You are now telling the model that it should update its weights so that when it sees a dog, it should predict cat instead. This naturally degrades the performance of the model, and you should avoid it at all costs.
Short experimental loop
When working on machine learning projects, it’s important to have a short experimental loop. This is because you often have to try out different configurations of hyperparameters or other similar settings.
For example, when applying the third technique I described above, predicting with your model and comparing the output against your own annotations, I recommend retraining the model often on your cleaned data. This will improve your model's performance and allow you to detect incorrect annotations even better.
The effort required to improve accuracy increases exponentially
It's important to establish the requirements before you start a machine-learning project. Do you need a model with 99% accuracy, or is 90% enough? If 90% is enough, you can likely save yourself a lot of time, as illustrated in the graph below.
The graph is an example I made and does not use any real data. However, it highlights an important observation from my work on machine learning models: you can often quickly reach 90% accuracy (or what I define as a relatively good model; the exact threshold will, of course, depend on your project). Pushing that accuracy to 95% or even 99%, however, will require exponentially more work.

[Example graph: model accuracy versus effort invested, illustrating diminishing returns]
For example, when you first start cleaning data, retraining, and retesting your model, you will see rapid improvements. However, as you do more and more data cleaning, you will most likely see diminishing returns. Keep this in mind when working on projects and prioritizing where to spend your time.
Conclusion
In this article, I have discussed the importance of data annotation and data cleaning. I have introduced three techniques to apply effective data cleaning:
- Clustering
- Cleanlab
- Predicting and comparing
Each of these techniques can help you detect data samples that are likely to be incorrectly annotated. Depending on your dataset, different techniques will differ in effectiveness, and you will typically have to try them out to see what works best for you and the problem you are working on.
Furthermore, I have discussed important points to keep in mind when performing data cleaning. Remember that it's more important to have high-quality annotations than to increase the quantity of annotations. If you keep that in mind and maintain a short experimental loop, where you clean some data, retrain your model, and test again, you will see rapid improvements in your machine learning model's performance.