for an Old Challenge
You are training a model for spam detection. Your dataset has many more positives than negatives, so you invest countless hours rebalancing it to a 50/50 ratio. Now you are satisfied, because you have addressed the class imbalance. What if I told you that 60/40 could have been not only enough, but even better?
In most machine learning classification applications, the number of instances of one class outnumbers that of other classes. This slows down learning [1] and can potentially induce biases in the trained models [2]. The most widely used methods to address this rely on a simple prescription: finding a way to give all classes the same weight. Most often, this is done through simple methods such as giving more importance to minority class examples (reweighting), removing majority class examples from the dataset (undersampling), or including minority class instances more than once (oversampling).
The validity of these methods is often discussed, with both theoretical and empirical work indicating that which solution works best depends on your specific application [3]. However, there is a hidden hypothesis that is seldom discussed and too often taken for granted: Is rebalancing even a good idea? To some extent, these methods work, so the answer is yes. But should we fully rebalance our datasets? To make it simple, let us take a binary classification problem. Should we rebalance our training data to have 50% of each class? Intuition says yes, and intuition guided practice until now. In this case, intuition is wrong. For intuitive reasons.
What Do We Mean by ‘Training Imbalance’?
Before we delve into how and why 50% is not the optimal training imbalance in binary classification, let us define some relevant quantities. We call n₀ the number of instances of one class (usually, the minority class), and n₁ those of the other class. This way, the total number of data instances in the training set is n=n₀+n₁ . The quantity we analyze today is the training imbalance,
ρ⁽ᵗʳᵃⁱⁿ⁾ = n₀/n .
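As a quick illustration, here is how ρ⁽ᵗʳᵃⁱⁿ⁾ could be computed from a label vector. This is a minimal sketch: the toy `y_train` and the convention that label 1 marks the minority class are assumptions made for the example.

```python
import numpy as np

# Hypothetical label vector: 1 = minority class, 0 = majority class
y_train = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

n0 = np.sum(y_train == 1)   # minority class count
n1 = np.sum(y_train == 0)   # majority class count
rho_train = n0 / (n0 + n1)  # training imbalance rho = n0 / n

print(f"rho_train = {rho_train:.2f}")  # 3 / 10 -> 0.30
```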
Evidence that 50% Is Suboptimal
Initial evidence comes from empirical work on random forests. Kamalov and collaborators measured the optimal training imbalance, ρ⁽ᵒᵖᵗ⁾, on 20 datasets [4]. They find that its value varies from problem to problem, but conclude that a reasonable rule of thumb is ρ⁽ᵒᵖᵗ⁾=43%. This means that, according to their experiments, you want slightly more majority than minority class examples. This is, however, not the full story. If you are aiming for optimal models, don't stop here and simply set your ρ⁽ᵗʳᵃⁱⁿ⁾ to 43%.
In fact, this year, theoretical work by Pezzicoli et al. [5] showed that the optimal training imbalance is not a universal value that is valid for all applications. It is not 50%, and it is not 43%. It turns out, the optimal imbalance varies: it can sometimes be smaller than 50% (as Kamalov and collaborators measured), and sometimes larger than 50%. The specific value of ρ⁽ᵒᵖᵗ⁾ depends on the details of each specific classification problem. One way to find ρ⁽ᵒᵖᵗ⁾ is to train the model for several values of ρ⁽ᵗʳᵃⁱⁿ⁾ and measure the resulting performance.
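Such a sweep could be sketched as follows. This is a toy example, not the procedure from [5]: the two-Gaussian data, the nearest-centroid classifier, the training budget, and the grid of ρ values are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic problem: two Gaussian classes in 2D.
def sample(n, mean):
    return rng.normal(loc=mean, scale=1.0, size=(n, 2))

# Large pools to subsample from, plus a fixed balanced test set.
pool0 = sample(5000, [1.5, 0.0])    # minority-class pool
pool1 = sample(5000, [-1.5, 0.0])   # majority-class pool
test0, test1 = sample(1000, [1.5, 0.0]), sample(1000, [-1.5, 0.0])

def balanced_accuracy(c0, c1):
    # Nearest-centroid classifier: assign each point to the closer class mean.
    def is_class0(x):
        return np.linalg.norm(x - c0, axis=1) < np.linalg.norm(x - c1, axis=1)
    return 0.5 * (is_class0(test0).mean() + (~is_class0(test1)).mean())

n_total = 1000  # fixed training budget
results = {}
for rho in [0.1, 0.3, 0.5, 0.7]:
    n0 = int(rho * n_total)
    X0, X1 = pool0[:n0], pool1[:n_total - n0]  # subsample to imbalance rho
    results[rho] = balanced_accuracy(X0.mean(axis=0), X1.mean(axis=0))
    print(f"rho_train = {rho:.1f} -> balanced accuracy = {results[rho]:.3f}")
```

In practice one would repeat each point over several random subsamples and pick the ρ⁽ᵗʳᵃⁱⁿ⁾ with the best validation score.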
Although the exact patterns determining ρ⁽ᵒᵖᵗ⁾ are still unclear, it seems that when data is abundant compared to the model size, the optimal imbalance is smaller than 50%, as in Kamalov’s experiments. However, many other factors — from how intrinsically rare minority instances are, to how noisy the training dynamics is — come together to set the optimal value of the training imbalance, and to determine how much performance is lost when one trains away from ρ⁽ᵒᵖᵗ⁾.
Why Perfect Balance Isn’t Always Best
As we said, the answer is actually intuitive: since different classes have different properties, there is no reason why both classes would carry the same information. In fact, Pezzicoli's team proved that they usually do not. Therefore, to infer the best decision boundary we might need more instances of one class than of the other. Pezzicoli's work, which is in the context of anomaly detection, provides us with a simple and insightful example.
Let us assume that the data comes from a multivariate Gaussian distribution, and that we label all the points to the right of a decision boundary as anomalies. In 2D, it would look like this:

The dashed line is our decision boundary, and the points to the right of it are the n₀ anomalies. Let us now rebalance our dataset to ρ⁽ᵗʳᵃⁱⁿ⁾=0.5. To do so, we need to find more anomalies. Since anomalies are rare, those we are most likely to find lie close to the decision boundary. Already by eye, the scenario is strikingly clear:

Anomalies, in yellow, are stacked along the decision boundary, and are therefore more informative about its position than the blue points. This might lead one to think that it is better to privilege minority-class points. On the other hand, anomalies only cover one side of the decision boundary, so once one has enough minority-class points, it can become convenient to invest in more majority-class points, in order to better cover the other side of the boundary. As a consequence of these two competing effects, ρ⁽ᵒᵖᵗ⁾ is generally not 50%, and its exact value is problem-dependent.
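This concentration effect is easy to reproduce numerically. The sketch below is a simplified stand-in for Pezzicoli's setup: it samples a standard 2D Gaussian, labels points beyond a threshold on the first coordinate as anomalies, and checks how far the anomalies sit from the boundary. The threshold value and sample size are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Standard 2D Gaussian; decision boundary at x0 = 2 (assumed for illustration).
threshold = 2.0
X = rng.standard_normal((200_000, 2))
anomalies = X[X[:, 0] > threshold]  # the rare points right of the boundary

# Distance of each anomaly from the decision boundary.
dist = anomalies[:, 0] - threshold

print(f"anomaly fraction: {len(anomalies) / len(X):.4f}")
print(f"median distance from boundary: {np.median(dist):.3f}")
```

Only about 2% of the points are anomalies, and roughly half of them lie within a few tenths of a standard deviation of the boundary: newly found anomalies pile up right where the boundary sits.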
The Root Cause Is Class Asymmetry
Pezzicoli’s theory shows that the optimal imbalance is generally different from 50%, because different classes have different properties. However, they analyze only one source of diversity among classes: outlier behavior. Yet, as shown for example by Sarao Mannelli and coauthors [6], many other effects, such as the presence of subgroups within classes, can produce a similar asymmetry. It is the combination of a very large number of effects determining diversity among classes that sets the optimal imbalance for a specific problem. Until we have a theory that treats all sources of asymmetry in the data together (including those induced by how the model architecture processes them), we cannot know the optimal training imbalance of a dataset beforehand.
Key Takeaways & What You Can Do Differently
If until now you have rebalanced your binary dataset to 50%, you were doing well, but most likely not the best possible. Although we still do not have a theory that can tell us what the optimal training imbalance should be, you now know that it is likely not 50%. The good news is that such a theory is on the way: machine learning theorists are actively working on this topic. In the meantime, you can treat ρ⁽ᵗʳᵃⁱⁿ⁾ as a hyperparameter to tune beforehand, just like any other, in order to rebalance your data in the most efficient way. So before your next model training run, ask yourself: is 50/50 really optimal? Try tuning your class imbalance; your model's performance might surprise you.
References
[1] E. Francazi, M. Baity-Jesi, and A. Lucchi, A theoretical analysis of the learning dynamics under class imbalance (2023), ICML 2023
[2] K. Ghosh, C. Bellinger, R. Corizzo, P. Branco, B. Krawczyk, and N. Japkowicz, The class imbalance problem in deep learning (2024), Machine Learning, 113(7), 4845–4901
[3] E. Loffredo, M. Pastore, S. Cocco and R. Monasson, Restoring balance: principled under/oversampling of data for optimal classification (2024), ICML 2024
[4] F. Kamalov, A.F. Atiya and D. Elreedy, Partial resampling of imbalanced data (2022), arXiv preprint arXiv:2207.04631
[5] F.S. Pezzicoli, V. Ros, F.P. Landes and M. Baity-Jesi, Class imbalance in anomaly detection: Learning from an exactly solvable model (2025), AISTATS 2025
[6] S. Sarao-Mannelli, F. Gerace, N. Rostamzadeh and L. Saglietti, Bias-inducing geometries: an exactly solvable data model with fairness implications (2022), arXiv preprint arXiv:2205.15935