A Refined Training Recipe for Fine-Grained Visual Classification

My research at Multitel has focused on fine-grained visual classification (FGVC); specifically, on building a robust car classifier that can run in real time on edge devices. This post is part of what may become a small series of reflections on that experience. I’m writing both to share some of the lessons I learned and to organize and consolidate them for myself. At the same time, I hope it gives a sense of the kind of high-level engineering and applied research we do at Multitel, work that blends academic rigor with real-world constraints. Whether you’re a fellow researcher, a curious engineer, or someone considering joining our team, I hope this post offers both insight and inspiration.

1. The problem:

We needed a system that could identify specific car models, not just “this is a BMW,” but which BMW model and year. And it needed to run in real time on resource-constrained edge devices alongside other models. This kind of task falls under what’s known as fine-grained visual classification (FGVC).

Example of two models along with discriminative parts [1].

FGVC aims to recognize images belonging to multiple subordinate categories of a super-category (e.g., species of animals or plants, or models of cars). The difficulty lies in capturing the fine-grained visual differences that sufficiently discriminate between objects that are highly similar in overall appearance but differ in fine-grained features [2].

Fine-grained classification vs. general image classification​ [3].

What makes FGVC particularly tricky?

  • Small inter-class variation: The visual differences between classes can be extremely subtle.
  • Large intra-class variation: At the same time, instances within the same class may vary greatly due to changes in lighting, pose, viewpoint, background, or other environmental factors, and these variations can easily overwhelm the subtle differences between classes.
  • Long-tailed distributions: Datasets typically have a few classes with many samples and many classes with very few examples. For example, you might have only a couple of images of a rare spider species found in a remote region, while common species have thousands of images. This imbalance makes it difficult for models to learn equally well across all categories.

Two species of gulls from the CUB-200 dataset illustrate the difficulty of fine-grained object classification [4].

2. The landscape:

When we first started tackling this problem, we naturally turned to literature. We dove into academic papers, examined benchmark datasets, and explored state-of-the-art FGVC methods. And at first, the problem seemed far more complicated than it actually turned out to be, at least in our specific context.

FGVC has been actively researched for years, and there’s no shortage of approaches that introduce increasingly complex architectures and pipelines. Many early works, for example, proposed two-stage models: a localization subnetwork would first identify discriminative object parts, and then a second network would classify based on those parts. Others focused on custom loss functions, high-order feature interactions, or label dependency modeling using hierarchical structures.

All of these methods were designed to tackle the subtle visual distinctions that make FGVC so challenging. If you’re curious about the evolution of these approaches, Wei et al. [2] provide a solid survey that covers many of them in depth.

Overview of the landscape of deep learning based fine-grained image analysis (FGIA) [2].

When we looked closer at recent benchmark results (archived from Papers with Code), many of the top-performing solutions were based on transformer architectures. These models often reached state-of-the-art accuracy, but with little to no discussion of inference time or deployment constraints. Given our requirements, we were fairly certain that these models wouldn’t hold up in real-time on an edge device already running multiple models in parallel.

At the time of this work, the best reported result on Stanford Cars was 97.1% accuracy, achieved by CMAL-Net.

3. Our approach:

Instead of starting with the most complex or specialized solutions, we took the opposite approach: Could a model that we already knew would meet our real-time and deployment constraints perform well enough on the task? Specifically, we asked whether a solid general-purpose architecture could get us close to the performance of more recent, heavier models, if trained properly.

That line of thinking led us to a paper by Ross Wightman et al., “ResNet Strikes Back: An Improved Training Procedure in Timm.” In it, Wightman makes a compelling argument: most new architectures are trained using the latest advancements and techniques but then compared against older baselines trained with outdated recipes. Wightman argues that ResNet-50, which is frequently used as a benchmark, is often not given the benefit of these modern improvements. His paper proposes a refined training procedure and shows that, when trained properly, even a vanilla ResNet-50 can achieve surprisingly strong results, including on several FGVC benchmarks.

With these constraints and goals in mind, we set out to build our own strong, reusable training procedure, one that could deliver high performance on FGVC tasks without relying on architecture-specific tricks. The idea was simple: start with a known, efficient backbone like ResNet-50 and focus entirely on improving the training pipeline rather than modifying the model itself. That way, the same recipe could later be applied to other architectures with minimal adjustments.

We began collecting ideas, techniques, and training refinements from across several sources, compounding best practices into a single, cohesive pipeline. In particular, we drew from four key resources:

  • Bag of Tricks for Image Classification with Convolutional Neural Networks (He et al.)
  • Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network (Lee et al.)
  • ResNet Strikes Back: An Improved Training Procedure in Timm (Wightman et al.)
  • How to Train State-of-the-Art Models Using TorchVision’s Latest Primitives (Vryniotis)

Our goal was to create a robust training pipeline that didn’t rely on model-specific tweaks. That meant focusing on techniques that are broadly applicable across architectures.

To test and validate our training pipeline, we used the Stanford Cars dataset [9], a widely used fine-grained classification benchmark that closely aligns with our real-world use case. The dataset contains 16,185 images of 196 car categories, with classes defined at the level of make, model, and year, so inter-class differences are often subtle. The data is split almost evenly into 8,144 training images and 8,041 testing images. To simulate our deployment scenario, where the classification model operates downstream of an object detection system, we crop each image to its annotated bounding box before training and evaluation.
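As a rough illustration of this preprocessing step, here is a minimal sketch; the bounding-box format and the function name are ours, so adapt them to however the annotations are actually loaded:

```python
from PIL import Image


def crop_to_bbox(image_path: str, bbox: tuple[int, int, int, int]) -> Image.Image:
    """Crop an image to its annotated bounding box.

    `bbox` is assumed to be (x1, y1, x2, y2) in pixel coordinates; adjust this
    to match how the Stanford Cars annotations are stored in your setup.
    """
    image = Image.open(image_path).convert("RGB")
    return image.crop(bbox)  # PIL expects (left, upper, right, lower)
```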

While the original hosting site for the dataset is no longer available, it remains accessible via curated repositories such as Kaggle and Hugging Face. The dataset is distributed under the BSD-3-Clause license, which permits both commercial and non-commercial use. In this work, it was used solely in a research context to produce the results presented here.

Example of a cropped Image from the Stanford Cars dataset [9].

Building the Recipe

What follows is the distilled training recipe we arrived at, built through experimentation, iteration, and careful aggregation of ideas from the works mentioned above. The idea is to show that by simply applying modern training best practices, without any architecture-specific hacks, we could get a general-purpose model like ResNet-50 to perform competitively on a fine-grained benchmark.

We’ll start with a vanilla ResNet-50 trained using a basic setup and progressively introduce improvements, one step at a time.

With each technique, we’ll report:

  • The individual performance gain
  • The cumulative gain when added to the pipeline

While many of the techniques used are likely familiar, our intent is to highlight how powerful they can be when compounded intentionally. Benchmarks often obscure this by comparing new architectures trained with the latest advancements to old baselines trained with outdated recipes. Here, we want to flip that and show what’s possible with a carefully tuned recipe applied to a widely available, efficient backbone.

We also recognize that many of these techniques interact with each other. So, in practice, we tuned some combinations through greedy or grid search to account for synergies and interdependencies.

The Base Recipe:

Before diving into optimizations, we start with a clean, simple baseline.

We train a ResNet-50 model pretrained on ImageNet using the Stanford Cars dataset. Each model is trained for 600 epochs on a single RTX 4090 GPU, with early stopping based on validation accuracy using a patience of 200 epochs.

We use:

  • Nesterov Accelerated Gradient (NAG) for optimization
  • Learning rate: 0.01
  • Batch size: 32
  • Momentum: 0.9
  • Loss function: Cross-entropy

All training and validation images are cropped to their bounding boxes and resized to 224×224 pixels. We start with the same standard augmentation policy as in [5].
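For concreteness, here is a minimal sketch of this baseline setup in PyTorch; the torchvision weights enum and the 196-way head are implementation details we assume, and NAG is simply SGD with nesterov=True:

```python
import torch
from torch import nn
from torchvision import models

# ImageNet-pretrained ResNet-50 with a 196-way head for Stanford Cars.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 196)

# Nesterov Accelerated Gradient = SGD with nesterov=True.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
criterion = nn.CrossEntropyLoss()
```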

Here’s a summary of the base training configuration and its performance:

Model: ResNet50 | Pretrain: ImageNet | Optimizer: NAG | Learning rate: 0.01 | Momentum: 0.9 | Batch size: 32
Loss function: Cross-entropy | Image size: 224×224 | Epochs: 600 | Patience: 200 | Augmentation: Standard
Accuracy: 88.22

We fix the random seed across runs to ensure reproducibility and reduce variance between experiments. To assess the true effect of a change in the recipe, we follow best practices and average results over multiple runs (typically 3 to 5).
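A minimal sketch of the kind of seeding we mean (the exact seed value is arbitrary; full bitwise determinism would additionally require restricting cuDNN, at some speed cost):

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 0) -> None:
    """Fix the random sources that affect weight init, augmentation, and batching."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```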

We’ll now build on top of this baseline step-by-step, introducing one technique at a time and tracking its impact on accuracy. The goal is to isolate what each component contributes and how they compound when applied together.

Large batch training:​

In mini-batch SGD, the gradient is a noisy estimate because the examples in each batch are sampled at random. Increasing the batch size does not change the expectation of the stochastic gradient, but it reduces its variance. Using a large batch size, however, may slow down training progress: for the same number of epochs, training with a large batch size typically yields lower validation accuracy than training with smaller batch sizes.

He et al. [5] argue that linearly scaling the learning rate with the batch size works empirically for ResNet-50 training.

To improve both the accuracy and the speed of our training, we increase the batch size to 128 and the learning rate to 0.1. We also add a StepLR scheduler that decays the learning rate of each parameter group by a factor of 0.1 every 30 epochs.
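A minimal sketch of this change; we set the values from the text directly rather than deriving them from a scaling formula, and the model line is a stand-in for the pretrained baseline defined earlier:

```python
import torch
from torchvision import models

model = models.resnet50(num_classes=196)  # stand-in for the pretrained baseline model

# Batch size 128 with learning rate 0.1 (up from 32 / 0.01 in the baseline).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

# StepLR: multiply the LR of each parameter group by gamma=0.1 every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# In the training loop, scheduler.step() is called once per epoch.
```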

Learning rate warmup:​

Since all parameters are typically initialized to random values at the beginning of training, using a learning rate that is too large may result in numerical instability.

With the warmup heuristic, we use a small learning rate at the start and switch to the initial learning rate once the training process is stable. We use a gradual warmup strategy that increases the learning rate linearly from a small value up to the initial learning rate.

We add a linear warmup strategy for 5 epochs.
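A sketch of how such a warmup can be chained in front of the StepLR schedule, assuming a reasonably recent PyTorch that provides LinearLR and SequentialLR; the start factor of 0.01 mirrors the "warmup decay" entry in the table below:

```python
import torch
from torchvision import models

model = models.resnet50(num_classes=196)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

# 5 epochs of linear warmup (starting at 1% of the base LR), then the StepLR decay.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=5)
decay = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, decay], milestones=[5]
)
```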

Learning rate curve. Image by author.
Model: ResNet50 | Pretrain: ImageNet | Optimizer: NAG | Learning rate: 0.1 | Momentum: 0.9 | Batch size: 128
Loss function: Cross-entropy | Image size: 224×224 | Epochs: 600 | Patience: 200 | Augmentation: Standard
Scheduler: StepLR (step size 30, gamma 0.1) | Warmup: Linear (5 epochs, decay 0.01)
Accuracy: 89.21 | Incremental improvement: +0.99 | Absolute improvement: +0.99

TrivialAugment:

To explore the impact of stronger data augmentation, we replaced the baseline augmentation with TrivialAugment, which works as follows: it takes an image x and a set of augmentations A as input, samples an augmentation from A uniformly at random, applies it to x with a strength m sampled uniformly at random from the set of possible strengths {0, …, 30}, and returns the augmented image.

What makes TrivialAugment especially attractive is that it’s completely parameter-free: it doesn’t require any search or tuning, making it a simple yet effective drop-in replacement that reduces experimental complexity.

While it may seem counterintuitive that such a generic and randomized strategy would outperform augmentations specifically tailored to the dataset or more sophisticated automated augmentation methods, we tried a variety of alternatives, and TrivialAugment consistently delivered strong results across runs. Its simplicity, stability, and surprisingly high effectiveness make it a compelling default choice.
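As an illustration, this is roughly how TrivialAugment slots into a torchvision pipeline; torchvision ships the wide-search-space variant as TrivialAugmentWide, and the normalization constants shown are the usual ImageNet statistics:

```python
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.TrivialAugmentWide(),   # one op, one random strength, per image
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```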

A visualization of TrivialAugment [10].
Model: ResNet50 | Pretrain: ImageNet | Optimizer: NAG | Learning rate: 0.1 | Momentum: 0.9 | Batch size: 128
Loss function: Cross-entropy | Image size: 224×224 | Epochs: 600 | Patience: 200 | Augmentation: TrivialAugment
Scheduler: StepLR (step size 30, gamma 0.1) | Warmup: Linear (5 epochs, decay 0.01)
Accuracy: 92.66 | Incremental improvement: +3.45 | Absolute improvement: +4.44

Cosine Learning Rate Decay:

Next, we explored modifying the learning rate schedule. We switched to a cosine annealing strategy, which decreases the learning rate from its initial value to 0 following the cosine function. A big advantage of cosine annealing is that it has no hyper-parameters to optimize, which further cuts down our search space.
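A sketch of the updated schedule, keeping the 5-epoch linear warmup and letting cosine annealing take over for the remaining epochs:

```python
import torch
from torchvision import models

model = models.resnet50(num_classes=196)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

epochs, warmup_epochs = 600, 5
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, total_iters=warmup_epochs
)
# Cosine annealing from the base LR down to (near) zero over the remaining epochs.
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs]
)
```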

Updated learning rate curve. Image by author.
Model: ResNet50 | Pretrain: ImageNet | Optimizer: NAG | Learning rate: 0.1 | Momentum: 0.9 | Batch size: 128
Loss function: Cross-entropy | Image size: 224×224 | Epochs: 600 | Patience: 200 | Augmentation: TrivialAugment
Scheduler: Cosine | Warmup: Linear (5 epochs, decay 0.01)
Accuracy: 93.22 | Incremental improvement: +0.56 | Absolute improvement: +5.00

Label Smoothing:

A good technique to reduce overfitting is to stop the model from becoming overconfident. This can be achieved by softening the ground truth using Label Smoothing. The idea is to change the construction of the true label to:

\[
q_i =
\begin{cases}
1 - \varepsilon, & \text{if } i = y, \\
\dfrac{\varepsilon}{K - 1}, & \text{otherwise.}
\end{cases}
\]

Here K is the number of classes and y is the ground-truth label. A single parameter ε controls the degree of smoothing (the higher, the stronger). We used a smoothing factor of ε = 0.1, the standard value proposed in the original paper and widely adopted in the literature.

Interestingly, we found empirically that adding label smoothing reduced gradient variance during training. This allowed us to safely increase the learning rate without destabilizing training; as a result, we increased the initial learning rate from 0.1 to 0.4.
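In PyTorch this is a one-line change, since CrossEntropyLoss accepts a label_smoothing argument; a minimal sketch with dummy logits:

```python
import torch
from torch import nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # softens the one-hot targets

logits = torch.randn(4, 196)              # dummy predictions for a batch of 4
targets = torch.tensor([3, 17, 42, 101])  # dummy ground-truth class indices
loss = criterion(logits, targets)
```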

Model: ResNet50 | Pretrain: ImageNet | Optimizer: NAG | Learning rate: 0.1 | Momentum: 0.9 | Batch size: 128
Loss function: Cross-entropy | Image size: 224×224 | Epochs: 600 | Patience: 200 | Augmentation: TrivialAugment
Scheduler: StepLR (step size 30, gamma 0.1) | Warmup: Linear (5 epochs, decay 0.01) | Label smoothing: 0.1
Accuracy: 94.5 | Incremental improvement: +1.28 | Absolute improvement: +6.28

Random Erasing:

As an additional form of regularization, we introduced Random Erasing into the training pipeline. This technique randomly selects a rectangular region within an image and replaces its pixels with random values, with a fixed probability.

Often paired with Automatic Augmentation methods, it usually yields additional improvements in accuracy due to its regularization effect.​ We added Random Erasing with a probability of 0.1.
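Since torchvision’s RandomErasing operates on tensors, it goes after ToTensor in the pipeline; here is a sketch extending the augmentation stack shown earlier:

```python
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.TrivialAugmentWide(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    # With probability 0.1, erase a random rectangle and fill it with random values.
    transforms.RandomErasing(p=0.1, value="random"),
])
```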

Examples of Random Erasing [11].
Model: ResNet50 | Pretrain: ImageNet | Optimizer: NAG | Learning rate: 0.1 | Momentum: 0.9 | Batch size: 128
Loss function: Cross-entropy | Image size: 224×224 | Epochs: 600 | Patience: 200 | Augmentation: TrivialAugment
Scheduler: StepLR (step size 30, gamma 0.1) | Warmup: Linear (5 epochs, decay 0.01) | Label smoothing: 0.1 | Random erasing: 0.1
Accuracy: 94.93 | Incremental improvement: +0.43 | Absolute improvement: +6.71

Exponential Moving Average (EMA):

Training a neural network with mini-batches introduces noise into the gradients used to update the model parameters. An exponential moving average of the weights is commonly used when training deep neural networks to improve stability and generalization.

Instead of just using the raw weights learned directly during training, EMA maintains a running average of the model weights, which is updated at each training step as a weighted average of the current weights and the previous EMA values.

Specifically, at each training step, the EMA weights are updated using:

\[
\theta_{\mathrm{EMA}} \leftarrow \alpha\,\theta_{\mathrm{EMA}} + (1 - \alpha)\,\theta
\]

where θ are the current model weights and α is a decay factor controlling how much weight is given to the past.

By evaluating the EMA weights rather than the raw ones at test time, we found improved consistency in performance across runs, especially in the later stages of training.
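A minimal, self-contained sketch of the update above; libraries such as timm and torchvision ship their own EMA helpers, and this hand-rolled version just mirrors the formula, with the decay and update interval taken from the table below:

```python
import copy

import torch


class ModelEMA:
    """Maintain an exponential moving average of a model's weights and buffers."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.994):
        self.ema = copy.deepcopy(model).eval()  # this copy is what we evaluate
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        # theta_EMA <- alpha * theta_EMA + (1 - alpha) * theta
        for ema_v, v in zip(self.ema.state_dict().values(), model.state_dict().values()):
            if ema_v.dtype.is_floating_point:
                ema_v.mul_(self.decay).add_(v, alpha=1.0 - self.decay)
            else:
                ema_v.copy_(v)  # e.g. BatchNorm's num_batches_tracked counter
```

In the training loop, ema.update(model) would be called every 32 optimizer steps (the "EMA steps" entry in the table), and validation would run on ema.ema rather than the raw model.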

Model: ResNet50 | Pretrain: ImageNet | Optimizer: NAG | Learning rate: 0.1 | Momentum: 0.9 | Batch size: 128
Loss function: Cross-entropy | Image size: 224×224 | Epochs: 600 | Patience: 200 | Augmentation: TrivialAugment
Scheduler: StepLR (step size 30, gamma 0.1) | Warmup: Linear (5 epochs, decay 0.01) | Label smoothing: 0.1 | Random erasing: 0.1
EMA steps: 32 | EMA decay: 0.994
Accuracy: 94.93 | Incremental improvement: 0 | Absolute improvement: +6.71

We tested EMA in isolation, and found that it led to notable improvements in both training stability and validation performance. But when we integrated EMA into the full recipe alongside other techniques, it did not provide further improvement. The results appeared to plateau, suggesting that most of the gains had already been captured by the other components.

Because our goal is to develop a general-purpose training recipe rather than one overly tailored to a single dataset, we chose to keep EMA in the final setup. Its benefits may be more pronounced in other conditions, and its low overhead makes it a safe inclusion.

Optimizations we tested but didn’t adopt:

We also explored a range of additional techniques that are commonly effective in other image classification tasks, but found that they either did not lead to significant improvements or, in some cases, slightly regressed performance on the Stanford Cars dataset:

  • Weight Decay: Adds L2 regularization to discourage large weights during training. We experimented extensively with weight decay in our use case, but it consistently regressed performance.
  • CutMix/MixUp: CutMix replaces random patches between images and mixes the corresponding labels; MixUp creates new training samples by linearly combining pairs of images and labels. We tried applying either CutMix or MixUp randomly with equal probability during training (see the sketch after this list), but this approach regressed results.
  • AutoAugment: Delivered strong results and competitive accuracy, but we found TrivialAugment to be better. More importantly, TrivialAugment is completely parameter-free, which cuts down our search space and simplifies tuning.
  • Alternative Optimizers and Schedulers: We experimented with a wide range of optimizers and learning rate schedules. Nesterov Accelerated Gradient (NAG) consistently gave us the best performance among optimizers, and Cosine Annealing stood out as the best scheduler, delivering strong results with no additional hyperparameters to tune.
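For completeness, this is roughly how we combined the two, assuming a torchvision version that ships the transforms.v2 batch transforms; CutMix and MixUp expect a whole batch of images and integer labels, and return mixed images with soft labels:

```python
import torch
from torchvision.transforms import v2

NUM_CLASSES = 196
cutmix_or_mixup = v2.RandomChoice([
    v2.CutMix(num_classes=NUM_CLASSES),
    v2.MixUp(num_classes=NUM_CLASSES),
])  # pick one of the two with equal probability for each batch

images = torch.rand(8, 3, 224, 224)                    # dummy batch
labels = torch.randint(0, NUM_CLASSES, (8,))
images, soft_labels = cutmix_or_mixup(images, labels)  # labels become mixed soft targets
```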

4. Conclusion:

The graph below summarizes the improvements as we progressively built up our training recipe:

Cumulative Accuracy Improvement from Model Refinements. Image by author.

Using just a standard ResNet-50, we were able to achieve strong performance on the Stanford Cars dataset, demonstrating that careful tuning of a few simple techniques can go a long way in fine-grained classification.

However, it’s important to keep this in perspective. These results mainly show that we can train a model to distinguish between fine-grained, well-represented classes in a clean, curated dataset. The Stanford Cars dataset is nearly class-balanced, with high-quality images and no major occlusion or real-world noise. It does not address challenges like long-tailed distributions, domain shift, or recognition of unseen classes.

In practice, you’ll never have a dataset that covers every car model, especially since new models keep appearing and such a dataset would need constant updating. Real-world systems need to handle distribution shift, open-set recognition, and imperfect inputs.

So while this served as a strong baseline and proof of concept, there was still significant work to be done to build something robust and production-ready.

References:

[1] Krause, Deng, et al. Collecting a Large-Scale Dataset of Fine-Grained Cars.

[2] Wei, et al. Fine-Grained Image Analysis with Deep Learning: A Survey.

[3] Reslan, Farou. Automatic Fine-grained Classification of Bird Species Using Deep Learning.

[4] Zhao, et al. A Survey on Deep Learning-Based Fine-Grained Object Classification and Semantic Segmentation.

[5] He, et al. Bag of Tricks for Image Classification with Convolutional Neural Networks.

[6] Lee, et al. Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network.

[7] Wightman, et al. ResNet Strikes Back: An Improved Training Procedure in Timm.

[8] Vryniotis. How to Train State-of-the-Art Models Using TorchVision’s Latest Primitives.

[9] Krause, et al. 3D Object Representations for Fine-Grained Categorization.

[10] Müller, Hutter. TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation.

[11] Zhong, et al. Random Erasing Data Augmentation.
