
Dynamic Fine-Tuning (DFT): Bridging the Generalization Gap in Supervised Fine-Tuning (SFT) for LLMs

Supervised Fine-Tuning (SFT) is a standard technique for adapting LLMs to new tasks by training them on expert demonstration datasets. It is valued for its simplicity and its ability to instill expert-like behavior quickly, but it often underperforms reinforcement learning (RL) in generalization. RL allows models to explore diverse strategies, which leads to stronger generalization. However, RL demands high computational resources, careful hyperparameter tuning, and access to reward signals, which are not always practical. Although hybrid methods combining SFT and RL exist, the question remains: can SFT itself be fundamentally improved? This question matters most when datasets lack negative samples or reward models are unavailable.

Existing attempts to address the challenges of SFT and RL have led to a variety of hybrid methods. A common strategy combines an initial SFT phase with subsequent RL refinement, as seen in methods like InstructGPT. Alternative methods, such as interleaving SFT and RL steps or Direct Preference Optimization (DPO), aim to integrate imitation and reinforcement signals more efficiently. Techniques like Negative-aware Fine-Tuning (NFT) allow models to self-improve by modeling their own incorrect outputs. Theoretical work has attempted to unify SFT and RL, viewing SFT as a reward-weighted or implicit form of RL. However, these approaches fall short of establishing a precise mathematical equivalence between SFT and offline policy gradients.

A team of researchers from Southeast University, UC Berkeley, Shanghai Jiao Tong University, Nanyang Technological University, and Wuhan University have proposed Dynamic Fine-Tuning (DFT), a method to address the limited generalization of SFT LLMs. Through mathematical analysis, they identify that standard SFT gradients encode a flawed reward structure, limiting the model’s capacity to generalize effectively. DFT addresses this by stabilizing gradient updates through dynamic rescaling of the objective function based on the probability of each token. This modification enhances generalization across multiple benchmarks and base models. Moreover, DFT shows competitive performance in offline RL settings, offering a simpler alternative to traditional RL methods.
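To make the "dynamic rescaling" concrete, here is a minimal PyTorch sketch of a probability-weighted SFT loss in the spirit of DFT: the per-token cross-entropy is scaled by the model's own (detached) probability of the target token. The function name, arguments, and exact weighting and normalization are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def dft_loss(logits: torch.Tensor, targets: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """logits: [batch, seq, vocab]; targets: [batch, seq] token ids (ignore_index marks padding)."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Gather the log-probability of each target token; clamp avoids indexing with ignore_index.
    token_logp = log_probs.gather(-1, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    mask = (targets != ignore_index).float()

    # Standard SFT would minimize -token_logp. A DFT-style reweighting scales each
    # token's loss by its own detached probability, so low-probability target tokens
    # no longer receive disproportionately large gradient updates.
    weight = token_logp.detach().exp()  # stop-gradient: the weight is treated as a constant
    loss = -(weight * token_logp * mask).sum() / mask.sum().clamp(min=1.0)
    return loss
```

Relative to standard cross-entropy, this amounts to a one-line change to the token-level objective, which is consistent with the paper's framing of DFT as a lightweight modification to SFT.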

DFT is evaluated in a standard SFT setting, where only expert demonstration data is available, without negative samples, reward models, or verification signals. Training uses the NuminaMath CoT dataset, which contains 860k mathematical problems and solutions drawn from various sources, including Chinese high school mathematics exercises and U.S. and international mathematical olympiads. In an offline RL setting, DFT is tested under the rejection sampling fine-tuning (RFT) framework: responses are generated for 10k math questions, and only those with verified correct answers are retained, yielding 140k training examples. Positive-negative preference pairs are also constructed from the generated responses for DPO training.
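As a rough illustration of this data-construction step, the sketch below mimics rejection sampling fine-tuning: sample several candidate solutions per question and keep only those whose final answers verify against a reference. The helpers `generate` and `is_correct` are hypothetical stand-ins, not functions from the paper's released code.

```python
def build_rft_dataset(questions, references, generate, is_correct, samples_per_question=16):
    """Collect verified-correct model responses as SFT training examples."""
    dataset = []
    for question, reference in zip(questions, references):
        for response in generate(question, n=samples_per_question):
            if is_correct(response, reference):  # e.g., exact match on the final boxed answer
                dataset.append({"prompt": question, "response": response})
    return dataset
```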

In SFT settings, DFT outperforms standard SFT across all evaluated LLMs, and shows superior generalization and robustness on challenging benchmarks where standard SFT yields a minimal or negative impact. It exhibits better learning efficiency and faster convergence characteristics, and outperforms Importance-Weighted SFT (iw-SFT) in most scenarios. In offline RL settings, DFT outperforms both offline and online RL baselines. It scores an average of 35.43, exceeding the best offline method, RFT, by +11.46 points, and outperforms the strongest online RL algorithm, GRPO, by +3.43 points. Moreover, DFT scores 64.71 on Math500, slightly ahead of GRPO, and achieves significant gains on harder tasks like AMC23 (+7.19 over GRPO) and Minerva Math (+6.23 over GRPO).

In this work, researchers address the generalization gap between SFT and RL. They introduce Dynamic Fine-Tuning (DFT), a simple yet powerful method that dynamically reweights the SFT loss using token probabilities. This one-line modification stabilizes learning and enhances generalization, as evidenced by performance gains across mathematical reasoning benchmarks. However, evaluations of DFT are limited to math-focused datasets and models up to 7B parameters, with no testing on other domains or larger models. Moreover, this research is limited to text-only scenarios. Future work aims to extend DFT to broader benchmarks, larger models, and vision-language tasks to validate its cross-modal effectiveness.




Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
