Gradients Must Earn Their Influence: Unifying SFT with Generalized Entropic Objectives
Zecheng Wang, Deyuan Liu, Chunshan Li, Yupeng Zhang, Zhengyun Zhao, Dianhui Chu, Bingning Wang, Dianbo Sui

TL;DR
This paper introduces a unified framework for supervised fine-tuning that adaptively balances learning from uncertain and confident predictions, improving model robustness and performance.
Contribution
It unifies token-level SFT objectives within a generalized deformed-log family and proposes DEFT, a parameter-free method that dynamically modulates trust in predictions based on entropy.
Findings
DEFT outperforms existing methods in balancing exploration and exploitation.
The universal gate-error gradient structure provides insights into model trust dynamics.
Experimental results show improved robustness and accuracy across tasks.
Abstract
Standard negative log-likelihood (NLL) for Supervised Fine-Tuning (SFT) applies uniform token-level weighting. This rigidity creates a two-fold failure mode: (i) overemphasizing low-probability targets can amplify gradients on noisy supervision and disrupt robust priors, and (ii) uniform weighting provides weak sharpening when the model is already confident. Existing methods fail to resolve the resulting plasticity-stability dilemma, often suppressing necessary learning signals alongside harmful ones. To address this issue, we unify token-level SFT objectives within a generalized deformed-log family and expose a universal gate-error gradient structure, where the gate controls how much the model trusts its current prediction. By employing the Cayley transform, we map the model's continuously evolving uncertainty onto a continuous focus trajectory, which enables seamless…
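To make the gate-error structure mentioned in the abstract concrete, here is a minimal sketch. It is not the paper's DEFT method: the abstract does not give the exact parameterization, so the Tsallis q-logarithm is assumed here purely as one familiar member of a deformed-log family, and the names q_log, deformed_loss, and deformed_grad are hypothetical. Under that assumption, the gradient of the loss with respect to the logits factors into a scalar gate p_y^(1-q) times the usual softmax error (softmax minus one-hot). Setting q = 1 gives a gate of 1 and recovers standard NLL.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def q_log(p, q):
    """Tsallis deformed logarithm (assumed family); equals log(p) at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(p)
    return (p ** (1.0 - q) - 1.0) / (1.0 - q)


def deformed_loss(logits, target, q):
    """Token-level loss -log_q(p_target); q = 1 is standard NLL."""
    return -q_log(softmax(logits)[target], q)


def deformed_grad(logits, target, q):
    """Analytic gradient w.r.t. logits: gate * (softmax - one_hot).

    The gate p_target ** (1 - q) scales the ordinary NLL error term.
    """
    probs = softmax(logits)
    gate = probs[target] ** (1.0 - q)
    return [
        gate * (p - (1.0 if i == target else 0.0))
        for i, p in enumerate(probs)
    ]


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]
    target = 2  # a low-probability target token
    for q in (1.0, 0.5, 0.0):
        gate = softmax(logits)[target] ** (1.0 - q)
        print(f"q={q}: gate={gate:.4f}, grad={deformed_grad(logits, target, q)}")
```

In this sketch, lowering q below 1 shrinks the gate for low-probability targets, which damps the gradient on tokens the model considers unlikely. That is the kind of trust modulation the abstract describes, although how the paper chooses the deformation from the model's uncertainty (via the Cayley transform) is not reproduced here.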
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
