SFT-GO: Supervised Fine-Tuning with Group Optimization for Large Language Models
Gyuhak Kim, Sumiran Singh Thakur, Su Min Park, Wei Wei, Yujia Bao

TL;DR
SFT-GO introduces a token importance-based grouping and optimization method for supervised fine-tuning of large language models, leading to improved performance and robustness across benchmarks.
Contribution
It proposes a novel token grouping and weighted loss optimization approach, enhancing fine-tuning effectiveness over existing methods.
Findings
Consistently outperforms baseline methods on LLM benchmarks.
Improves model robustness across different datasets and models.
Provides theoretical analysis of convergence rate.
Abstract
Supervised fine-tuning (SFT) has become an essential step in tailoring large language models (LLMs) to align with human expectations and specific downstream tasks. However, existing SFT methods typically treat each training instance as a uniform sequence, giving equal importance to all tokens regardless of their relevance. This overlooks the fact that only a subset of tokens often contains critical, task-specific information. To address this limitation, we introduce Supervised Fine-Tuning with Group Optimization (SFT-GO), a novel approach that treats groups of tokens differently based on their importance. SFT-GO groups tokens in each sample based on their importance values and optimizes the LLM using a weighted combination of the worst-group loss and the standard cross-entropy loss. This mechanism adaptively emphasizes the most challenging token groups and guides the model to better…
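The objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so several details here are assumptions made for illustration only: tokens are split into two groups at a hypothetical `threshold` on their importance score, each group's loss is the mean token loss in that group, and the two terms are mixed as `(1 - lam) * ce + lam * worst`. The names `token_nll`, `sft_go_loss`, `threshold`, and `lam` are hypothetical and do not come from the paper.

```python
import math


def token_nll(logits, target):
    """Cross-entropy (negative log-likelihood) of one token.

    Uses the log-sum-exp trick for numerical stability.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def sft_go_loss(token_losses, importance, threshold=0.5, lam=0.5):
    """Mix the standard mean loss with the worst-group mean loss.

    Assumptions (not confirmed by the source): exactly two groups,
    split at `threshold`, and a convex mix controlled by `lam`.
    """
    if not token_losses or len(token_losses) != len(importance):
        raise ValueError("need equally sized, non-empty inputs")

    # Standard SFT term: every token weighted equally.
    ce = sum(token_losses) / len(token_losses)

    # Group tokens by importance: 0 = low importance, 1 = high importance.
    groups = {0: [], 1: []}
    for loss, score in zip(token_losses, importance):
        groups[int(score >= threshold)].append(loss)

    # Worst-group term: the largest mean loss among non-empty groups.
    group_means = [sum(g) / len(g) for g in groups.values() if g]
    worst = max(group_means)

    return (1 - lam) * ce + lam * worst


# Example: the two high-importance tokens are the hardest ones.
losses = [0.2, 0.4, 2.0, 1.0]
scores = [0.1, 0.2, 0.9, 0.8]
mixed = sft_go_loss(losses, scores, threshold=0.5, lam=0.5)
```

In this example the plain mean loss is 0.9 and the high-importance group has the larger mean (1.5), so the mixed value of about 1.2 sits between the two. Setting `lam=0` recovers ordinary cross-entropy, while `lam=1` optimizes only the worst group.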
Peer Reviews
Decision: Submitted to ICLR 2026
1. Connecting group DRO to LLM fine-tuning is simple and intuitive to understand. An added benefit that can be expected is indirectly reducing token-level spurious correlations (i.e., reliance on filler words, etc.). 2. The empirical results across different LLM benchmarks show an improvement in average performance for minimal change in training setup. 3. The definition of the grouping function is interesting - particularly the use of LLMLingua-2 for semantics-based grouping and its effectiveness
1. The dependence on tools like LLMLingua for grouping can create a suboptimal dependency on certain domains (for instance the drop in performance in Math QA in Table 2). As mentioned by the authors, any input biases in these models or the training data (in the case of TF-IDF) will be amplified in training. 2. Qualitative examples of the groups and semantically rich tokens determined by their algorithm would make it easier to support these otherwise intuitive claims. 3. While few works consider t
- The paper provides compelling empirical evidence that standard supervised fine-tuning under-optimizes semantically important tokens relative to common functional tokens, motivating the need for differential treatment across tokens. - The paper is strong in its mathematical rigor and analytical proof. It successfully proves the two propositions, with fairly high complication in the proof process utilizing algebraic massages of probabilistic values and inequalities. Proposition 2 utilizes Jense
- The choice of token importance threshold meaningfully affects performance, and the paper shows non-monotonic behavior as this parameter varies. This suggests additional tuning is required. - The method's effectiveness varies significantly across grouping strategies. For example, LLMLingua-2 performs better than TF-IDF, reflecting dependence on access to an external semantic model. Thus, SFT-GO's benefits are not inherent; they rely on choosing a strong importance estimator. - Although the p
1. Clear and simple idea: the objective is easy to reproduce and plug into an existing SFT pipeline; training-time hyperparameters are straightforward (importance threshold, mixing weight, optional schedule). 2. Ablations exist: the paper varies the importance threshold and the mixing schedule and shows that the method does not collapse under reasonable ranges. 3. Reproducibility: datasets, backbones, and overall training setup are sufficiently specified; the method does not require invasive cod
1. Empirical impact is marginal: reported gains are small (often ~1–2 points or within noise), with several tasks showing negligible or no improvement. The paper does not present statistical significance or per-task confidence intervals, so it is hard to assess robustness. 2. Limited scope of models and data: only two relatively small LLaMA variants are tested, both within the same family and on narrow English instruction data. There are no results on larger backbones, multilingual settings, or
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
