Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy
Xiaofeng Shi, Qian Kou, Yuduo Li, Hua Zhou

TL;DR
This paper introduces SFTKey, a two-stage fine-tuning method: conventional SFT first, then fine-tuning on only the final-answer (Key) tokens. It reports an average accuracy gain of more than 5% over conventional SFT while preserving the output format.
Contribution
The paper proposes a novel two-stage fine-tuning approach that enhances LLM accuracy by focusing on answer-critical tokens, addressing attention imbalance in conventional SFT.
Findings
SFTKey improves average accuracy by more than 5% over conventional SFT across multiple benchmarks and model families.
It maintains correct output formats while enhancing answer correctness.
The method effectively balances reasoning and answer token focus.
Abstract
With the rapid advancement of Large Language Models (LLMs), the Chain-of-Thought (CoT) component has become significant for complex reasoning tasks. However, in conventional Supervised Fine-Tuning (SFT), the model could allocate disproportionately more attention to CoT sequences with excessive length. This reduces focus on the much shorter but essential Key portion, the final answer, whose correctness directly determines task success and evaluation quality. To address this limitation, we propose SFTKey, a two-stage training scheme. In the first stage, conventional SFT is applied to ensure proper output format, while in the second stage, only the Key portion is fine-tuned to improve accuracy. Extensive experiments across multiple benchmarks and model families demonstrate that SFTKey achieves an average accuracy improvement exceeding 5% over conventional SFT, while preserving the ability…
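The two-stage idea in the abstract can be illustrated with a loss mask over the response tokens. The sketch below is not the authors' implementation: the tag strings, the choice to exclude the tags themselves from the stage-2 loss, and the helper names `loss_mask`, `masked_nll`, and `answer_share` are all assumptions made for illustration. It uses plain Python lists in place of a real tokenizer and model.

```python
from math import log

# Tag names follow the <Thinking>/<Answer> convention mentioned in the reviews;
# the exact strings and the closing-tag form are assumptions.
ANSWER_OPEN = "<Answer>"
ANSWER_CLOSE = "</Answer>"


def loss_mask(tokens, stage):
    """Return a 0/1 mask marking which response tokens contribute to the loss.

    stage 1: every response token (conventional SFT).
    stage 2: only tokens strictly between the answer tags (the Key portion).
    """
    if stage == 1:
        return [1] * len(tokens)
    if stage != 2:
        raise ValueError("stage must be 1 or 2")
    mask, inside = [], False
    for tok in tokens:
        if tok == ANSWER_CLOSE:
            inside = False
        mask.append(1 if inside else 0)
        if tok == ANSWER_OPEN:
            inside = True
    return mask


def masked_nll(token_probs, mask):
    """Mean negative log-likelihood over the tokens selected by the mask."""
    selected = [-log(p) for p, m in zip(token_probs, mask) if m]
    if not selected:
        raise ValueError("mask selects no tokens")
    return sum(selected) / len(selected)


def answer_share(tokens):
    """Fraction of the token-averaged stage-1 loss weight on answer tokens."""
    return sum(loss_mask(tokens, 2)) / len(tokens)
```

For a toy response of ten tokens with a single answer token, `answer_share` returns 0.1: under a token-averaged stage-1 loss, the answer carries one tenth of the weight, and the share shrinks further as the reasoning grows longer. Under the stage-2 mask the same answer token receives all of the weight.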
Peer Reviews
Decision: Submitted to ICLR 2026
1. Clear Objective and Loss Definition: The paper astutely identifies a key deficiency in standard supervised fine-tuning (SFT) for Chain-of-Thought (CoT) tasks: the disproportionate allocation of model loss between the lengthy reasoning steps and the concise final answer, which may lead to insufficient optimization of the final answer's accuracy. This problem is defined with great clarity, providing a precise target for the proposed method. 2. Simplicity and Practicality: The SFTKey-Tag metho
1. Lack of Novelty and Insufficient Literature Review: The use of "structured labels + fine-tuning" is already an active research area for enhancing model reasoning capabilities, with various implementation paths being explored. However, the paper fails to provide a sufficient comparison or discussion with recent alternative approaches that employ more complex labeling schemes or integrate reinforcement learning (e.g., arXiv:2506.20241). This omission makes it difficult to ascertain the novelty
This paper is clear and easy to follow.
1. Concerns about the methodological soundness: The method lacks a compelling rationale, and the authors provide neither theoretical analysis nor empirical justification for its design. The reported results are insufficient to demonstrate the method’s effectiveness, as the observed gains could stem from various confounding factors—such as under-training in the first stage or overfitting in the second stage—rather than the proposed two-stage scheme itself. 2. Limited evaluation on simplistic benc
The paper focuses on an important problem in fine-tuning LLMs for reasoning: the imbalance between reasoning and answer tokens. The proposed two-stage SFTKey approach is conceptually simple and easy to implement. The empirical results show moderate improvements in composite accuracy over standard SFT.
1. Incorrect figure labeling: In Figure 1, the distinction between “Training” and “Loss Computation” is misleading. The figure should illustrate whether *loss* is applied to each token rather than whether the token participates in training. 2. Dataset clarity: Line 164 only refers vaguely to benchmarks without specifying the actual training–validation partitions or whether test data were held out. This undermines reproducibility. 3. Limited novelty: The use of `<Thinking>` and `<Answer>` tags to
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Multimodal Machine Learning Applications
