Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy

Xiaofeng Shi; Qian Kou; Yuduo Li; Hua Zhou

arXiv:2512.21017·cs.CL·December 25, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces SFTKey, a two-stage fine-tuning method that emphasizes key answer tokens in LLMs, significantly improving accuracy by balancing reasoning and answer focus.

Contribution

The paper proposes a novel two-stage fine-tuning approach that enhances LLM accuracy by focusing on answer-critical tokens, addressing attention imbalance in conventional SFT.

Findings

01

SFTKey improves average accuracy by over 5% across benchmarks.

02

It maintains correct output formats while enhancing answer correctness.

03

The method effectively balances reasoning and answer token focus.

Abstract

With the rapid advancement of Large Language Models (LLMs), the Chain-of-Thought (CoT) component has become significant for complex reasoning tasks. However, in conventional Supervised Fine-Tuning (SFT), the model can allocate disproportionately more attention to CoT sequences of excessive length. This reduces focus on the much shorter but essential Key portion (the final answer), whose correctness directly determines task success and evaluation quality. To address this limitation, we propose SFTKey, a two-stage training scheme. In the first stage, conventional SFT is applied to ensure proper output format, while in the second stage, only the Key portion is fine-tuned to improve accuracy. Extensive experiments across multiple benchmarks and model families demonstrate that SFTKey achieves an average accuracy improvement exceeding 5% over conventional SFT, while preserving the ability…
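The two-stage scheme described in the abstract can be illustrated with per-token loss masks. The sketch below is not the authors' implementation: the function names, the `prompt_len` and `key_start` parameters, and the choice to keep CoT tokens in the input during stage 2 (while excluding them from the loss) are all assumptions made for illustration. The only borrowed convention is the label value -100, which PyTorch's `CrossEntropyLoss` ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def build_labels(token_ids, prompt_len, key_start, stage):
    """Return per-token labels for one example (hypothetical helper).

    token_ids : prompt + CoT + Key tokens, in order
    prompt_len: number of prompt tokens (never supervised)
    key_start : index of the first Key (final-answer) token
    stage     : 1 = loss on CoT and Key, 2 = loss on Key only (assumed reading)
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    if not 0 <= prompt_len <= key_start <= len(token_ids):
        raise ValueError("expected 0 <= prompt_len <= key_start <= len(token_ids)")
    first_supervised = prompt_len if stage == 1 else key_start
    return [
        tok if i >= first_supervised else IGNORE_INDEX
        for i, tok in enumerate(token_ids)
    ]


def key_share(labels, key_start):
    """Fraction of loss-bearing tokens that belong to the Key portion."""
    supervised = [i for i, lab in enumerate(labels) if lab != IGNORE_INDEX]
    if not supervised:
        return 0.0
    return sum(i >= key_start for i in supervised) / len(supervised)


# Toy example: 4 prompt tokens, 12 CoT tokens, 2 Key tokens.
token_ids = list(range(100, 118))
stage1 = build_labels(token_ids, prompt_len=4, key_start=16, stage=1)
stage2 = build_labels(token_ids, prompt_len=4, key_start=16, stage=2)

print(key_share(stage1, 16))  # 2 of the 14 supervised tokens are Key tokens
print(key_share(stage2, 16))  # every supervised token is a Key token
```

In this toy example, a token-averaged stage 1 loss gives the final answer only 2/14 of the weight, which is the imbalance the abstract describes; restricting the loss to the Key span in stage 2 removes the CoT tokens from the average entirely.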

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. Clear Objective and Loss Definition: The paper astutely identifies a key deficiency in standard supervised fine-tuning (SFT) for Chain-of-Thought (CoT) tasks: the disproportionate allocation of model loss between the lengthy reasoning steps and the concise final answer, which may lead to insufficient optimization of the final answer's accuracy. This problem is defined with great clarity, providing a precise target for the proposed method.
2. Simplicity and Practicality: The SFTKey-Tag metho

Weaknesses

1. Lack of Novelty and Insufficient Literature Review: The use of "structured labels + fine-tuning" is already an active research area for enhancing model reasoning capabilities, with various implementation paths being explored. However, the paper fails to provide a sufficient comparison or discussion with recent alternative approaches that employ more complex labeling schemes or integrate reinforcement learning (e.g., arXiv:2506.20241). This omission makes it difficult to ascertain the novelty

Reviewer 02 · Rating 0 · Confidence 4

Strengths

This paper is clear and easy to follow.

Weaknesses

1. Concerns about the methodological soundness: The method lacks a compelling rationale, and the authors provide neither theoretical analysis nor empirical justification for its design. The reported results are insufficient to demonstrate the method’s effectiveness, as the observed gains could stem from various confounding factors (such as under-training in the first stage or overfitting in the second stage) rather than the proposed two-stage scheme itself.
2. Limited evaluation on simplistic benc

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The paper focuses on an important problem in fine-tuning LLMs for reasoning: the imbalance between reasoning and answer tokens. The proposed two-stage SFTKey approach is conceptually simple and easy to implement. The empirical results show moderate improvements in composite accuracy over standard SFT.

Weaknesses

1. Incorrect figure labeling: In Figure 1, the distinction between “Training” and “Loss Computation” is misleading. The figure should illustrate whether *loss* is applied to each token rather than whether the token participates in training.
2. Dataset clarity: Line 164 only refers vaguely to benchmarks without specifying the actual training–validation partitions or whether test data were held out. This undermines reproducibility.
3. Limited novelty: The use of `<Thinking>` and `<Answer>` tags to

Taxonomy

Topics: Topic Modeling · Advanced Text Analysis Techniques · Multimodal Machine Learning Applications