ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning

Xiaohan Qin; Xiaoxing Wang; Ning Liao; Cancheng Zhang; Xiangdong Zhang; Mingquan Feng; Jingzhi Wang; Junchi Yan

arXiv:2510.18250·cs.AI·October 22, 2025


Open Access · 3 Reviews

TL;DR

ssToken introduces a novel token selection method for LLM fine-tuning that combines self-modulation based on history models with semantic-aware importance estimation, improving efficiency and performance.

Contribution

The paper proposes ssToken, a new token selection approach that eliminates the need for reference models and combines loss-based and semantic-aware metrics for better token filtering.

Findings

1. Outperforms full-data fine-tuning with fewer tokens.

2. Self-modulated selection adapts along optimization trajectory.

3. Semantic-aware metric provides complementary importance information.

Abstract

Data quality plays a critical role in enhancing supervised fine-tuning (SFT) for large language models (LLMs), and token-level data selection has emerged as a promising direction for its fine-grained nature. Despite their strong empirical performance, existing token-level selection methods share two key limitations: (1) requiring training or accessing an additional reference model, and (2) relying solely on loss information for token selection, which cannot well preserve semantically important tokens that are not favored by loss-based metrics. To address these challenges, we propose ssToken, a Self-modulated and Semantic-aware Token Selection approach. ssToken leverages readily accessible history models to compute the per-token loss difference with the current model, which serves as a self-modulated signal that enables the model to adaptively select tokens along its optimization…
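The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the self-modulated signal is the history-model loss minus the current-model loss (consistent with Reviewer 01's reading of Equation (3)), that a per-token attention-based importance score is already available, and that the two are min-max normalized and mixed with a weight. The names `select_tokens`, `attn_score`, and `gamma` are hypothetical; `rho` stands for the token selection ratio ρ mentioned in the reviews.

```python
import numpy as np


def _minmax(x):
    """Rescale a 1-D array to [0, 1]; a constant array maps to zeros."""
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span


def select_tokens(hist_loss, curr_loss, attn_score, rho=0.5, gamma=0.5):
    """Return a boolean mask that keeps the top-rho fraction of tokens.

    hist_loss, curr_loss: per-token losses under the history and current model.
    attn_score: per-token semantic importance (assumed to be given).
    gamma: hypothetical mixing weight between the two signals.
    """
    hist_loss = np.asarray(hist_loss, dtype=float)
    curr_loss = np.asarray(curr_loss, dtype=float)
    attn_score = np.asarray(attn_score, dtype=float)

    # Assumed sign convention: a larger drop in loss since the history
    # model gives a larger score.
    loss_signal = hist_loss - curr_loss

    score = gamma * _minmax(loss_signal) + (1.0 - gamma) * _minmax(attn_score)

    k = max(1, int(round(rho * score.size)))        # number of tokens to keep
    keep = np.argsort(-score, kind="stable")[:k]    # indices of the top-k scores

    mask = np.zeros(score.size, dtype=bool)
    mask[keep] = True
    return mask


# Usage: average the current loss over the selected tokens only.
hist = [2.0, 1.0, 3.0, 0.5]
curr = [1.0, 1.0, 1.0, 0.5]
attn = [0.1, 0.9, 0.2, 0.0]
mask = select_tokens(hist, curr, attn, rho=0.5, gamma=0.5)
selected_loss = float(np.asarray(curr)[mask].mean())
```

With `gamma=1.0` the sketch reduces to loss-only selection, which makes it easy to see which tokens the semantic score adds or removes.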

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. ssToken does not require a pre-trained reference model, making it more cost-effective for token selection. 2. In addition to the loss-based metric, the authors also developed an attention-based metric to evaluate token importance. 3. The experiments are comprehensive and demonstrate the strong performance of ssToken.

Weaknesses

1. There is a contradiction between Equation (2) and Equation (3) that requires further explanation. In Equation (2), tokens are ranked highly if the current model’s loss on the token is large. In contrast, Equation (3) prioritizes tokens for which the current model’s loss is small. The authors should provide more insight into the rationale behind this difference. 2. Is the calculation in Equation (3) performed per training step or per epoch? How does the frequency of this calculation affect to

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. It creatively combines self-modulated signals (via the model’s historical trajectory for REL) and semantic-aware attention metrics, solving prior reliance on external models and loss-only selection. 2. It has rigorous methodology with grounded components and lightweight implementation, plus sufficient validation across models/benchmarks and ablation studies ensuring quality. 3. It follows a clear problem-method-result structure, with precise technical definitions and transparent limitations,

Weaknesses

1. It relies on manual tuning of the token selection ratio ρ, and without an adaptive mechanism to adjust ρ based on model capacity or data quality, it adds overhead for practitioners and limits generalization across diverse model families or domains. 2. The optional EMA-based update of the history model is not fully explored—experiments only use a fixed base model as the history model, leaving unaddressed whether adaptive history model updates could bring more stable guidance in large-horizon t
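For readers unfamiliar with the EMA-based history update that Reviewer 02 mentions, the following is a generic exponential-moving-average sketch. It shows the standard technique only; whether the paper uses this exact rule is not confirmed here, and the function name `ema_update`, the dictionary-of-arrays parameter layout, and the decay `beta` are assumptions made for illustration.

```python
import numpy as np


def ema_update(history_params, current_params, beta=0.99):
    """Return new history parameters as an exponential moving average.

    Each entry becomes beta * history + (1 - beta) * current, so a beta
    close to 1 makes the history model change slowly.
    """
    return {
        name: beta * history_params[name] + (1.0 - beta) * current_params[name]
        for name in history_params
    }


# Usage with toy parameters; beta=0.5 is chosen only to keep the numbers simple.
history = {"w": np.array([1.0, 2.0])}
current = {"w": np.array([3.0, 4.0])}
history = ema_update(history, current, beta=0.5)
```

Setting `beta=1.0` leaves the history parameters unchanged, which corresponds to the fixed history model that the reviewer says the experiments use.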

Reviewer 03 · Rating 8 · Confidence 2

Strengths

This paper is well written. It clearly presents the disadvantages of existing approaches, and those statements are convincing. The experimental results are solid. It covers multiple modern LLMs, including LLaMA-3.2 and 3.1 and Qwen-2.5, at different scales. The results are also evaluated on multiple benchmarks. The ablation studies cover all necessary components.

Weaknesses

I am not in this field so I am afraid that I may not be able to identify any key weakness of this work. So, I set my score 8 (a clear acceptance). I do have a few questions just for clarification (see Questions section). They do not affect my scores.


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management