Token-level Data Selection for Safe LLM Fine-tuning

Yanping Li; Zhening Liu; Zijian Li; Zehong Lin; Jun Zhang

arXiv:2603.01185 · cs.CL · March 3, 2026


Open Access · 3 Reviews

TL;DR

This paper introduces TOSS, a token-level data selection framework for fine-tuning large language models that effectively balances safety and utility by identifying and removing unsafe tokens during training.

Contribution

The paper proposes a novel token-level safety risk measure and a progressive refinement strategy for safer LLM fine-tuning, and reports that they outperform existing sample-level methods.

Findings

01. TOSS effectively identifies unsafe tokens during fine-tuning.

02. TOSS improves downstream task performance while enhancing safety.

03. The progressive refinement strategy further enhances safety detection accuracy.

Abstract

Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific…
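The scoring idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes per-token losses from the two reference models are already available as plain lists, and that a token is riskier when the safety-degraded model's loss is much lower than the utility-oriented model's. That sign convention, the `drop_fraction` threshold, and all function names below are assumptions made for illustration.

```python
from typing import List


def token_risk_scores(loss_degraded: List[float], loss_utility: List[float]) -> List[float]:
    """Per-token risk as a loss difference between two reference models.

    Assumed convention: score = utility-model loss minus degraded-model loss,
    so a high score means the safety-degraded model finds the token unusually easy.
    """
    if len(loss_degraded) != len(loss_utility):
        raise ValueError("both loss lists must cover the same tokens")
    return [u - d for d, u in zip(loss_degraded, loss_utility)]


def keep_mask(scores: List[float], drop_fraction: float) -> List[bool]:
    """Return True for tokens that stay in the training loss.

    Drops the highest-scoring fraction of tokens (hypothetical selection rule).
    """
    n_drop = int(len(scores) * drop_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:n_drop])
    return [i not in dropped for i in range(len(scores))]


def masked_mean_loss(token_losses: List[float], mask: List[bool]) -> float:
    """Average the training loss over kept tokens only."""
    kept = [loss for loss, keep in zip(token_losses, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Toy example with made-up losses for a four-token response.
scores = token_risk_scores([2.0, 0.5, 1.8, 0.3], [2.1, 2.5, 1.9, 2.4])
mask = keep_mask(scores, drop_fraction=0.5)
print(mask)  # [True, False, True, False]
```

Masking at the token level keeps the remaining tokens of a sample in the loss, which is the contrast with sample-level filtering that the abstract draws.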

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

* Originality: The paper makes a highly original contribution by proposing the first token-level data selection framework specifically for safeguarding LLM fine-tuning. The fundamental insight that "the unit of safety degradation is not the sample, but the token" is strongly motivated by a systematic token-level diagnosis, distinguishing this work significantly from prior sample-level methods (e.g., SEAL).
* Quality & Clarity: The proposed methodology, TOSS, is technically sound and clearly articu

Weaknesses

* Computational Cost of Reference Model Training: The core TOSS framework relies on training two full reference models ($f_{\theta^h}$ and $f_{\theta^u}$) via SFT, which can be computationally expensive, particularly for much larger LLMs (e.g., Llama-70B). While LoRA fine-tuning is used (Appendix B), the overall process (training two models and then performing token assessment on the entire custom dataset) is significantly more intensive than sample-level filtering that often relies on a single,

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The paper shifts the focus from sample-level to token-level selection and provides compelling evidence. The KL-divergence analysis (Figure 2) shows that safety-degrading signals are concentrated in specific tokens rather than entire samples.
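A per-position KL-divergence diagnosis of this kind can be sketched as follows. The excerpt does not say which two models are compared in Figure 2, so this example simply assumes two next-token distributions are available at every position of a response. The function names and toy distributions are hypothetical.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def per_position_kl(dists_a: List[List[float]], dists_b: List[List[float]]) -> List[float]:
    """KL divergence between two models' next-token distributions at each position."""
    return [kl_divergence(p, q) for p, q in zip(dists_a, dists_b)]


# Toy two-token vocabulary: the models agree at position 0 and disagree at position 1.
model_a = [[0.5, 0.5], [0.9, 0.1]]
model_b = [[0.5, 0.5], [0.1, 0.9]]
kls = per_position_kl(model_a, model_b)
spike = max(range(len(kls)), key=lambda i: kls[i])
print(spike)  # 1
```

A profile in which a few positions carry most of the divergence is the pattern the reviewer describes as evidence for token-level rather than sample-level selection.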

Weaknesses

- Evaluation over-relies on win rate (relative preference) and fails to capture absolute safety; e.g., a model can “win” 88% yet still emit harmful content.
- Other safety metrics are missing: Attack Success Rate (ASR), harmful-content generation rate, and false-refusal (over-safety) rate are neither reported nor analyzed.
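The gap between a relative and an absolute metric can be shown with a small sketch. The counts below are invented for illustration and are not results from the paper. The `JudgedResponse` record and both metric functions are hypothetical, and ASR is taken here as the fraction of adversarial prompts that yield a harmful response.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgedResponse:
    beats_baseline: bool  # judge preferred this response over the baseline's
    is_harmful: bool      # judge labeled this response as harmful


def win_rate(responses: List[JudgedResponse]) -> float:
    """Fraction of responses preferred over the baseline (relative metric)."""
    if not responses:
        raise ValueError("no responses to score")
    return sum(r.beats_baseline for r in responses) / len(responses)


def attack_success_rate(responses: List[JudgedResponse]) -> float:
    """Fraction of responses that are harmful (absolute metric)."""
    if not responses:
        raise ValueError("no responses to score")
    return sum(r.is_harmful for r in responses) / len(responses)


# Made-up evaluation set: 88 wins, 20 of which are still harmful, plus 12 harmful losses.
responses = (
    [JudgedResponse(True, False)] * 68
    + [JudgedResponse(True, True)] * 20
    + [JudgedResponse(False, True)] * 12
)
print(win_rate(responses), attack_success_rate(responses))
```

In this toy set a response can be preferred over an even worse baseline answer while still being harmful, so a high win rate coexists with a substantial ASR.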

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. To my knowledge, this is the first work to propose a token-level data selection framework specifically for safe LLM fine-tuning. This fine-grained approach convincingly addresses a key limitation of coarse-grained, sample-level methods, which often discard valuable, task-relevant information.
2. The motivation is well-supported by a clear diagnostic analysis. The KL divergence analysis across token positions provides compelling empirical evidence for the paper's central hypothesis that safet

Weaknesses

1. The method's effectiveness is heavily dependent on the quality of the initial reference models, which introduces significant computational/data overhead and a potential "bootstrapping paradox". The framework requires two pre-trained reference models (a safety-degraded model and a utility-oriented model), which demands substantial extra resources and datasets. More critically, there is a circular dependency: to clean a custom dataset, one needs reference models trained on datasets that are ass


Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques