One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin

TL;DR
This paper investigates the safety vulnerabilities of large language models, revealing safety trigger tokens as a key factor, and proposes D-STT, a minimal intervention defense method that effectively reduces harmful outputs while maintaining usability.
Contribution
The paper identifies safety trigger tokens as a core mechanism behind shallow safety alignment in LLMs and introduces D-STT, a simple, single-token defense strategy that improves safety with minimal impact on usability.
Findings
D-STT significantly reduces harmful outputs across various jailbreak attacks.
D-STT maintains high model usability and incurs negligible response time overhead.
Safety trigger tokens are highly similar across different harmful inputs.
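The last finding can be illustrated with a small counting exercise. The sketch below is not the paper's analysis: it assumes that a safety trigger token can be approximated by the first token of a refusal, uses whitespace splitting in place of a real tokenizer, and runs on made-up refusal strings. The helper names first_token and trigger_token_candidates are hypothetical.

```python
from collections import Counter


def first_token(response: str) -> str:
    """Return the first whitespace-delimited token (a stand-in for a real tokenizer)."""
    parts = response.split()
    return parts[0] if parts else ""


def trigger_token_candidates(refusals: list[str], top_k: int = 1) -> list[tuple[str, float]]:
    """Rank first tokens by the share of refusals that begin with them."""
    counts = Counter(first_token(r) for r in refusals if r.strip())
    total = sum(counts.values())
    return [(tok, n / total) for tok, n in counts.most_common(top_k)]


# Hypothetical refusal texts, invented for illustration only (not data from the paper).
refusals = [
    "I cannot help with that request.",
    "I am sorry, but I can't assist with this.",
    "I cannot provide instructions for that.",
    "Sorry, I can't help with that.",
]

candidates = trigger_token_candidates(refusals, top_k=2)
print(candidates)
```

If most refusals share one opening token, that token dominates the ranking, which is the kind of similarity the finding describes.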
Abstract
Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research. However, they remain vulnerable to jailbreak attacks, which manipulate the models into generating harmful responses despite safety alignment. Recent studies have shown that current safety-aligned LLMs undergo shallow safety alignment. In this work, we conduct an in-depth investigation into the underlying mechanism of this phenomenon and reveal that it manifests through learned "safety trigger tokens" that activate the model's safety patterns when paired with the specific input. Through both analysis and empirical verification, we further demonstrate the high similarity of the safety trigger tokens across different harmful inputs. Accordingly, we propose D-STT, a simple yet effective defense algorithm that identifies and…
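This summary does not spell out D-STT's algorithm, so the following is only a hedged sketch of what a single-token intervention could look like in general: the first response token is fixed to a chosen trigger token and the model decodes the rest as usual. The function generate_with_trigger, the toy model toy_next_token, and its transition table are hypothetical stand-ins, not the authors' implementation or any library's API.

```python
from typing import Callable, Sequence

EOS = "<eos>"


def generate_with_trigger(
    next_token: Callable[[Sequence[str]], str],
    prompt_tokens: Sequence[str],
    trigger_token: str,
    max_new_tokens: int = 8,
) -> list[str]:
    """Force the first response token, then decode greedily from the model."""
    output = [trigger_token]  # the only intervention: one fixed token
    while len(output) < max_new_tokens:
        tok = next_token(list(prompt_tokens) + output)
        if tok == EOS:
            break
        output.append(tok)
    return output


# Toy "model" (assumption): the next token depends only on the previous token.
TOY_TRANSITIONS = {"I": "cannot", "cannot": "help", "help": EOS}


def toy_next_token(context: Sequence[str]) -> str:
    return TOY_TRANSITIONS.get(context[-1], EOS)


response = generate_with_trigger(toy_next_token, ["example", "prompt"], "I")
print(response)
```

Because only one token is fixed and the rest of decoding is unchanged, an intervention of this shape adds essentially no generation steps, which is consistent with the negligible response time overhead reported in the findings.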
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI
