One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models

Haoran Gu; Handing Wang; Yi Mei; Mengjie Zhang; Yaochu Jin

arXiv:2505.07167 · cs.CR · January 5, 2026

TL;DR

This paper investigates why safety-aligned large language models remain vulnerable to jailbreak attacks, identifies safety trigger tokens as a key factor, and proposes D-STT, a minimal-intervention defense that reduces harmful outputs while maintaining usability.

Contribution

The paper identifies safety trigger tokens as a core mechanism behind shallow safety alignment in LLMs and introduces D-STT, a simple single-token defense strategy that improves safety with minimal impact on usability.

Findings

01. D-STT significantly reduces harmful outputs across various jailbreak attacks.

02. D-STT maintains high model usability and incurs negligible response time overhead.

03. Safety trigger tokens are highly similar across different harmful inputs.
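
The findings above suggest a two-step idea: observe which token refusals tend to start with, then fix that single token at the start of decoding. The toy sketch below illustrates that idea only; it is not the authors' implementation. The whitespace "tokenizer", the sample refusal strings, the `generate` callback with its `prefix` parameter, and all function names are hypothetical stand-ins, since this page does not describe the actual procedure.

```python
from collections import Counter
from typing import Callable


def first_token(response: str) -> str:
    """Return the first whitespace-delimited token (a toy stand-in for a real tokenizer)."""
    parts = response.split()
    return parts[0] if parts else ""


def most_common_trigger(refusals: list[str]) -> tuple[str, float]:
    """Return the most frequent first token and the share of refusals that start with it."""
    if not refusals:
        raise ValueError("need at least one refusal sample")
    counts = Counter(first_token(r) for r in refusals)
    token, n = counts.most_common(1)[0]
    return token, n / len(refusals)


def generate_with_trigger(generate: Callable[..., str], prompt: str, trigger: str) -> str:
    """Fix the first output token to `trigger`, then let `generate` continue from it."""
    continuation = generate(prompt, prefix=trigger)
    return f"{trigger} {continuation}".strip()


# Hypothetical refusal samples; a real study would collect model outputs on harmful prompts.
refusals = [
    "I cannot help with that request.",
    "I am sorry, but I cannot assist.",
    "Sorry, I cannot provide that.",
    "I can't help with this.",
]
trigger, share = most_common_trigger(refusals)


def toy_generate(prompt: str, prefix: str = "") -> str:
    """Hypothetical model stub: continues with a refusal only when primed with the trigger."""
    return "cannot help with that." if prefix == "I" else "Sure, here is ..."


reply = generate_with_trigger(toy_generate, "some jailbreak prompt", trigger)
```

With a real model, the `generate` callback would wrap the model's decoding loop and the trigger would be a vocabulary token rather than a whitespace-split word. Whether the trigger should be applied to every prompt or only some is a design question this sketch does not address.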

Abstract

Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research. However, they remain vulnerable to jailbreak attacks, which manipulate the models into generating harmful responses despite safety alignment. Recent studies have shown that current safety-aligned LLMs undergo shallow safety alignment. In this work, we conduct an in-depth investigation into the underlying mechanism of this phenomenon and reveal that it manifests through learned "safety trigger tokens" that activate the model's safety patterns when paired with the specific input. Through both analysis and empirical verification, we further demonstrate the high similarity of the safety trigger tokens across different harmful inputs. Accordingly, we propose D-STT, a simple yet effective defense algorithm that identifies and…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI