
TL;DR
This paper studies how hard it is to find natural backdoors in large language models. It shows that they exist and can be recovered with a simple search strategy, even though finding them is harder than mounting a standard jailbreak.
Contribution
It formalizes the problem of discovering natural backdoors in LLMs, proposes a greedy search method, and demonstrates their presence and recoverability.
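The summary does not spell out the paper's greedy search, so the following is only a minimal sketch of what a greedy, position-by-position search over token sequences could look like. The function name `greedy_token_search`, its parameters, and the toy scoring function are all hypothetical; in a real setting the score would come from querying a model, which is not shown here.

```python
import random
from typing import Callable, List, Sequence


def greedy_token_search(
    score: Callable[[Sequence[int]], float],
    vocab_size: int,
    seq_len: int,
    n_candidates: int = 32,
    n_sweeps: int = 3,
    seed: int = 0,
) -> List[int]:
    """Greedy coordinate search: keep a single-token swap only if it raises the score."""
    rng = random.Random(seed)
    # Start from a random token sequence (an assumption; any initialization works).
    tokens = [rng.randrange(vocab_size) for _ in range(seq_len)]
    best = score(tokens)
    for _ in range(n_sweeps):
        for pos in range(seq_len):
            # Sample candidate replacement tokens for this position.
            candidates = rng.sample(range(vocab_size), min(n_candidates, vocab_size))
            for cand in candidates:
                trial = tokens[:pos] + [cand] + tokens[pos + 1:]
                s = score(trial)
                if s > best:  # accept strictly improving swaps only
                    tokens, best = trial, s
    return tokens


# Toy stand-in for a model-based objective: count positions matching a hidden target.
TARGET = [7, 3, 9, 1]


def toy_score(seq: Sequence[int]) -> float:
    return sum(a == b for a, b in zip(seq, TARGET))


found = greedy_token_search(toy_score, vocab_size=10, seq_len=4, n_candidates=10)
print(found, toy_score(found))
```

Because the toy objective decomposes by position and every token is tried at each position, the search recovers the hidden target here. A model-based objective offers no such guarantee, and greedy search can stall in local optima.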
Findings
Natural backdoors exist in LLMs and can be found with simple strategies.
The task of finding such backdoors is harder than standard jailbreaks.
The discovered token sequences lie in low-probability regions, suggesting that they emerged implicitly during training.
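One way to read "low-probability regions" is that the sequences receive a low log-likelihood under a language model. The summary does not say how the paper measures this, so the sketch below is an assumption: it scores sequences under a hand-written toy bigram model (the tables `FIRST` and `BIGRAM` are made up for illustration) instead of an actual LLM.

```python
import math
from typing import Dict, Sequence, Tuple


def sequence_log_prob(
    tokens: Sequence[str],
    first: Dict[str, float],
    bigram: Dict[Tuple[str, str], float],
) -> float:
    """Sum of log p(x_1) and log p(x_t | x_{t-1}) under a toy bigram model."""
    lp = math.log(first[tokens[0]])
    for prev, cur in zip(tokens, tokens[1:]):
        lp += math.log(bigram[(prev, cur)])
    return lp


# Hypothetical two-token vocabulary with made-up probabilities.
FIRST = {"a": 0.9, "b": 0.1}
BIGRAM = {
    ("a", "a"): 0.9, ("a", "b"): 0.1,
    ("b", "a"): 0.5, ("b", "b"): 0.5,
}

common_lp = sequence_log_prob(["a", "a", "a"], FIRST, BIGRAM)
rare_lp = sequence_log_prob(["b", "b", "a"], FIRST, BIGRAM)
print(common_lp, rare_lp)
```

The second sequence scores much lower than the first, which is the sense in which a sequence would sit in a low-probability region of the model's distribution.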
Abstract
Large language models (LLMs) are known to be vulnerable to jailbreak attacks, which typically rely on carefully designed prompts with explicit semantic structure. These attacks generally fix an adversarial instruction and optimize small adversarial components (e.g., suffixes or prefixes). In this setting, prompt structure is fundamental to attack performance, and recent results show that even simple random search can perform strongly when combined with sophisticated prompt design. More recently, it has been observed that harmful behaviors can be elicited even without the adversarial prompt, relying solely on optimized token sequences. This suggests the existence of natural backdoors, i.e., token sequences that emerged naturally during LLM training and trigger unsafe outputs without any meaningful instruction. However, despite these observations, this setting remains…
