Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs
Wenyu Chen, Xiangtao Meng, Chuanchao Zang, Li Wang, Xinyu Gao, Jianing Wang, Peng Zhan, Zheng Li, Shanqing Guo

TL;DR
This paper introduces TriageFuzz, a token-aware jailbreak fuzzing framework for LLMs that uses surrogate models to efficiently identify prompt regions causing refusals, significantly reducing query costs while maintaining high attack success rates.
Contribution
It presents a novel token-level analysis of refusal behavior and a surrogate-based, token-aware fuzzing approach that improves attack efficiency against LLMs.
Findings
Achieves a 90% attack success rate with 70% fewer queries.
Outperforms existing methods under strict query budgets.
Demonstrates cross-model consistency in refusal tendencies.
Abstract
Large Language Models (LLMs) are widely deployed, yet are vulnerable to jailbreak prompts that elicit policy-violating outputs. Although prior studies have uncovered these risks, they typically treat all tokens as equally important during prompt mutation, overlooking the varying contributions of individual tokens to triggering model refusals. Consequently, these attacks introduce substantial redundant searching under query-constrained scenarios, reducing attack efficiency and hindering comprehensive vulnerability assessment. In this work, we conduct a token-level analysis of refusal behavior and observe that token contributions are highly skewed rather than uniform. Moreover, we find strong cross-model consistency in refusal tendencies, enabling the use of a surrogate model to estimate token-level contributions to the target model's refusals. Motivated by these findings, we propose…
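The idea of skewed token-level contributions can be illustrated with a small sketch. The code below is not the paper's method: it uses plain leave-one-out ablation, and `toy_refusal_score` is a hypothetical stand-in (with made-up keyword weights) for whatever refusal estimate a surrogate model would provide. The function names `token_contributions` and `top_share` are likewise assumptions introduced only for this illustration.

```python
from typing import Callable, List


def toy_refusal_score(tokens: List[str]) -> float:
    """Hypothetical stand-in for a surrogate model's refusal probability.

    The keyword weights are invented for illustration only.
    """
    weights = {"forbidden": 0.6, "secret": 0.25}
    return min(1.0, 0.05 + sum(weights.get(t.lower(), 0.0) for t in tokens))


def token_contributions(
    tokens: List[str], score: Callable[[List[str]], float]
) -> List[float]:
    """Leave-one-out attribution: drop in score when each token is removed."""
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def top_share(contribs: List[float], k: int = 1) -> float:
    """Fraction of the total positive contribution held by the top-k tokens."""
    positive = sorted((c for c in contribs if c > 0), reverse=True)
    total = sum(positive)
    return sum(positive[:k]) / total if total else 0.0


prompt = "Tell me the forbidden secret recipe".split()
contribs = token_contributions(prompt, toy_refusal_score)

# Rank tokens by how much they raise the (toy) refusal score.
ranked = sorted(zip(prompt, contribs), key=lambda pair: pair[1], reverse=True)
for token, contribution in ranked:
    print(f"{token:10s} {contribution:.2f}")
print(f"share held by top token: {top_share(contribs, 1):.2f}")
```

In this toy setting a single token accounts for most of the score, which is the kind of skew the abstract describes. With a real surrogate, the scoring function would be replaced by a model-based refusal estimate, and leave-one-out costs one scorer call per token.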
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing
