SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage
Xiaoning Dong, Wenbo Hu, Wei Xu, Tianxing He

TL;DR
This paper introduces SATA, a new paradigm for bypassing LLM safety measures using simple assistive tasks to encode malicious intent, achieving high success rates in jailbreak experiments.
Contribution
SATA is a novel jailbreak method that links masked queries with assistive tasks, outperforming existing approaches in effectiveness and efficiency.
Findings
Achieves an 85% attack success rate with the masked language model (MLM) assistive task.
Significantly outperforms baselines on the AdvBench dataset.
Effectively encodes malicious intent using simple assistive tasks.
Abstract
Large language models (LLMs) have made significant advancements across various tasks, but their safety alignment remains a major concern. Exploring jailbreak prompts can expose LLMs' vulnerabilities and guide efforts to secure them. Existing methods primarily design sophisticated instructions for the LLM to follow, or rely on multiple iterations, which could hinder the performance and efficiency of jailbreaks. In this work, we propose a novel jailbreak paradigm, Simple Assistive Task Linkage (SATA), which can effectively circumvent LLM safeguards and elicit harmful responses. Specifically, SATA first masks harmful keywords within a malicious query to generate a relatively benign query containing one or multiple [MASK] special tokens. It then employs a simple assistive task such as a masked language model task or an element lookup by position task to encode the semantics of the masked…
Taxonomy
Topics: Digital and Cyber Forensics · Privacy-Preserving Technologies in Data · Data Quality and Management
