SQL Injection Jailbreak: A Structural Disaster of Large Language Models
Jiawei Zhao, Kejiang Chen, Weiming Zhang, Nenghai Yu

TL;DR
This paper introduces SQL Injection Jailbreak (SIJ), a novel method exploiting prompt construction vulnerabilities in large language models to induce harmful outputs, revealing a new security weakness and proposing an effective defense.
Contribution
The paper presents SIJ, a new prompt-based jailbreak technique for LLMs, and proposes a simple adaptive defense method called Self-Reminder-Key.
Findings
Near 100% success rate on open-source models
Over 85% success rate on closed-source models
Effective defense demonstrated with Self-Reminder-Key
Abstract
Large Language Models (LLMs) are susceptible to jailbreak attacks that can induce them to generate harmful content. Previous jailbreak methods primarily exploited the internal properties or capabilities of LLMs, such as optimization-based jailbreak methods and methods that leveraged the model's context-learning abilities. In this paper, we introduce a novel jailbreak method, SQL Injection Jailbreak (SIJ), which targets the external properties of LLMs, specifically, the way LLMs construct input prompts. By injecting jailbreak information into user prompts, SIJ successfully induces the model to output harmful content. For open-source models, SIJ achieves near 100% attack success rates on five well-known LLMs on the AdvBench and HEx-PHI, while incurring lower time costs compared to previous methods. For closed-source models, SIJ achieves an average attack success rate over 85% across five…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDigital and Cyber Forensics
