ShallowJail: Steering Jailbreaks against Large Language Models
Shang Liu, Hanyu Pei, Zeyan Liu

TL;DR
ShallowJail presents a new method to attack aligned large language models by manipulating initial tokens, revealing vulnerabilities in current safety measures and highlighting the need for more robust alignment techniques.
Contribution
The paper introduces ShallowJail, a novel, efficient attack exploiting shallow alignment in LLMs by token manipulation, demonstrating significant safety degradation in state-of-the-art models.
Findings
ShallowJail effectively misleads LLMs into harmful outputs.
It significantly reduces the safety of current LLMs.
The attack is efficient and requires minimal resources.
Abstract
Large Language Models (LLMs) have been successful in numerous fields. Alignment is usually applied to prevent them from being used for harmful purposes. However, aligned LLMs remain vulnerable to jailbreak attacks that deliberately mislead them into producing harmful outputs. Existing jailbreaks are either black-box, relying on carefully crafted, unstealthy prompts, or white-box, requiring resource-intensive computation. In light of these challenges, we introduce ShallowJail, a novel attack that exploits shallow alignment in LLMs. ShallowJail can misguide LLMs' responses by manipulating the initial tokens during inference. Through extensive experiments, we demonstrate the effectiveness of ShallowJail, which substantially degrades the safety of state-of-the-art LLM responses. Our code is available at https://github.com/liuup/ShallowJail.
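To make the notion of "shallow alignment" concrete, here is a deliberately tiny toy sketch. It is not the authors' method and involves no real LLM: the lookup tables, token strings, and function names (`FIRST_TOKEN_PROBS`, `CONTINUATION`, `generate`, `is_refusal`) are all hypothetical and exist only for illustration. The sketch assumes a "model" whose refusal behavior is decided entirely by its first token, so overriding that one token changes the whole response.

```python
# Toy illustration only: a hand-written lookup "model", not a real LLM
# and not the ShallowJail implementation. All names and values are hypothetical.

REFUSE = "I cannot"
COMPLY = "Sure,"

# Assumed first-token distribution of an "aligned" toy model on a flagged prompt.
FIRST_TOKEN_PROBS = {REFUSE: 0.95, COMPLY: 0.05}

# In this toy, the continuation depends only on the first token,
# which is the "shallow" part: safety lives in the opening of the response.
CONTINUATION = {
    REFUSE: ["help", "with", "that."],
    COMPLY: ["here", "is", "[placeholder content]"],
}


def generate(forced_first_token=None):
    """Greedy decode; optionally override the first token."""
    if forced_first_token is None:
        # Pick the most probable first token.
        first = max(FIRST_TOKEN_PROBS, key=FIRST_TOKEN_PROBS.get)
    else:
        first = forced_first_token
    return " ".join([first] + CONTINUATION[first])


def is_refusal(text):
    """Classify a response as a refusal by its opening phrase."""
    return text.startswith(REFUSE)


default_output = generate()
steered_output = generate(forced_first_token=COMPLY)

print(default_output)
print(steered_output)
```

In this toy, the unmodified decode is classified as a refusal and the decode with an overridden first token is not. A real model's behavior is far less clear-cut, and the paper's own mechanism for manipulating initial tokens should be taken from the linked repository rather than from this sketch.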
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
