ShallowJail: Steering Jailbreaks against Large Language Models

Shang Liu; Hanyu Pei; Zeyan Liu

arXiv:2602.07107·cs.CR·February 17, 2026

TL;DR

ShallowJail presents a new method to attack aligned large language models by manipulating initial tokens, revealing vulnerabilities in current safety measures and highlighting the need for more robust alignment techniques.

Contribution

The paper introduces ShallowJail, a novel, efficient attack that exploits shallow alignment in LLMs by manipulating the initial tokens during inference, and demonstrates substantial safety degradation in state-of-the-art models.

Findings

01

ShallowJail effectively misleads LLMs into harmful outputs.

02

It substantially degrades the safety of responses from state-of-the-art LLMs.

03

The attack is efficient and requires minimal resources.

Abstract

Large Language Models (LLMs) have been successful in numerous fields. Alignment is usually applied to prevent them from being used for harmful purposes. However, aligned LLMs remain vulnerable to jailbreak attacks that deliberately mislead them into producing harmful outputs. Existing jailbreaks are either black-box, relying on carefully crafted prompts that lack stealth, or white-box, requiring resource-intensive computation. In light of these challenges, we introduce ShallowJail, a novel attack that exploits shallow alignment in LLMs. ShallowJail can misguide LLMs' responses by manipulating the initial tokens during inference. Through extensive experiments, we demonstrate the effectiveness of ShallowJail, which substantially degrades the safety of state-of-the-art LLM responses. Our code is available at https://github.com/liuup/ShallowJail.

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)