Jailbreaking in the Haystack

Rishi Rajesh Shah; Chen Henry Wu; Shashwat Saxena; Ziqian Zhong; Alexander Robey; Aditi Raghunathan

arXiv:2511.04707·cs.CR·November 10, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces NINJA, a novel jailbreak method exploiting long-context language models by appending benign content to harmful goals, revealing significant safety vulnerabilities and outperforming prior methods in success rate and resource efficiency.

Contribution

The paper presents NINJA, a low-resource, transferable jailbreak technique that leverages context length and goal positioning to expose safety flaws in state-of-the-art language models.

Findings

01

NINJA significantly increases attack success rates on multiple models.

02

Long contexts with carefully positioned goals are vulnerable to jailbreaks.

03

Increasing context length can be more effective than more trials under fixed compute.
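
The trade-off behind the third finding can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes compute is measured purely in input tokens, and the function name `trials_within_budget`, the budget, and the prompt lengths are all hypothetical, not taken from the paper. It shows only that, under a fixed budget, a longer prompt leaves room for fewer independent trials; it says nothing about which choice yields a higher success rate.

```python
def trials_within_budget(token_budget: int, context_length: int) -> int:
    """Return how many attempts of a given prompt length fit in a fixed input-token budget."""
    if token_budget < 0 or context_length <= 0:
        raise ValueError("token_budget must be >= 0 and context_length must be > 0")
    return token_budget // context_length


# Hypothetical budget and prompt lengths, chosen only for illustration.
TOKEN_BUDGET = 100_000
for context_length in (1_000, 10_000, 50_000):
    n = trials_within_budget(TOKEN_BUDGET, context_length)
    print(f"context_length={context_length:>6} tokens -> {n} trials")
```

A real accounting would likely also include output tokens and any cost of generating the context, which this sketch deliberately leaves out.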

Abstract

Recent advances in long-context language models (LMs) have enabled million-token inputs, expanding their capabilities across complex tasks like computer-use agents. Yet, the safety implications of these extended contexts remain unclear. To bridge this gap, we introduce NINJA (short for Needle-in-haystack jailbreak attack), a method that jailbreaks aligned LMs by appending benign, model-generated content to harmful user goals. Critical to our method is the observation that the position of harmful goals plays an important role in safety. Experiments on a standard safety benchmark, HarmBench, show that NINJA significantly increases attack success rates across state-of-the-art open and proprietary models, including LLaMA, Qwen, Mistral, and Gemini. Unlike prior jailbreaking methods, our approach is low-resource, transferable, and less detectable. Moreover, we show that NINJA is compute-optimal…

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The paper demonstrates that long, benign contexts can be an effective attack vector, which is practical and stealthy due to the use of non-malicious content.

2. The study provides a clear empirical analysis of how goal positioning within the context affects attack success.

Weaknesses

1. The paper fails to adequately distinguish its core contribution from existing long-context attacks, particularly Many-shot Jailbreaking [1]. While the authors note that NINJA uses "entirely innocuous context," this distinction is superficial. Both methods exploit long contexts to dilute safety alignment; the difference between "explicitly harmful" and "benign" examples is a matter of degree rather than a fundamental mechanistic difference. A deeper discussion of the underlying failure mode (e

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. The paper presents a clear identification and empirical validation of goal positioning as a critical safety vulnerability. This reframes positional bias from a simple capability quirk to a fundamental, exploitable flaw in safety alignment.

2. The experiment in Section 5.3 / Figure 5, which compares relevant vs. irrelevant context, is a cool and unique contribution. It proves that the attack is not merely "confusing" the model with noise, but actively "distracting" its attention with semantic

Weaknesses

The paper clearly distinguishes NINJA (relevant context, goal at start) from Cognitive Overload (distracting context, goal at end). However, it doesn't complete the "cross-over" experiment. The authors' own findings show NINJA fails if the goal is at the end (Figure 3). To fully prove that relevance is the key differentiator, they should have also tested a "Cognitive Overload at Start" (i.e., goal at start + irrelevant context). This would isolate whether the "goal-at-start" phenomenon is unive

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. This paper introduces NINJA, a simple yet powerful jailbreak method that uses entirely benign, semantically relevant content instead of traditional adversarial prompts, revealing a new class of stealthy attacks.

2. This paper provides compelling empirical evidence that the position of harmful goals dramatically affects jailbreak success, uncovering a previously underexplored safety weakness in long-context LMs.

3. The authors evaluate their method across diverse models and agent settings, sho

Weaknesses

1. While the empirical findings on goal positioning are strong, the paper does not offer a clear theoretical framework or model-level analysis (e.g., attention distribution) to explain why early-positioned goals are more effective.

2. The experiments focus primarily on base instruct models and do not extensively evaluate NINJA against recent or state-of-the-art defense techniques.

3. The comparison is limited to only two prior jailbreak methods and more diverse baselines such as GCG and AutoDAN

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)