One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

Samee Arif; Naihao Deng; Zhijing Jin; Rada Mihalcea

arXiv:2604.25921 · cs.CL · April 30, 2026

TL;DR

This paper introduces Incremental Completion Decomposition (ICD), a trajectory-based jailbreak method that bypasses LLM safety mechanisms by eliciting a sequence of single-word continuations related to a malicious request before eliciting the full response. The authors report higher attack success rates than existing methods on AdvBench, JailbreakBench, and StrongREJECT.

Contribution

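The paper presents ICD, an incremental approach to jailbreaking LLMs, together with variants that differ in how the one-word continuations are chosen (manually or by a model) and in whether the final response is prefilled. It also offers a theoretical account of why the attack works and reports higher attack success rates than existing methods.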

Findings

01

ICD achieves a higher Attack Success Rate (ASR) than existing methods on AdvBench, JailbreakBench, and StrongREJECT.

02

ICD variants, which use manually picked or model-generated one-word continuations and optionally prefill the final response, outperform prior methods across a broad set of model families.

03

Mechanistic evidence shows successful attacks suppress safety-related representations.

Abstract

Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD), a trajectory-based jailbreak strategy that elicits a sequence of single-word continuations related to a malicious request before eliciting the full response. In addition, we propose variants of ICD by manually picking or model-generating the one-word continuation, as well as prefilling when eliciting the full model response in the final step. We systematically evaluate these variants across a broad set of model families, demonstrating superior Attack Success Rate (ASR) on AdvBench, JailbreakBench, and StrongREJECT compared to existing methods. In addition, we provide a theoretical account of why ICD is effective and present mechanistic evidence that…
