Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors
Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, He He

TL;DR
This paper introduces a lightweight sequential monitoring framework that detects and mitigates decomposition attacks on large language models by reasoning over the whole conversation sequence rather than each prompt in isolation.
Contribution
The paper proposes a novel lightweight sequential monitor that effectively detects decomposition attacks and outperforms reasoning models, with reduced cost and latency.
Findings
Achieves 93% defense success rate against decomposition attacks.
Remains robust against random task injection.
Reduces monitoring cost by 90% and latency by 50%.
Abstract
Current LLM safety defenses fail under decomposition attacks, where a malicious goal is decomposed into benign subtasks that circumvent refusals. The challenge lies in existing shallow safety alignment techniques: they only detect harm in the immediate prompt and do not reason about long-range intent, leaving them blind to malicious intent that emerges over a sequence of seemingly benign instructions. We therefore propose adding an external monitor that observes the conversation at a higher granularity. To facilitate our study of monitoring decomposition attacks, we curate the largest and most diverse dataset to date, including question-answering, text-to-image, and agentic tasks. We verify our datasets by testing them on frontier LLMs and show an average attack success rate of 87% on GPT-4o. This confirms that decomposition attacks are broadly effective. Additionally, we find that…
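The idea of monitoring a conversation cumulatively, rather than one prompt at a time, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the class `SequentialMonitor`, the helper `first_flagged_turn`, the threshold value, and the keyword-based `toy_scorer` are all hypothetical names and choices made for illustration. In the paper the monitor is a language model; here a toy keyword scorer stands in for it so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical scorer type: maps the conversation so far to a harm score in [0, 1].
# In practice this would be a call to a lightweight LLM judge (an assumption here).
Scorer = Callable[[List[str]], float]


@dataclass
class SequentialMonitor:
    """Scores the cumulative prompt history after every new turn."""
    scorer: Scorer
    threshold: float = 0.5  # illustrative value, not taken from the paper
    history: List[str] = field(default_factory=list)

    def observe(self, prompt: str) -> bool:
        """Record the prompt; return True if the conversation so far is flagged."""
        self.history.append(prompt)
        return self.scorer(self.history) >= self.threshold


def toy_scorer(history: List[str]) -> float:
    """Toy stand-in for an LLM judge: fraction of placeholder cues seen
    anywhere in the history. The cues are arbitrary illustrative strings."""
    cues = ("acquire", "combine", "conceal")
    text = " ".join(history).lower()
    return sum(cue in text for cue in cues) / len(cues)


def first_flagged_turn(monitor: SequentialMonitor, prompts: List[str]) -> int:
    """Feed prompts in order; return the index of the first flagged turn, or -1."""
    for i, prompt in enumerate(prompts):
        if monitor.observe(prompt):
            return i
    return -1
```

With a threshold of 0.6, each subtask scored alone stays below the threshold in this toy setup, while the cumulative history crosses it once two cues have appeared. That mirrors, in a very simplified way, why a per-prompt check can miss intent that only emerges across turns.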
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques
