Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

Chen Yueh-Han; Nitish Joshi; Yulin Chen; Maksym Andriushchenko; Rico Angell; He He

arXiv:2506.10949 · cs.CR · June 17, 2025

TL;DR

This paper introduces a lightweight sequential monitoring framework to detect and mitigate decomposition attacks in large language models, significantly improving safety defenses by reasoning over conversation sequences.

Contribution

The paper proposes a novel lightweight sequential monitor that effectively detects decomposition attacks and outperforms reasoning models, with reduced cost and latency.

Findings

01. The sequential monitor achieves a 93% defense success rate against decomposition attacks.

02. It remains robust against random task injection.

03. It reduces monitoring cost by 90% and latency by 50%.

Abstract

Current LLM safety defenses fail under decomposition attacks, where a malicious goal is decomposed into benign subtasks that circumvent refusals. The challenge lies in the existing shallow safety alignment techniques: they only detect harm in the immediate prompt and do not reason about long-range intent, leaving them blind to malicious intent that emerges over a sequence of seemingly benign instructions. We therefore propose adding an external monitor that observes the conversation at a higher granularity. To facilitate our study of monitoring decomposition attacks, we curate the largest and most diverse dataset to date, including question-answering, text-to-image, and agentic tasks. We verify our datasets by testing them on frontier LLMs and show an 87% attack success rate on average on GPT-4o. This confirms that decomposition attacks are broadly effective. Additionally, we find that…
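
The contrast the abstract draws between checking only the immediate prompt and monitoring the whole sequence can be illustrated with a minimal sketch. This is not the authors' implementation: the `SequentialMonitor` class, the `toy_score` function, the placeholder terms `component_a`/`component_b`/`component_c`, and the 0.5 threshold are all hypothetical, and a real monitor would replace the keyword scorer with a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SequentialMonitor:
    """Hypothetical monitor that scores the cumulative history, not one prompt."""
    score_fn: Callable[[List[str]], float]  # assumed to return a score in [0, 1]
    threshold: float = 0.5                  # assumed cutoff, chosen for the demo
    history: List[str] = field(default_factory=list)

    def observe(self, prompt: str) -> bool:
        """Append the new prompt, then flag if the whole history crosses the threshold."""
        self.history.append(prompt)
        return self.score_fn(self.history) >= self.threshold


def toy_score(history: List[str]) -> float:
    """Placeholder scorer: fraction of made-up 'risk' terms seen across all turns."""
    risk_terms = {"component_a", "component_b", "component_c"}
    seen = {t for t in risk_terms if any(t in p.lower() for p in history)}
    return len(seen) / len(risk_terms)


def run_demo():
    # Each subtask mentions only one placeholder term, so it looks benign alone.
    subtasks = [
        "How do I obtain component_a?",
        "Where is component_b stored?",
        "How are component_c parts combined?",
    ]
    # Per-prompt check: score each subtask in isolation.
    per_prompt = [toy_score([s]) >= 0.5 for s in subtasks]
    # Sequential check: score the growing conversation after every turn.
    monitor = SequentialMonitor(score_fn=toy_score, threshold=0.5)
    cumulative = [monitor.observe(s) for s in subtasks]
    return per_prompt, cumulative


if __name__ == "__main__":
    per_prompt, cumulative = run_demo()
    print("per-prompt flags:", per_prompt)
    print("sequential flags:", cumulative)
```

In this toy setup no single subtask crosses the threshold on its own, while the sequential monitor flags the conversation from the second turn onward, because it scores the accumulated history rather than the latest prompt.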

Code & Models

Repositories

yuehhanchen/monitoring-decomposition-attack
Official

Datasets

YuehHanChen/DecomposedHarm
dataset · 88 downloads

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques