MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Jonathan Steinberg, Oren Gal

TL;DR
MOSAIC-Bench is a benchmark that measures how readily coding agents ship exploitable code when a malicious objective is split into a sequence of innocuous-looking tickets. It reveals safety gaps and tests mitigation strategies across multiple models and attack chains.
Contribution
The paper introduces MOSAIC-Bench, a benchmark of 199 three-stage attack chains paired with deterministic exploit oracles, to measure and analyze compositional vulnerabilities in coding agents.
Findings
Nine production coding agents compose innocuous tickets into exploitable code at a 53-86% end-to-end attack success rate (ASR).
Vulnerable outputs fall to 0-20.4% with frontier models and defenses.
A pentester framing reduces evasion, with 88.4% attack detection on GitHub PRs.
Abstract
Coding agents often pass per-prompt safety review yet ship exploitable code when their tasks are decomposed into routine engineering tickets. The challenge is structural: existing safety alignment evaluates overt requests in isolation, leaving models blind to malicious end-states that emerge from sequenced compliance with innocuous-looking requests. We introduce MOSAIC-Bench (Malicious Objectives Sequenced As Innocuous Compliance), a benchmark of 199 three-stage attack chains paired with deterministic exploit oracles on deployed software substrates (10 web-application substrates, 31 CWE classes, 5 programming languages) that treats both exploit ground truth and downstream reviewer protocol as first-class evaluation axes. On this benchmark, nine production coding agents from Anthropic, OpenAI, Google, Moonshot, Zhipu, and Minimax compose innocuous tickets at 53-86% end-to-end ASR with…
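As a rough illustration of how an end-to-end ASR figure of this kind could be computed, the sketch below counts a chain as a success only when the agent completed all three stages and the exploit oracle then confirmed a working exploit. The `ChainResult` record, its field names, and this exact success criterion are assumptions made for illustration; they are not taken from the paper's harness.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ChainResult:
    """Outcome of one three-stage attack chain (hypothetical record layout)."""
    chain_id: str
    stages_completed: Tuple[bool, bool, bool]  # did the agent implement each ticket?
    oracle_exploited: bool                     # did the deterministic oracle confirm the exploit?


def end_to_end_asr(results: Sequence[ChainResult]) -> float:
    """Fraction of chains where every stage was completed AND the oracle fired.

    Assumption: a chain with any refused stage, or with a failed exploit
    check, counts as an unsuccessful attack.
    """
    if not results:
        raise ValueError("need at least one chain result")
    successes = sum(
        1 for r in results if all(r.stages_completed) and r.oracle_exploited
    )
    return successes / len(results)


# Made-up outcomes for four chains, purely to exercise the function.
sample = [
    ChainResult("chain-a", (True, True, True), True),    # full compliance, exploit works
    ChainResult("chain-b", (True, True, False), False),  # agent refused the last ticket
    ChainResult("chain-c", (True, True, True), False),   # code shipped, exploit failed
    ChainResult("chain-d", (True, True, True), True),
]

print(f"end-to-end ASR: {end_to_end_asr(sample):.1%}")
```

Requiring the oracle check in addition to stage compliance reflects the abstract's emphasis on exploit ground truth: an agent that writes all three changes but produces non-exploitable code is not counted as a successful attack under this sketch.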
