TL;DR
This paper introduces CSTM-Bench, a comprehensive dataset and evaluation framework for cross-session threat detection in AI agents, highlighting the challenges and proposing algorithms to improve detection robustness.
Contribution
It provides a new dataset, measurement methodology, and algorithms for cross-session threat detection, addressing limitations of memoryless guardrails in AI agents.
Findings
Detection recall drops by half when moving from isolated to cross-session scenarios.
A bounded-memory coreset reader maintains higher recall across shards.
Introducing CSR_prefix as a stability metric improves detection benchmarking.
Abstract
AI-agent guardrails are memoryless: each message is judged in isolation, so an adversary who spreads a single attack across dozens of sessions slips past every session-bound detector because only the aggregate carries the payload. We make three contributions to cross-session threat detection. (1) Dataset. CSTM-Bench comprises 26 executable attack taxonomies classified by kill-chain stage and cross-session operation (accumulate, compose, launder, inject_on_reader), each bound to one of seven identity anchors that ground-truth "violation" as a policy predicate, plus matched Benign-pristine and Benign-hard confounders. It is released on Hugging Face as intrinsec-ai/cstm-bench with two 54-scenario splits: dilution (compositional) and cross_session (12 isolation-invisible scenarios produced by a closed-loop rewriter that softens surface phrasing while preserving cross-session artefacts). (2)…
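The core claim of the abstract, that a session-bound detector misses an attack whose payload exists only in the aggregate, can be illustrated with a toy sketch. Everything here is hypothetical and chosen for illustration: the action names, the `REQUIRED_FRAGMENTS` predicate, the `anchor` field, and both detector functions are assumptions, not the CSTM-Bench schema or the paper's algorithms. In particular, this sketch keeps unbounded per-anchor state and does not attempt to reproduce the bounded-memory coreset reader mentioned in the findings.

```python
from collections import defaultdict

# Hypothetical policy predicate: a violation exists once all three
# fragments have been observed for the same identity anchor.
REQUIRED_FRAGMENTS = {"read_credentials", "encode_payload", "send_external"}


def session_bound_detector(session):
    """Memoryless check: flag a session only if it alone holds every fragment."""
    return REQUIRED_FRAGMENTS <= set(session["actions"])


def cross_session_detector(sessions):
    """Accumulate actions per anchor across sessions (unbounded memory).

    Returns the set of anchors whose aggregated actions satisfy the predicate.
    """
    seen = defaultdict(set)
    flagged = set()
    for s in sessions:
        seen[s["anchor"]] |= set(s["actions"])
        if REQUIRED_FRAGMENTS <= seen[s["anchor"]]:
            flagged.add(s["anchor"])
    return flagged


# Toy trace: one anchor spreads the attack over three sessions,
# interleaved with an unrelated benign session.
sessions = [
    {"anchor": "user-17", "actions": ["read_credentials"]},
    {"anchor": "user-42", "actions": ["summarize_doc"]},
    {"anchor": "user-17", "actions": ["encode_payload"]},
    {"anchor": "user-17", "actions": ["send_external"]},
]

per_session_hits = [session_bound_detector(s) for s in sessions]
aggregate_hits = cross_session_detector(sessions)
```

In this trace no individual session trips the memoryless check, while the aggregate view flags `user-17`. A practical reader would need to bound the per-anchor state, which is the problem the bounded-memory approach in the paper addresses.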
