FragBench: Cross-Session Attacks Hidden in Benign-Looking Fragments
Astha Mehta, Niruthiha Selvanayagam, Cedric Lam, Hengxu Li, Phuc-Nguyen Nguyen, Raymond Lee, Olivia McGoffin, My (Isabella) Luong, Arthur Collé, Jamie Johnson, David Williams-King, Linh Le

TL;DR
FragBench is a benchmark for detecting malicious goals that are split into benign-looking prompts across separate LLM sessions. It argues for modeling a user's interactions as a graph rather than judging each prompt in isolation.
Contribution
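The paper presents a new benchmark drawn from 24 real-world cyber-incident campaigns, with two paired tasks: an adversarial rewriter that hardens fragments against a single-turn safety judge (FragBench Attack) and a graph-based user-level detector (FragBench Defense). It also demonstrates the effectiveness of graph-based models on the detection task.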
Findings
Graph-based detectors achieve F1 scores of 0.88-0.96.
Single-turn safety judges perform near chance on cross-session attacks.
Cross-session interaction modeling is crucial for LLM safety.
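For readers less familiar with the metric quoted above, F1 is the harmonic mean of precision and recall. The following is a minimal sketch of that standard definition; the function name and the example counts are illustrative assumptions and are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    tp, fp, fn are true-positive, false-positive, and false-negative counts.
    The counts used below are hypothetical, not results from FragBench.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical detector: 9 attackers caught, 1 benign user flagged, 1 attacker missed.
example = f1_score(tp=9, fp=1, fn=1)
```

Because F1 ignores true negatives, a detector cannot score well simply by labeling every user benign when attackers are rare.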
Abstract
An attacker can split a malicious goal into sub-prompts that each look benign on their own and only become harmful in combination. Existing LLM safety benchmarks evaluate prompts one at a time, or across turns of a single chat, and so do not look for a malicious signal spread across separate sessions with no shared context. We build FragBench, a benchmark drawn from 24 real-world cyber-incident campaigns, which keeps the full attack trail: the multi-fragment kill chain, the per-fragment safety-judge verdicts, sandboxed execution traces, and a matched set of benign cover sessions. FragBench splits this trail into two paired tasks: an adversarial rewriter that hardens fragments against a single-turn safety judge (FragBench Attack), and a graph-based user-level detector trained on the resulting interactions (FragBench Defense). The single-turn judge is near chance on the released corpus by…
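To make the idea of user-level, cross-session detection concrete, here is a toy sketch that links one user's prompts across sessions and measures how many sessions a single connected group of prompts spans. It is not the authors' detector, which the abstract describes as a trained graph-based model. The token-overlap edge rule, the `min_shared` threshold, and all function names are assumptions made only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lowercased words longer than 3 characters, with edge punctuation stripped."""
    words = (w.strip(".,:;!?()\"'").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def build_user_graph(prompts, min_shared=2):
    """Build an adjacency map over one user's prompts.

    prompts: list of (session_id, text) pairs for a single user.
    An edge joins two prompts from DIFFERENT sessions that share at least
    min_shared tokens. This overlap rule is an illustrative stand-in for
    whatever edge features a real detector would learn.
    """
    adj = defaultdict(set)
    toks = [tokens(text) for _, text in prompts]
    for i, j in combinations(range(len(prompts)), 2):
        different_session = prompts[i][0] != prompts[j][0]
        if different_session and len(toks[i] & toks[j]) >= min_shared:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def max_sessions_linked(prompts, adj):
    """Largest number of distinct sessions covered by one connected component."""
    seen = set()
    best = 0
    for start in range(len(prompts)):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        sessions = set()
        while stack:
            node = stack.pop()
            sessions.add(prompts[node][0])
            for neighbor in adj.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        best = max(best, len(sessions))
    return best


# Hypothetical users; each prompt looks unremarkable on its own.
user_a = [
    ("s1", "list files under the backup directory nightly"),
    ("s2", "compress the backup directory into an archive"),
    ("s3", "upload the backup archive to a remote server"),
]
user_b = [
    ("s1", "write a haiku about autumn leaves"),
    ("s2", "explain how binary search works"),
]

span_a = max_sessions_linked(user_a, build_user_graph(user_a))
span_b = max_sessions_linked(user_b, build_user_graph(user_b))
```

A per-prompt judge sees three harmless requests from `user_a`, while the graph view exposes that they chain across three sessions. In a fuller pipeline, a feature like this span would be one input among many to a classifier rather than a verdict by itself.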
