Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks
Karthik Raghu Iyer, Yazdan Jamshidi, Nicholas Bray, Alexey A. Shvets

TL;DR
This paper presents a comprehensive framework and taxonomy for auditing the coverage of large language model attack benchmarks, revealing significant gaps and fragmentation in current evaluation methods.
Contribution
The paper introduces a 4x6 Target x Technique matrix grounded in STRIDE and built from a 507-leaf taxonomy of inference-time attacks. The matrix is used to assess collective benchmark coverage and to identify evaluation gaps across threat categories.
Findings
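The three primary benchmarks (HarmBench, InjecAgent, AgentDojo) occupy non-overlapping cells and together cover at most 25% of the threat matrix.
Entire STRIDE threat categories (Service Disruption, Model Internals) lack any standardized evaluation.
The attack corpus shows high naming fragmentation and is concentrated in Safety & Alignment Bypass.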
Abstract
We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4x6 Target x Technique matrix grounded in STRIDE, constructed from a 507-leaf taxonomy (401 data-populated and 106 threat-model-derived leaves) of inference-time attacks extracted from 932 arXiv security studies (2023-2026). The matrix enables benchmark-external validation, that is, auditing collective coverage rather than individual benchmark consistency. Applying it to six public benchmarks reveals that the three primary frameworks (HarmBench, InjecAgent, AgentDojo) occupy non-overlapping cells covering at most 25% of the matrix, while entire STRIDE threat categories (Service Disruption, Model Internals) lack any standardized evaluation, despite published attacks in these categories achieving 46x token amplification and 96% attack success rates…
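The collective-coverage audit described in the abstract can be illustrated as a set computation over matrix cells. The following is a minimal sketch, not the authors' implementation: the axis labels (`target_0`, `technique_0`, ...), the benchmark names (`bench_a`, `bench_b`, `bench_c`), and every cell assignment are hypothetical placeholders, since the summary gives only the 4-by-6 shape of the matrix and not its row or column names.

```python
from itertools import combinations, product

# Hypothetical axis labels: the summary states only the 4-by-6 shape.
TARGETS = [f"target_{i}" for i in range(4)]
TECHNIQUES = [f"technique_{j}" for j in range(6)]
ALL_CELLS = set(product(TARGETS, TECHNIQUES))  # 24 cells in total


def audit_coverage(benchmark_cells):
    """Return (covered, uncovered, fraction) for the union of all benchmarks."""
    covered = set().union(*benchmark_cells.values()) & ALL_CELLS
    uncovered = ALL_CELLS - covered
    return covered, uncovered, len(covered) / len(ALL_CELLS)


def overlapping_pairs(benchmark_cells):
    """Return the pairs of benchmarks that share at least one matrix cell."""
    return [
        (a, b)
        for a, b in combinations(sorted(benchmark_cells), 2)
        if benchmark_cells[a] & benchmark_cells[b]
    ]


# Made-up placement for illustration only; these are NOT the paper's results.
example = {
    "bench_a": {("target_0", "technique_0"), ("target_0", "technique_1")},
    "bench_b": {("target_1", "technique_2"), ("target_1", "technique_3")},
    "bench_c": {("target_2", "technique_2"), ("target_2", "technique_4")},
}

covered, uncovered, fraction = audit_coverage(example)
shared = overlapping_pairs(example)
# Rows (targets) that no benchmark touches at all.
untouched_targets = [t for t in TARGETS if not any(c[0] == t for c in covered)]
print(f"covered {len(covered)}/{len(ALL_CELLS)} cells ({fraction:.0%})")
print("overlapping benchmark pairs:", shared)
print("targets with no coverage:", untouched_targets)
```

Treating each benchmark as a set of cells keeps the audit external to any single benchmark: coverage is a property of the union, and fragmentation shows up as an empty list of overlapping pairs.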
