RACC: Representation-Aware Coverage Criteria for LLM Safety Testing
Zeming Wei, Zhixin Zhang, Chengcan Wu, Yihao Zhang, Xiaokun Luan, Meng Sun

TL;DR
RACC is a scalable coverage criterion for LLM safety testing. It evaluates and prioritizes test suites based on safety representations in the model's hidden states rather than on individual neuron activations.
Contribution
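The paper proposes RACC, a set of representation-aware coverage criteria tailored to LLM safety testing. It addresses two limitations of neuron-level coverage on LLMs: computational overhead and the entanglement of safety-critical signals with irrelevant neuron activations.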
Findings
RACC reliably distinguishes high-quality jailbreak test suites from invalid inputs.
RACC effectively guides test suite prioritization and attack prompt sampling.
Experiments demonstrate RACC's generalization across multiple LLMs and safety benchmarks.
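The summary does not say how coverage drives prioritization. One common way to use a coverage signal is greedy selection by marginal gain, sketched below. This is an illustrative stand-in, not the paper's procedure: the function name `prioritize`, the prompt ids, and the concept labels are all hypothetical, and each prompt is assumed to have already been mapped to the set of safety concepts it activates.

```python
def prioritize(prompt_concepts):
    """Greedy ordering: each pick adds the most not-yet-covered concepts.

    prompt_concepts maps a prompt id to the set of concept labels it activates.
    Hypothetical stand-in for coverage-guided prioritization, not RACC itself.
    """
    remaining = dict(prompt_concepts)
    covered, order = set(), []
    while remaining:
        # Iterate ids in sorted order so ties resolve deterministically.
        best = max(sorted(remaining), key=lambda pid: len(remaining[pid] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order


# Toy suite with made-up prompt ids and concept labels.
suite = {
    "p1": {"violence"},
    "p2": {"violence", "fraud", "malware"},
    "p3": {"fraud"},
    "p4": {"privacy"},
}
order = prioritize(suite)
```

Under this scheme, prompts that only repeat already-covered concepts sink to the end of the ordering, so a fixed testing budget is spent first on prompts that add coverage.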
Abstract
Large Language Models (LLMs) face severe safety risks from jailbreak attacks, yet current safety testing largely relies on static datasets and lacks systematic criteria to evaluate test suite quality and adequacy. While coverage criteria have proven effective for smaller neural networks, they are impractical for LLMs due to computational overhead and the entanglement of safety-critical signals with irrelevant neuron activations. To address these issues, we propose RACC (Representation-Aware Coverage Criteria), a set of coverage criteria specialized for LLM safety testing. RACC first extracts safety representations from the LLM's hidden states using a small calibration set of harmful prompts, then measures test prompts' concept activations against these directions, and finally computes coverage through six criteria assessing both individual and compositional safety concept coverage.…
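The three steps in the abstract (extract safety directions from a calibration set, measure concept activations, compute coverage) can be sketched as below. This is a minimal sketch under strong assumptions, not the authors' implementation. It assumes calibration prompts are grouped by concept, takes the normalized mean hidden state of each group as that concept's direction, and treats a concept as activated when the dot product exceeds a fixed threshold. The two scores are simple stand-ins for "individual" and "compositional" coverage, since the paper's six criteria are not defined in this summary. All names, vectors, and the threshold are hypothetical.

```python
from itertools import combinations
from math import sqrt


def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def concept_directions(calibration):
    """Map each concept to the unit-normalized mean of its calibration hidden states.

    Assumption: a mean direction per concept; the paper's extraction may differ.
    """
    directions = {}
    for name, states in calibration.items():
        dim = len(states[0])
        mean = [sum(s[i] for s in states) / len(states) for i in range(dim)]
        directions[name] = unit(mean)
    return directions


def activated(hidden_state, directions, threshold):
    """Return the concepts whose direction the hidden state projects onto above threshold."""
    return {
        name
        for name, d in directions.items()
        if sum(h * x for h, x in zip(hidden_state, d)) > threshold
    }


def coverage(test_states, directions, threshold=0.5):
    """Return (individual, compositional) coverage for a test suite.

    individual: fraction of concepts activated by at least one prompt.
    compositional: fraction of concept pairs activated together by a single prompt.
    Both are hypothetical stand-ins, not the six RACC criteria.
    """
    per_prompt = [activated(h, directions, threshold) for h in test_states]
    names = sorted(directions)
    covered = set().union(*per_prompt) if per_prompt else set()
    pairs = list(combinations(names, 2))
    covered_pairs = [p for p in pairs if any(set(p) <= a for a in per_prompt)]
    individual = len(covered) / len(names)
    compositional = len(covered_pairs) / len(pairs) if pairs else 0.0
    return individual, compositional


# Toy 3-dimensional "hidden states"; real ones would come from an LLM layer.
calibration = {
    "violence": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "fraud": [[0.0, 1.0, 0.0]],
    "malware": [[0.0, 0.0, 1.0]],
}
directions = concept_directions(calibration)
test_states = [[1.0, 0.8, 0.0], [0.1, 0.0, 0.2]]
individual, compositional = coverage(test_states, directions)
```

In this toy run the first test prompt activates two of the three concepts together and the second activates none, so two of three concepts and one of three concept pairs are covered. Working with a handful of directions rather than every neuron is what keeps the cost low relative to neuron-level criteria.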
