AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Abhay Sheshadri; Aidan Ewart; Kai Fronsdal; Isha Gupta; Samuel R. Bowman; Sara Price; Samuel Marks; Rowan Wang

arXiv:2602.22755 · cs.CL · March 11, 2026 · 2 cites

TL;DR

AuditBench is a benchmark of 56 language models with implanted hidden, potentially concerning behaviors, built to evaluate how well alignment auditing techniques detect those behaviors. Detection success varies with the auditing tools used and with how the models were trained.

Contribution

We introduce AuditBench, a diverse set of models with hidden behaviors and an autonomous investigator agent, to systematically evaluate and compare alignment auditing tools.

Findings

01. Tools with auxiliary models perform best in detection.

02. White-box interpretability tools are less effective than black-box tools.

03. Training methods significantly affect auditability, with synthetic training easing detection.

Abstract

We introduce AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors. Each model has one of 14 concerning behaviors--such as sycophantic deference, opposition to AI regulation, or secret geopolitical loyalties--which it does not confess to when directly asked. AuditBench models are highly diverse--some are subtle, while others are overt, and we use varying training techniques both for implanting behaviors and training models not to confess. To demonstrate AuditBench's utility, we develop an investigator agent that autonomously employs a configurable set of auditing tools. By measuring investigator agent success using different tools, we can evaluate their efficacy. Notably, we observe a tool-to-agent gap, where tools that perform well in standalone non-agentic evaluations fail to translate into improved performance when used…
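
The evaluation idea in the abstract, measuring investigator agent success under different tool configurations, can be illustrated with a minimal Python sketch. It assumes each audit run is recorded as a (tool configuration, model, success) tuple; the function name, the record layout, and the example data are hypothetical and are not taken from the paper or its code.

```python
from collections import defaultdict


def success_rate_by_tool(runs):
    """Return the fraction of successful audits per tool configuration.

    `runs` is an iterable of (tool_config, model_id, success) tuples.
    This record layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for tool_config, _model_id, success in runs:
        totals[tool_config] += 1
        wins[tool_config] += int(success)
    return {tool: wins[tool] / totals[tool] for tool in totals}


# Made-up outcomes for illustration; these are NOT results from the paper.
example_runs = [
    ("baseline", "model_a", False),
    ("baseline", "model_b", True),
    ("tool_x", "model_a", True),
    ("tool_x", "model_b", True),
]

rates = success_rate_by_tool(example_runs)
for tool, rate in sorted(rates.items()):
    print(f"{tool}: {rate:.2f}")
```

Comparing the per-configuration rates against a baseline without the tool is one simple way to express whether adding a tool helps the agent; the paper's actual scoring procedure may differ.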

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Topic Modeling