CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems
Jingbo Yang, Guanyu Yao, Bairu Hou, Xinghan Yang, Nikolai Glushnev, Iwona Bialynicka-Birula, Duo Ding, Shiyu Chang

TL;DR
CompliBench is a new benchmark for evaluating LLMs' ability to detect compliance violations in dialogues, using automated data generation and challenging adversarial perturbations.
Contribution
We introduce CompliBench, a scalable benchmark with a novel data synthesis pipeline and adversarial testing for assessing LLM judges in compliance detection.
Findings
State-of-the-art LLMs struggle with compliance violation detection.
Fine-tuned small judge models outperform larger LLMs on our benchmark.
Data produced by our generation pipeline is effective for training robust generative reward models.
Abstract
As Large Language Models (LLMs) are increasingly deployed as task-oriented agents in enterprise environments, ensuring their strict adherence to complex, domain-specific operational guidelines is critical. While using an LLM-as-a-Judge is a promising solution for scalable evaluation, the reliability of these judges in detecting specific policy violations remains largely unexplored. This gap is primarily due to the lack of a systematic data generation method, which has been hindered by the high cost of fine-grained human annotation and the difficulty of synthesizing realistic agent violations. In this paper, we introduce CompliBench, a novel benchmark designed to evaluate the ability of LLM judges to detect and localize guideline violations in multi-turn dialogues. To overcome data scarcity, we develop a scalable, automated data generation pipeline that simulates user-agent…
