Better Call CLAUSE: A Discrepancy Benchmark for Auditing LLMs Legal Reasoning Capabilities
Manan Roy Choudhury, Adithya Chandramouli, Mannan Anand, Vivek Gupta

TL;DR
This paper introduces CLAUSE, a benchmark for testing the legal reasoning robustness of LLMs using real-world contract perturbations and anomaly detection, revealing significant model weaknesses in subtle legal flaw detection.
Contribution
We present CLAUSE, a novel benchmark with a persona-driven pipeline and validation system, to systematically evaluate and analyze LLMs' legal reasoning capabilities and weaknesses.
Findings
LLMs often miss subtle legal errors in contracts.
Models struggle to give legal justifications for the flaws they detect.
CLAUSE enables targeted improvement of legal reasoning in LLMs.
Abstract
The rapid integration of large language models (LLMs) into high-stakes legal work has exposed a critical gap: no benchmark exists to systematically stress-test their reliability against the nuanced, adversarial, and often subtle flaws present in real-world contracts. To address this, we introduce CLAUSE, a first-of-its-kind benchmark designed to evaluate the fragility of an LLM's legal reasoning. We study the capabilities of LLMs to detect and reason about fine-grained discrepancies by producing over 7500 real-world perturbed contracts from foundational datasets like CUAD and ContractNLI. Our novel, persona-driven pipeline generates 10 distinct anomaly categories, which are then validated against official statutes using a Retrieval-Augmented Generation (RAG) system to ensure legal fidelity. We use CLAUSE to evaluate leading LLMs' ability to detect embedded legal flaws and explain their…
Taxonomy
Topics: Artificial Intelligence in Law · Ethics and Social Impacts of AI · Legal Language and Interpretation
