OpenExempt: A Diagnostic Benchmark for Legal Reasoning and a Framework for Creating Custom Benchmarks on Demand

Sergio Servantez; Sarah B. Lawsky; Rajiv Jain; Daniel W. Linna Jr.; Kristian Hammond

arXiv:2601.13183·cs.CL·January 21, 2026

TL;DR

OpenExempt introduces a flexible framework and benchmark for detailed evaluation of legal reasoning in language models, enabling targeted probing of specific reasoning skills through dynamically generated tasks.

Contribution

The paper presents a framework that uses symbolic representations to generate customizable legal reasoning tasks, together with a benchmark built on that framework for diagnostic evaluation.

Findings

01. Models show performance cliffs with longer reasoning paths.

02. Obfuscating statements significantly impact model accuracy.

03. Benchmark covers diverse reasoning skills with 9,765 samples.

Abstract

Reasoning benchmarks have played a crucial role in the progress of language models. Yet rigorous evaluation remains a significant challenge, as static question-answer pairs provide only a snapshot of performance, compressing complex behavior into a single accuracy metric. This limitation is especially acute in complex, rule-bound domains such as law, where existing benchmarks are costly to build and ill-suited for isolating specific failure modes. To address this, we introduce OpenExempt, a framework and benchmark for diagnostic evaluation of legal reasoning. The OpenExempt Framework uses expert-crafted symbolic representations of U.S. Bankruptcy Code statutes to dynamically generate a large space of natural language reasoning tasks and their machine-computable solutions on demand. This gives users fine-grained control over task complexity and scope, allowing individual reasoning skills…
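To make the idea of generating tasks and machine-computable solutions from one symbolic representation more concrete, here is a minimal toy sketch in Python. It is not the OpenExempt implementation: the names `ExemptionRule`, `Asset`, `solve`, `render`, and `generate_task` are hypothetical, the rules are invented for illustration, and the dollar caps are placeholders rather than statutory amounts. It only shows how the same structured facts can yield both a natural language prompt and a computed answer, with task size exposed as a parameter.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class ExemptionRule:
    """Toy symbolic rule: each asset in a category is exempt up to a cap."""
    category: str
    cap: int  # hypothetical cap, NOT a real statutory amount


@dataclass(frozen=True)
class Asset:
    description: str
    category: str
    value: int


# Hypothetical rule set, invented for this sketch.
RULES = [
    ExemptionRule("vehicle", 5_000),
    ExemptionRule("tool of the trade", 2_500),
]


def solve(assets, rules):
    """Compute the total exempt amount directly from the symbolic facts."""
    caps = {rule.category: rule.cap for rule in rules}
    # Assets in a category with no rule contribute nothing.
    return sum(min(asset.value, caps.get(asset.category, 0)) for asset in assets)


def render(assets, rules):
    """Render the same symbolic facts as a natural language task."""
    rule_text = " ".join(
        f"A debtor may exempt up to ${rule.cap:,} of value in any single {rule.category}."
        for rule in rules
    )
    fact_text = " ".join(
        f"The debtor owns {asset.description} worth ${asset.value:,}."
        for asset in assets
    )
    return f"{rule_text} {fact_text} What total dollar amount can the debtor exempt?"


def generate_task(num_assets, rules, seed):
    """Sample assets, then return (prompt, solution) built from the same facts."""
    rng = random.Random(seed)  # seeded so the task is reproducible
    assets = []
    for i in range(num_assets):
        rule = rng.choice(rules)
        value = rng.randrange(500, 10_000, 500)
        assets.append(Asset(f"a {rule.category} (item {i + 1})", rule.category, value))
    return render(assets, rules), solve(assets, rules)


prompt, answer = generate_task(num_assets=3, rules=RULES, seed=0)
print(prompt)
print(answer)
```

In this sketch, raising `num_assets` lengthens the chain of per-asset computations a model must carry out, which is one simple way a generator could vary task complexity while the reference answer stays exactly computable.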

Code & Models

Datasets

SergioServantez/OpenExempt
dataset · 65 downloads

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Law · Ethics and Social Impacts of AI