DeonticBench: A Benchmark for Reasoning over Rules

Guangyao Dou; Luis Brena; Akhil Deo; William Jurayj; Jingyu Zhang; Nils Holzenberger; Benjamin Van Durme

arXiv:2604.04443 · cs.CL · April 7, 2026

TL;DR

DEONTICBENCH is a benchmark that evaluates how well large language models perform deontic reasoning over real-world legal and policy rules, either by reasoning directly in language or with the aid of symbolic computation.

Contribution

The paper introduces a large, diverse set of deontic reasoning tasks, an optional solver-based workflow in which language models translate statutes and case facts into executable Prolog, and baseline results with analysis.

Findings

01

Best models achieve only around 44-46% accuracy on benchmark tasks.

02

Training improves symbolic program generation but does not reliably solve tasks.

03

The benchmark contains 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law.

Abstract

Reasoning with complex, context-specific rules remains challenging for large language models (LLMs). In legal and policy settings, this manifests as deontic reasoning: reasoning about obligations, permissions, and prohibitions under explicit rules. While many recent benchmarks emphasize short-context mathematical reasoning, fewer focus on long-context, high-stakes deontic reasoning. To address this gap, we introduce DEONTICBENCH, a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law. These tasks can be approached in multiple ways, including direct reasoning in language or with the aid of symbolic computation. Besides free-form chain-of-thought reasoning, DEONTICBENCH enables an optional solver-based workflow in which models translate statutes and case facts into executable Prolog, leading to formal…
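To make the solver-based idea concrete, the sketch below encodes one rule and one set of case facts as executable logic, so that the answer is computed rather than argued in prose. It is only an illustration: the paper's workflow targets Prolog, whereas this uses Python as a stand-in, and the baggage policy, the constants, and the names `BaggageFacts`, `free_allowance`, and `fee_owed` are all hypothetical and not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaggageFacts:
    """Case facts for one passenger (hypothetical fields, not from the benchmark)."""
    checked_bags: int
    is_premium_member: bool


# Hypothetical policy constants, invented purely for illustration.
FREE_BAGS_STANDARD = 1
FREE_BAGS_PREMIUM = 2
FEE_PER_EXTRA_BAG = 40


def free_allowance(facts: BaggageFacts) -> int:
    """Permission: number of bags the passenger may check at no charge."""
    return FREE_BAGS_PREMIUM if facts.is_premium_member else FREE_BAGS_STANDARD


def fee_owed(facts: BaggageFacts) -> int:
    """Obligation: fee the passenger must pay for bags beyond the allowance."""
    extra_bags = max(0, facts.checked_bags - free_allowance(facts))
    return extra_bags * FEE_PER_EXTRA_BAG


if __name__ == "__main__":
    case = BaggageFacts(checked_bags=3, is_premium_member=False)
    print(fee_owed(case))
```

Separating the rule (the functions) from the case facts (the dataclass instance) mirrors the split between statutes and case facts described in the abstract: the same encoded rule can be re-run against any new set of facts.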

Code & Models

Repositories

guangyaodou/DeonticBench
github

Datasets

gydou/DeonticBench
dataset· 262 dl
