AR-BENCH: Benchmarking Legal Reasoning with Judgment Error Detection, Classification and Correction

Yifei Li; Richong Zhang; Wanyu Tu; Zhijie Nie; Haokun Luo; Chuantao Yin; Pengchong Li

arXiv:2601.22742 · cs.CL · February 2, 2026

Open Access

TL;DR

This paper introduces AR-BENCH, a new benchmark dataset for legal judgment error detection, classification, and correction, highlighting current models' limitations in legal error identification and emphasizing the need for improved AI tools in legal review.

Contribution

The paper defines a new task, APPELLATE REVIEW, and constructs AR-BENCH, a large annotated dataset for evaluating how well AI models detect, classify, and correct errors in legal judgments.

Findings

01. Existing models struggle to accurately identify legal application errors.

02. AR-BENCH contains 8,700 annotated decisions and 34,617 supplementary texts.

03. Evaluation reveals significant limitations in current large language models' legal reasoning abilities.

Abstract

Legal judgments may contain errors due to the complexity of case circumstances and the abstract nature of legal concepts, while existing appellate review mechanisms face efficiency pressures from a surge in case volumes. Although current legal AI research focuses on tasks like judgment prediction and legal document generation, the task of judgment review differs fundamentally in its objectives and paradigm: it centers on detecting, classifying, and correcting errors after a judgment is issued, constituting anomaly detection rather than prediction or generation. To address this research gap, we introduce a novel task APPELLATE REVIEW, aiming to assess models' diagnostic reasoning and reliability in legal practice. We also construct a novel dataset benchmark AR-BENCH, which comprises 8,700 finely annotated decisions and 34,617 supplementary corpora. By evaluating 14 large language models,…
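
The three-part framing in the abstract (detect, classify, correct) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ReviewOutput` schema, the error-type labels, and the two accuracy functions are hypothetical and are not taken from the paper, whose actual data format and metrics are not given in this summary.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ReviewOutput:
    """One review of one judgment (hypothetical schema, not from the paper)."""
    has_error: bool                    # detection: is the judgment flawed?
    error_type: Optional[str] = None   # classification: label, None if no error
    correction: Optional[str] = None   # correction: free-text fix, None if no error


def detection_accuracy(gold: Sequence[ReviewOutput],
                       pred: Sequence[ReviewOutput]) -> float:
    """Share of judgments where the predicted error flag matches the gold flag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    hits = sum(g.has_error == p.has_error for g, p in zip(gold, pred))
    return hits / len(gold)


def classification_accuracy(gold: Sequence[ReviewOutput],
                            pred: Sequence[ReviewOutput]) -> float:
    """Share of truly erroneous judgments whose error type was named correctly."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = [(g, p) for g, p in zip(gold, pred) if g.has_error]
    if not pairs:
        return 0.0
    hits = sum(p.has_error and p.error_type == g.error_type for g, p in pairs)
    return hits / len(pairs)


# Toy data with made-up labels, only to show how the functions are called.
gold = [
    ReviewOutput(True, "statute_misapplied", "apply the correct provision"),
    ReviewOutput(False),
    ReviewOutput(True, "sentencing", "reduce the sentence"),
]
pred = [
    ReviewOutput(True, "statute_misapplied"),
    ReviewOutput(True, "sentencing"),
    ReviewOutput(False),
]
```

Scoring classification only on judgments that truly contain an error keeps it separate from detection, so a model cannot raise its classification score by declaring every judgment correct. Correction quality is left out of the sketch because comparing free-text fixes needs a metric that this summary does not specify.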

Taxonomy

Topics: Artificial Intelligence in Law · Topic Modeling · Multi-Agent Systems and Negotiation