Benchmarking Legal RAG: The Promise and Limits of AI Statutory Surveys

Mohamed Afane; Emaan Hariri; Derek Ouyang; Daniel E. Ho

arXiv:2603.03300 · cs.CL · March 5, 2026

Open Access

TL;DR

This paper evaluates retrieval-augmented generation (RAG) tools for legal research on the LaborBench benchmark. It finds that STARA, a custom statutory research tool, outperforms the commercial platforms from Westlaw and LexisNexis, and it analyzes the sources of error to guide future system design.

Contribution

The study evaluates three tools not previously tested on LaborBench, a legal RAG benchmark introduced in prior work: STARA and two commercial platforms. It also provides design principles for improving legal AI retrieval systems.

Findings

01

STARA achieves 83% accuracy, rising to an adjusted 92% once attorney omissions in the ground truth are taken into account.

02

The commercial tools, Westlaw AI and LexisNexis, perform poorly, with 58% and 64% accuracy respectively.

03

Error analysis identifies reasoning mistakes and retrieval failures in the tools, while some apparent errors trace back to omissions by the attorneys who compiled the ground truth.
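
The gap between raw and adjusted accuracy in the findings above can be illustrated with a small sketch. It assumes that adjustment means re-scoring each Boolean item against a corrected label whenever review shows the attorney-compiled label was itself wrong; the paper's actual procedure is not described on this page and may differ. The `Item` schema, the function names, and the four toy items are hypothetical and are not data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One Boolean benchmark question (hypothetical schema, not from the paper)."""
    predicted: bool               # the tool's answer
    labeled: bool                 # the attorney-compiled label
    label_is_wrong: bool = False  # True if review found the label itself in error


def raw_accuracy(items):
    """Share of items where the prediction matches the original label."""
    return sum(i.predicted == i.labeled for i in items) / len(items)


def adjusted_accuracy(items):
    """Share of items where the prediction matches the corrected label.

    Assumption: a wrong Boolean label is corrected by flipping it.
    """
    def corrected(i):
        return (not i.labeled) if i.label_is_wrong else i.labeled

    return sum(i.predicted == corrected(i) for i in items) / len(items)


# Toy data for illustration only.
items = [
    Item(predicted=True, labeled=True),
    Item(predicted=False, labeled=False),
    Item(predicted=True, labeled=False, label_is_wrong=True),  # label omission
    Item(predicted=False, labeled=True),                       # genuine tool error
]

print(raw_accuracy(items), adjusted_accuracy(items))
```

On this toy set, the third item counts against the tool under raw scoring but in its favor once the label is corrected, which is the kind of shift the adjusted figure reflects.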

Abstract

Retrieval-augmented generation (RAG) offers significant potential for legal AI, yet systematic benchmarks are sparse. Prior work introduced LaborBench to benchmark RAG models based on ostensible ground truth from an exhaustive, multi-month, manual enumeration of all U.S. state unemployment insurance requirements by U.S. Department of Labor (DOL) attorneys. That prior work found poor performance of standard RAG (70% accuracy on Boolean tasks). Here, we assess three emerging tools not previously evaluated on LaborBench: the Statutory Research Assistant (STARA), a custom statutory research tool, and two commercial tools by Westlaw and LexisNexis marketing AI statutory survey capabilities. We make five main contributions. First, we show that STARA achieves substantial performance gains, boosting accuracy to 83%. Second, we show that commercial platforms fare poorly, with accuracy of 58%…

Taxonomy

Topics: Artificial Intelligence in Law · Ethics and Social Impacts of AI · Artificial Intelligence in Healthcare and Education