A Reasoning-Focused Legal Retrieval Benchmark

Lucia Zheng; Neel Guha; Javokhir Arifov; Sarah Zhang; Michal Skreta; Christopher D. Manning; Peter Henderson; Daniel E. Ho

arXiv:2505.03970·cs.CL·May 8, 2025

TL;DR

This paper introduces two realistic legal RAG benchmarks, Bar Exam QA and Housing Statute QA, for evaluating retrieval-augmented LLMs on complex legal question answering. Existing retriever pipelines perform poorly on both, indicating that legal RAG remains a challenging application.

Contribution

It presents novel legal RAG benchmarks that mimic real-world legal research, facilitating evaluation of retrieval-augmented LLMs in legal contexts.

Findings

01 Existing retriever pipelines perform poorly on legal RAG tasks.

02 Legal RAG remains a challenging application for current models.

03 Benchmarks reflect real-world legal research complexity.

Abstract

As the legal community increasingly examines the use of large language models (LLMs) for various legal applications, legal AI developers have turned to retrieval-augmented LLMs ("RAG" systems) to improve system performance and robustness. An obstacle to the development of specialized RAG systems is the lack of realistic legal RAG benchmarks which capture the complexity of both legal retrieval and downstream legal question-answering. To address this, we introduce two novel legal RAG benchmarks: Bar Exam QA and Housing Statute QA. Our tasks correspond to real-world legal research tasks, and were produced through annotation processes which resemble legal research. We describe the construction of these benchmarks and the performance of existing retriever pipelines. Our results suggest that legal RAG remains a challenging application, thus motivating future research.

Code & Models

Datasets

isaacus/mteb-barexam-qa
dataset · 85 dl

Taxonomy

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Dropout · Layer Normalization · Byte Pair Encoding · Attention Dropout · Softmax · Residual Connection · WordPiece