Legal RAG Bench: an end-to-end benchmark for legal RAG

Abdur-Rahman Butler; Umar Butler

arXiv:2603.01710 · cs.CL · March 3, 2026

TL;DR

Legal RAG Bench is a benchmark and evaluation framework for legal retrieval-augmented generation (RAG) systems. Its results indicate that retrieval quality matters more than the choice of language model for legal RAG performance.

Contribution

Introduces a legal RAG benchmark together with an evaluation methodology built on a full factorial design and a hierarchical error decomposition, then evaluates three embedding models and two LLMs to identify what drives performance.

Findings

01 Retrieval quality is the main factor influencing legal RAG performance.

02 Kanon 2 Embedder significantly improves correctness and retrieval accuracy.

03 Many errors that look like hallucinations stem from retrieval failures rather than from the language model itself.

Abstract

We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench leverages a full factorial design and novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus' Kanon 2 Embedder, Google's Gemini Embedding 001, and OpenAI's Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal…
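
To make the evaluation design concrete, here is a minimal Python sketch of what a full factorial grid over the named models looks like, paired with a simple two-level error attribution. It is an illustration under stated assumptions, not the paper's implementation: the function names `factorial_grid` and `attribute_error`, the error labels, and the rule "a wrong answer counts as a retrieval failure when no supporting passage was retrieved" are all hypothetical, and the paper's actual hierarchy may differ.

```python
from itertools import product

# Model names are taken from the abstract; everything else is illustrative.
EMBEDDERS = ["Kanon 2 Embedder", "Gemini Embedding 001", "Text Embedding 3 Large"]
LLMS = ["Gemini 3.1 Pro", "GPT-5.2"]


def factorial_grid(embedders, llms):
    """Return every (embedder, llm) pairing, i.e. a full factorial design."""
    return list(product(embedders, llms))


def attribute_error(answer_correct, retrieved_ids, supporting_ids):
    """Hypothetical two-level attribution for a single question.

    Level 1: was the answer correct?
    Level 2: if not, did retrieval surface any supporting passage?
    """
    if answer_correct:
        return "correct"
    # Assumption: with no supporting passage retrieved, the error is
    # charged to retrieval rather than to the LLM.
    if not set(supporting_ids) & set(retrieved_ids):
        return "retrieval_failure"
    return "reasoning_failure"


grid = factorial_grid(EMBEDDERS, LLMS)
for embedder, llm in grid:
    print(f"{embedder} + {llm}")
```

Because every embedder is paired with every LLM, a difference between two rows that share an LLM can be read as the effect of the embedder, and vice versa. That is what makes a like-for-like comparison of retrieval and reasoning contributions possible.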

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Law · Authorship Attribution and Profiling