Legal RAG Bench: an end-to-end benchmark for legal RAG
Abdur-Rahman Butler, Umar Butler

TL;DR
Legal RAG Bench provides a comprehensive benchmark and evaluation framework for legal retrieval-augmented generation systems, highlighting the importance of retrieval quality over language model sophistication in legal AI performance.
Contribution
Introduces a legal RAG benchmark together with an evaluation methodology built on a full factorial design and a hierarchical error decomposition, and evaluates three embedding models and two LLMs to identify the key drivers of performance.
Findings
Retrieval quality is the main factor influencing legal RAG performance.
Kanon 2 Embedder significantly improves correctness and retrieval accuracy.
Many errors that look like hallucinations are caused by retrieval failures rather than by the language model itself.
Abstract
We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench leverages a full factorial design and novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus' Kanon 2 Embedder, Google's Gemini Embedding 001, and OpenAI's Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal…
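As a rough illustration of the two ideas the abstract names, the sketch below enumerates a full factorial grid over the listed embedders and LLMs and assigns each wrong answer to exactly one error category by checking retrieval before generation. It is a minimal sketch, not the authors' implementation: the category names, the check order, and the `factorial_grid`, `classify`, and `error_breakdown` helpers are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Model names as listed in the abstract.
EMBEDDERS = ["Kanon 2 Embedder", "Gemini Embedding 001", "Text Embedding 3 Large"]
LLMS = ["Gemini 3.1 Pro", "GPT-5.2"]


def factorial_grid(embedders, llms):
    """Full factorial design: every embedder paired with every LLM."""
    return list(product(embedders, llms))


def classify(retrieved_ids, gold_ids, answer_correct, answer_grounded):
    """Hypothetical hierarchical labelling of one question's outcome.

    Checks run top-down, so an error is blamed on retrieval before it can
    be blamed on the LLM. Category names are assumptions, not the paper's.
    """
    if answer_correct:
        return "correct"
    if not set(retrieved_ids) & set(gold_ids):
        # No supporting passage was retrieved, so the LLM never saw the evidence.
        return "retrieval_failure"
    if not answer_grounded:
        # Evidence was available, but the answer is not supported by it.
        return "hallucination"
    return "reasoning_error"


def error_breakdown(records):
    """Count outcome labels over an iterable of per-question records (dicts)."""
    return Counter(classify(**record) for record in records)
```

Running `error_breakdown` once per cell of `factorial_grid(EMBEDDERS, LLMS)` would give six breakdowns that differ in only one component at a time, which is what makes the retrieval and reasoning contributions comparable.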
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Law · Authorship Attribution and Profiling
