LegalCiteBench: Evaluating Citation Reliability in Legal Language Models
Sijia Chen, Hang Yin, Shunfan Zhou

TL;DR
LegalCiteBench is a new benchmark for evaluating legal language models' ability to accurately recover, verify, and match citations in a closed-book setting, revealing significant challenges and high rates of fabricated citations.
Contribution
The paper introduces LegalCiteBench, a benchmark of approximately 24K evaluation instances for studying citation-related tasks in legal LLMs, and highlights the models' limitations in generating legal authorities.
Findings
Models perform poorly on citation retrieval and completion, scoring below 7/100.
Misleading Answer Rates (MAR) above 94% indicate that models frequently return incorrect citations.
Explicit uncertainty instructions reduce confident fabrication but do not improve correctness.
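This summary does not define how MAR is computed. As a rough illustration only, the sketch below assumes MAR is the share of all responses that give a concrete but wrong citation instead of abstaining; the paper's actual definition (for example, its denominator or matching rule) may differ. The names `Response`, `normalize`, and `misleading_answer_rate` are hypothetical and not taken from the benchmark.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Response:
    """One model answer. All field names here are hypothetical."""
    predicted: Optional[str]  # None means the model abstained
    gold: str                 # reference citation


def normalize(citation: str) -> str:
    # Assumed matching rule: case-insensitive, whitespace-collapsed exact match.
    return " ".join(citation.lower().split())


def misleading_answer_rate(responses: Sequence[Response]) -> float:
    """Fraction of responses that assert a citation that does not match gold.

    Assumption: abstentions and correct answers both count as non-misleading,
    and the denominator is the total number of responses.
    """
    if not responses:
        raise ValueError("responses must be non-empty")
    misleading = sum(
        1
        for r in responses
        if r.predicted is not None and normalize(r.predicted) != normalize(r.gold)
    )
    return misleading / len(responses)


# Toy data for illustration; these citations are examples, not benchmark rows.
sample = [
    Response(predicted="410 U.S. 113", gold="410 U.S. 113"),  # correct
    Response(predicted="123 F.3d 456", gold="347 U.S. 483"),  # wrong
    Response(predicted="999 U.S. 1", gold="5 U.S. 137"),      # wrong
    Response(predicted=None, gold="384 U.S. 436"),            # abstained
]
rate = misleading_answer_rate(sample)
```

Under this reading, an instruction that makes a model abstain more often would lower MAR without raising accuracy, which is consistent with the third finding above.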
Abstract
Large language models (LLMs) are increasingly integrated into legal drafting and research workflows, where incorrect citations or fabricated precedents can cause serious professional harm. Existing legal benchmarks largely emphasize statutory reasoning, contract understanding, or general legal question answering, but they do not directly study a central common-law failure mode: when asked to provide case authorities without external grounding, models may return plausible-looking but incorrect citations or cases. We introduce LegalCiteBench, a benchmark for studying closed-book citation recovery, citation verification, and case matching in legal language models. LegalCiteBench contains approximately 24K evaluation instances constructed from 1,000 real U.S. judicial opinions from the Case Law Access Project. The benchmark covers five citation-centric tasks: citation retrieval, citation…
