Fine-grained Claim-level RAG Benchmark for Law
Souvick Das, Sallam Abualhaija, Domenico Bianculli

TL;DR
This paper introduces ClaimRAG-LAW, a legal RAG benchmark dataset in English and French that covers both expert and non-expert queries and enables fine-grained, separate evaluation of retrieval and generation performance.
Contribution
It provides a new multilingual, multi-user legal RAG dataset and a fine-grained evaluation framework that reveals the limitations of current systems.
Findings
Legal RAG systems still hallucinate at varying rates.
Existing benchmarks lack granularity and multilingual support.
ClaimRAG-LAW enables detailed analysis of retrieval and generation in legal RAG.
Abstract
The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questions and LLMs generate responses. In high-stakes domains such as law, retrieval-augmented generation (RAG) is commonly used to mitigate hallucinations in generated responses. Nonetheless, prior work shows that RAG systems, whether general-purpose or legal-specific, still hallucinate at varying rates, making fine-grained evaluation essential. Despite this need, existing evaluation frameworks for legal RAG systems lack the granularity required to analyze retrieval and generation performance separately. Moreover, current benchmarks are largely English-only and centered on legal expert queries, overlooking non-expert needs. We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both…
