Reliability by design: quantifying and eliminating fabrication risk in LLMs. From generative to consultative AI: a comparative analysis in the legal domain and lessons for high-stakes knowledge bases

Alex Dantart

arXiv:2601.15476·cs.AI·January 23, 2026

Open Access

TL;DR

This paper evaluates three AI architectures for legal applications (standalone generative models, basic RAG, and advanced RAG), introduces two reliability metrics, and shows that advanced retrieval-augmented systems cut fabrication to below 0.2%, compared with a False Citation Rate above 30% for standalone models.

Contribution

It introduces two reliability metrics, False Citation Rate (FCR) and Fabricated Fact Rate (FFR), and evaluates 12 LLMs across 75 legal tasks, showing that advanced retrieval-augmented systems are the most effective at reducing fabrication errors.

Findings

01. Standalone models have high error rates (FCR above 30%) and are unsuitable for professional use.

02. Basic RAG greatly reduces errors but still leaves notable misgrounding.

03. Advanced RAG reduces fabrication to negligible levels (below 0.2%).

Abstract

This paper examines how to make large language models reliable for high-stakes legal work by reducing hallucinations. It distinguishes three AI paradigms: (1) standalone generative models ("creative oracle"), (2) basic retrieval-augmented systems ("expert archivist"), and (3) an advanced, end-to-end optimized RAG system ("rigorous archivist"). It introduces two reliability metrics, False Citation Rate (FCR) and Fabricated Fact Rate (FFR), and evaluates 2,700 judicial-style answers from 12 LLMs across 75 legal tasks using expert, double-blind review. Results show that standalone models are unsuitable for professional use (FCR above 30%), while basic RAG greatly reduces errors but still leaves notable misgrounding. Advanced RAG, using techniques such as embedding fine-tuning, re-ranking, and self-correction, reduces fabrication to negligible levels (below 0.2%). The study concludes…
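The abstract names FCR and FFR but does not give their formulas here. As a rough illustration only, the sketch below assumes each metric is a simple ratio pooled per system: falsely cited authorities over all citations for FCR, and fabricated factual claims over all factual claims for FFR. The `ReviewedAnswer` schema, its field names, and the function `reliability_by_system` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ReviewedAnswer:
    """One expert-reviewed answer (hypothetical schema, not from the paper)."""
    system: str            # e.g. "standalone", "basic_rag", "advanced_rag"
    citations_total: int   # citations found in the answer
    citations_false: int   # citations the reviewers judged false
    facts_total: int       # factual claims found in the answer
    facts_fabricated: int  # claims the reviewers judged fabricated


def rate(bad: int, total: int) -> float:
    """Return bad / total, or 0.0 when there is nothing to count."""
    return bad / total if total else 0.0


def reliability_by_system(
    answers: Iterable[ReviewedAnswer],
) -> Dict[str, Dict[str, float]]:
    """Pool counts per system and compute the assumed FCR and FFR ratios."""
    totals: Dict[str, list] = {}
    for a in answers:
        t = totals.setdefault(a.system, [0, 0, 0, 0])
        t[0] += a.citations_false
        t[1] += a.citations_total
        t[2] += a.facts_fabricated
        t[3] += a.facts_total
    return {
        system: {"FCR": rate(t[0], t[1]), "FFR": rate(t[2], t[3])}
        for system, t in totals.items()
    }
```

Pooling counts across answers, rather than averaging per-answer rates, is a design choice made for this sketch: it keeps answers with very few citations from dominating the result. The paper may aggregate differently.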

Taxonomy

Topics: Artificial Intelligence in Law · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI