Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations

Abe Bohan Hou; William Jurayj; Nils Holzenberger; Andrew Blair-Stanek; Benjamin Van Durme

arXiv:2409.09947 · cs.CL · September 25, 2024

TL;DR

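This paper studies how to evaluate machine-generated legal analysis. It introduces gaps, a neutral notion covering the differences between human-written and machine-generated analysis, as an alternative to hallucinations in a strictly erroneous sense. It also develops a detector that predicts gap categories in LLM outputs, and the reported results point to high hallucination rates.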

Contribution

It introduces the notion of gaps in legal analysis, creates a taxonomy and a detector for gap categories, and provides an annotated dataset for automatic evaluation of LLM-generated legal texts.

Findings

01

The best detector achieves a 67% F1 score and 80% precision on the test set.

02

Approximately 80% of LLM-generated legal analyses contain hallucinations.

03

Gaps are a neutral notion describing differences between human-written and machine-generated legal analysis; a gap does not always mean the generation is invalid.
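The 67% F1 and 80% precision figures reported for the best detector do not state recall directly. As a rough reading aid, the sketch below solves the standard F1 formula for recall. It assumes the reported F1 is the plain harmonic mean of a single precision and recall pair; if the paper reports a macro- or weighted-averaged F1 over gap categories, this back-calculation does not hold. The function name `implied_recall` is illustrative and not taken from the paper or its repository.

```python
def implied_recall(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given F1 and P as fractions in (0, 1].

    Assumption: F1 is the harmonic mean of this exact precision and recall,
    which may not be how the paper aggregates its scores.
    """
    denominator = 2 * precision - f1
    if denominator <= 0:
        # No valid recall exists when F1 >= 2P.
        raise ValueError("F1 and precision are inconsistent")
    return f1 * precision / denominator


recall = implied_recall(f1=0.67, precision=0.80)
print(f"implied recall: {recall:.3f}")  # implied recall: 0.576
```

Under that assumption, the detector would be recovering somewhat more than half of the gaps it should flag, while about four in five of its flags are correct.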

Abstract

Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs can often hallucinate in this setting, in ways difficult to recognize by non-professionals and existing text evaluation metrics. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps, as opposed to hallucinations in a strict erroneous sense, to refer to the difference between human-written and machine-generated legal analysis. Gaps do not always equate to invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), leading to a taxonomy, a fine-grained detector for predicting gap categories, and an annotated dataset for automatic evaluation. Our best detector achieves 67% F1 score and 80% precision on the test set. Employing…

Code & Models

Repositories

bohanhou14/GapHalu (official)

Taxonomy

Topics: Artificial Intelligence in Law · Comparative and International Law Studies