Measuring Faithfulness and Abstention: An Automated Pipeline for Evaluating LLM-Generated 3-ply Case-Based Legal Arguments

Li Zhang; Morgan Gray; Jaromir Savelka; Kevin D. Ashley

arXiv:2506.00694·cs.CL·June 4, 2025

TL;DR

This paper presents an automated pipeline that evaluates whether large language models generate faithful, relevant legal arguments and abstain from generating unsupported ones, addressing reliability concerns in legal AI applications.

Contribution

It introduces a scalable, automated method for assessing faithfulness, factor utilization, and abstention in LLM-generated legal arguments, extending pilot work that relied on human evaluation.

Findings

01 High accuracy in avoiding hallucination on viable arguments

02 Models often underutilize relevant case factors

03 Most models fail to abstain when instructed and no facts are shared

Abstract

Large Language Models (LLMs) demonstrate potential in complex legal tasks like argument generation, yet their reliability remains a concern. Building upon pilot work assessing LLM generation of 3-ply legal arguments using human evaluation, this paper introduces an automated pipeline to evaluate LLM performance on this task, specifically focusing on faithfulness (absence of hallucination), factor utilization, and appropriate abstention. We define hallucination as the generation of factors not present in the input case materials and abstention as the model's ability to refrain from generating arguments when instructed and no factual basis exists. Our automated method employs an external LLM to extract factors from generated arguments and compares them against the ground-truth factors provided in the input case triples (current case and two precedent cases). We evaluated eight distinct…
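The comparison step described in the abstract can be illustrated with plain set operations. The following is a minimal sketch, not the authors' implementation: it assumes the external LLM has already returned the extracted factors as a collection of identifiers, and the names `FactorReport`, `score_argument`, `abstention_expected`, and the factor labels such as `F1` are hypothetical. The utilization ratio and the "no factor in common" abstention rule are likewise assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactorReport:
    hallucinated: frozenset  # extracted factors absent from the input cases
    used: frozenset          # extracted factors that do appear in the input
    utilization: float       # share of input factors the argument used (assumed metric)
    faithful: bool           # True when nothing was hallucinated


def score_argument(extracted_factors, ground_truth_factors):
    """Compare factors extracted from an argument against the input factors."""
    extracted = frozenset(extracted_factors)
    truth = frozenset(ground_truth_factors)
    hallucinated = extracted - truth
    used = extracted & truth
    # Guard against an empty ground-truth set to avoid division by zero.
    utilization = len(used) / len(truth) if truth else 0.0
    return FactorReport(hallucinated, used, utilization, not hallucinated)


def abstention_expected(current_case_factors, precedent_factors):
    """Assumed rule: abstain when the two cases have no factor in common."""
    return not (set(current_case_factors) & set(precedent_factors))


# Hypothetical factor labels: the argument cites F9, which is not in the input.
report = score_argument({"F1", "F6", "F9"}, {"F1", "F6", "F15", "F21"})
print(sorted(report.hallucinated), report.utilization, report.faithful)
# prints: ['F9'] 0.5 False
```

Using sets keeps each check to one operation: set difference flags hallucinated factors, and set intersection gives both the used factors and the shared-factor test for abstention.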

Taxonomy

Topics: Legal Systems and Judicial Processes · Legal Education and Practice Innovations · Artificial Intelligence in Law

Methods: Sparse Evolutionary Training