Measuring Faithfulness and Abstention: An Automated Pipeline for Evaluating LLM-Generated 3-ply Case-Based Legal Arguments
Li Zhang, Morgan Gray, Jaromir Savelka, Kevin D. Ashley

TL;DR
This paper presents an automated pipeline for evaluating whether large language models generate legal arguments that are faithful to the input cases, make use of the relevant factors, and abstain when no supported argument can be made, addressing reliability concerns in legal AI applications.
Contribution
It introduces a scalable, automated method for assessing faithfulness, factor utilization, and abstention in LLM-generated legal arguments, extending evaluation beyond the human assessment used in the pilot work.
Findings
High accuracy in avoiding hallucination on viable arguments
Models often underutilize relevant case factors
Most models fail to abstain when instructed and no facts are shared
Abstract
Large Language Models (LLMs) demonstrate potential in complex legal tasks like argument generation, yet their reliability remains a concern. Building upon pilot work assessing LLM generation of 3-ply legal arguments using human evaluation, this paper introduces an automated pipeline to evaluate LLM performance on this task, specifically focusing on faithfulness (absence of hallucination), factor utilization, and appropriate abstention. We define hallucination as the generation of factors not present in the input case materials and abstention as the model's ability to refrain from generating arguments when instructed and no factual basis exists. Our automated method employs an external LLM to extract factors from generated arguments and compares them against the ground-truth factors provided in the input case triples (current case and two precedent cases). We evaluated eight distinct…
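The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes the external LLM has already returned the set of factor labels it found in a generated argument; the extraction call itself is not shown. The names `CaseTriple` and `score_argument`, the factor labels, and the exact metric formulas are illustrative assumptions and are not taken from the paper. In particular, treating "no factual basis" as "the current case shares no factor with either precedent" is one possible reading, not the authors' stated definition.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class CaseTriple:
    """Ground-truth factors for the current case and two precedents (hypothetical layout)."""
    current: FrozenSet[str]
    precedent_a: FrozenSet[str]
    precedent_b: FrozenSet[str]

    def all_factors(self) -> FrozenSet[str]:
        # Every factor that appears anywhere in the input case materials.
        return self.current | self.precedent_a | self.precedent_b

    def has_shared_factors(self) -> bool:
        # Assumption: an argument has a factual basis only if the current
        # case shares at least one factor with one of the precedents.
        return bool(self.current & (self.precedent_a | self.precedent_b))


def score_argument(extracted: Iterable[str], triple: CaseTriple) -> Dict[str, object]:
    """Compare factors extracted from a generated argument with the ground truth."""
    extracted_set = set(extracted)
    allowed = triple.all_factors()

    # Hallucination: factors cited in the argument but absent from the inputs.
    hallucinated = extracted_set - allowed
    used = extracted_set & allowed

    # Assumed formulas: share of cited factors that are grounded, and
    # share of available factors that the argument actually used.
    faithfulness = len(used) / len(extracted_set) if extracted_set else 1.0
    utilization = len(used) / len(allowed) if allowed else 1.0

    # Abstention is expected when there is no factual basis; the model is
    # counted as abstaining if no factors were found in its output.
    abstention_expected = not triple.has_shared_factors()
    abstained = not extracted_set

    return {
        "hallucinated": hallucinated,
        "faithfulness": faithfulness,
        "utilization": utilization,
        "abstention_expected": abstention_expected,
        "abstained_correctly": abstention_expected and abstained,
    }


# Usage with made-up factor labels.
triple = CaseTriple(
    current=frozenset({"F1", "F2"}),
    precedent_a=frozenset({"F1", "F3"}),
    precedent_b=frozenset({"F2", "F4"}),
)
scores = score_argument({"F1", "F3", "F9"}, triple)
print(sorted(scores["hallucinated"]), scores["utilization"])
```

Using sets keeps the comparison independent of how often or in what order a factor is mentioned. A real pipeline would also need to normalize the factor labels returned by the extractor before comparing them.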
Taxonomy
Topics: Legal Systems and Judicial Processes · Legal Education and Practice Innovations · Artificial Intelligence in Law
Methods: Sparse Evolutionary Training
