Prompting from the bench: Large-scale pretraining is not sufficient to prepare LLMs for ordinary meaning analysis
Abhishek Purushothama, Junghyun Min, Brandon Waldon, Nathan Schneider

TL;DR
This paper critically evaluates whether large language models can determine the ordinary meaning of legal texts. It finds significant robustness failures and only moderate alignment with human judgments, which calls the models' practical utility for this task into question.
Contribution
It provides empirical evidence that current LLMs are insufficient for legal interpretation tasks, highlighting their vulnerabilities and limited correlation with human understanding.
Findings
Model judgments are not robust to changes in question format.
Model judgments correlate only moderately with human judgments.
Current LLMs are not reliable for legal interpretation in practice.
Abstract
In the U.S. judicial system, a widespread approach to legal interpretation entails assessing how a legal text would be understood by an 'ordinary' speaker of the language. Recent scholarship has proposed that legal practitioners leverage large language models (LLMs) to ascertain a text's ordinary meaning. But are LLMs up to the task? As textual interpretation questions arise in spheres ranging from criminal law to civil rights, we argue it is crucial that models not be taken as authoritative without rigorous evaluation. This work offers an empirical argument against LLM-assisted interpretation as recently practiced by legal scholars and federal judges, who reasoned the large amount of data that models see in training would enable models to illuminate how people ordinarily use certain words or phrases. In controlled experiments, we find failures in robustness which cast doubt on this…
