LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation
Shanshan Xu, Johan Lindholm, Amogh Raina, Henrik Palmer Olsen, Daniel Hershcovich

TL;DR
This paper introduces LP-Eval, a comprehensive rubric and dataset for assessing the quality of legal propositions generated by large language models, emphasizing formal validity and substantive accuracy.
Contribution
It presents a novel evaluation framework co-designed with legal experts and provides a dataset of annotated LLM-generated legal propositions for improved assessment.
Findings
LLMs can generate well-formed legal propositions.
Experts rate propositions from well-established cases higher than those from recent cases.
Rubric-guided LLM evaluations align better with experts.
Abstract
Legal proposition generation is central to legal reasoning and doctrinal scholarship, yet remains under-examined in Legal NLP. This paper investigates the automatic generation and evaluation of legal propositions from decisions of the Court of Justice of the European Union using large language models (LLMs). We introduce LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions. Using this rubric, we release a dataset of two experts' annotations for 100 LLM-generated legal propositions. Our results show that LLMs can generate predominantly well-formed and high-quality propositions, while expert evaluations reveal higher quality for propositions derived from well-established cases than from recent ones. We further examine LLMs as evaluators and find that rubric-guided LLM judgments align…
