TL;DR
This paper benchmarks large language models on a new legal precedent classification dataset and introduces a severity-based error metric. The best-performing model differs between the high-level and the fine-grained classification tasks.
Contribution
It presents a new expert-annotated dataset, a novel evaluation metric, and benchmarks multiple LLMs on legal precedent classification tasks.
Findings
Gemini 2.5 Flash achieved the highest accuracy (79.1%) on the high-level classification task.
GPT-5-mini achieved the highest accuracy (67.7%) on the more complex fine-grained schema.
The new metric better captures the practical impact of classification errors.
Abstract
Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split. Google's Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI's GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal…
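The abstract names the Average Severity Error metric but does not define it. As a rough illustration of why a severity-aware score can separate models that plain accuracy cannot, the sketch below assumes one plausible reading: each label has an ordinal severity rank, and the metric is the mean absolute rank difference between the true and predicted labels. The label names, the ranks, and the formula are all assumptions made for this example, not the paper's definition.

```python
from typing import Mapping, Sequence

# Hypothetical severity ranks. The paper's real label set and ordering
# are not given in the abstract, so these are illustrative only.
SEVERITY: Mapping[str, int] = {
    "neutral": 0,
    "distinguished": 1,
    "criticized": 2,
    "overruled": 3,
}


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true label exactly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def average_severity_error(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    severity: Mapping[str, int] = SEVERITY,
) -> float:
    """Mean absolute difference in severity rank (assumed definition)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(severity[t] - severity[p]) for t, p in zip(y_true, y_pred))
    return total / len(y_true)


y_true = ["overruled", "neutral", "criticized", "distinguished"]
# Both prediction sets make exactly one mistake on the first citation.
pred_near = ["criticized", "neutral", "criticized", "distinguished"]
pred_far = ["neutral", "neutral", "criticized", "distinguished"]

print(accuracy(y_true, pred_near), average_severity_error(y_true, pred_near))
print(accuracy(y_true, pred_far), average_severity_error(y_true, pred_far))
```

Under these assumed ranks, both prediction sets have the same accuracy, but mistaking an overruled case for a neutral one scores three times worse than mistaking it for a criticized one. That is the kind of distinction a severity-based metric is meant to expose.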