Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification

M. Mikail Demir; M. Abdullah Canbaz

arXiv:2605.17691 · cs.CL · May 19, 2026

TL;DR

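This paper benchmarks large language models on a new expert-annotated dataset of 239 real-world legal citations for precedent treatment classification, introduces the severity-based Average Severity Error metric, and reports a performance split between models on the high-level task and the fine-grained schema.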

Contribution

The paper contributes a new expert-annotated dataset of 239 real-world legal citations, the Average Severity Error evaluation metric, and baseline benchmarks of modern LLMs on classifying negative treatment of legal precedent.

Findings

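01. Gemini 2.5 Flash achieved the highest accuracy on the high-level classification task (79.1%).

02. GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%).

03. The proposed Average Severity Error metric is designed to capture the practical impact of classification errors better than standard accuracy.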

Abstract

Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split. Google's Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI's GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal…
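
The abstract names the Average Severity Error metric but does not define it here. As a rough illustration of why a severity-aware score can separate models that plain accuracy treats as equal, the sketch below assumes one plausible reading: each label carries an ordinal severity rank, and the error is the mean absolute rank difference between gold and predicted labels. The label names, the ranks in `SEVERITY`, and the function names are all hypothetical and are not taken from the paper; the sketch also uses a single label per citation for simplicity.

```python
from typing import Mapping, Sequence

# Hypothetical treatment labels and ordinal severity ranks.
# These are assumptions for illustration only, not the paper's schema.
SEVERITY: Mapping[str, int] = {
    "followed": 0,
    "distinguished": 1,
    "criticized": 2,
    "overruled": 3,
}


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def average_severity_error(
    gold: Sequence[str],
    pred: Sequence[str],
    severity: Mapping[str, int] = SEVERITY,
) -> float:
    """Mean absolute difference in severity rank (assumed definition)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    total = sum(abs(severity[g] - severity[p]) for g, p in zip(gold, pred))
    return total / len(gold)


gold = ["overruled", "criticized", "followed", "distinguished"]
# Both hypothetical models make exactly one mistake on the first citation.
near_miss = ["criticized", "criticized", "followed", "distinguished"]
far_miss = ["followed", "criticized", "followed", "distinguished"]

print(accuracy(gold, near_miss), accuracy(gold, far_miss))
print(average_severity_error(gold, near_miss),
      average_severity_error(gold, far_miss))
```

Under these assumptions both prediction sets have the same accuracy, but labeling an overruled precedent as followed is penalized three times as heavily as labeling it criticized, which is the kind of practical distinction the abstract says standard accuracy misses.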
