LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text

Li yunhan; Wu gengshen

arXiv:2505.24826·cs.CL·November 11, 2025

TL;DR

This paper introduces a new benchmark and evaluation framework for assessing the linguistic quality of legal texts generated by large language models, addressing gaps in current factual accuracy-focused metrics.

Contribution

The paper develops a regression-based evaluation model for legal text quality, creates a specialized set of legal questions, and uses both to analyze 49 LLMs, characterizing how model performance varies and where it stops improving.

Findings

1. Model quality plateaus at 14 billion parameters.
2. Engineering choices such as quantization have minimal impact.
3. Reasoning models outperform base architectures.

Abstract

As large language models (LLMs) are increasingly used in legal applications, current evaluation benchmarks tend to focus mainly on factual accuracy while largely neglecting important linguistic quality aspects such as clarity, coherence, and terminology. To address this gap, we propose three steps: First, we develop a regression model to evaluate the quality of legal texts based on clarity, coherence, and terminology. Second, we create a specialized set of legal questions. Third, we analyze 49 LLMs using this evaluation framework. Our analysis identifies three key findings: First, model quality levels off at 14 billion parameters, with only a marginal improvement of $2.7\%$ noted at 72 billion parameters. Second, engineering choices such as quantization and context length have a negligible impact, as indicated by statistical significance thresholds above 0.016. Third, reasoning models…
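To make the scale of the reported plateau concrete, the following is a minimal sketch of how a gain such as 2.7% could be computed from two mean quality scores. It assumes the figure is a relative change in mean score, which the abstract does not state, and the function name `relative_improvement` and both score values are hypothetical, chosen for illustration rather than taken from the paper.

```python
def relative_improvement(baseline: float, candidate: float) -> float:
    """Return the percentage change of `candidate` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


# Hypothetical mean quality scores (NOT values from the paper).
score_14b = 4.00   # assumed mean score of a 14B-parameter model
score_72b = 4.108  # assumed mean score of a 72B-parameter model

gain = relative_improvement(score_14b, score_72b)
print(f"Relative gain from 14B to 72B: {gain:.1f}%")
```

Under this reading, roughly a fivefold increase in parameter count buys only a few percent of additional measured quality, which is the sense in which the abstract describes quality as leveling off.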

Code & Models

Repositories

lyxx3rd/legaleval-q
PyTorch · Official

Taxonomy

Topics: Artificial Intelligence in Law

Methods: Focus · Balanced Selection · Sparse Evolutionary Training