LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text
Li Yunhan, Wu Gengshen

TL;DR
This paper introduces a new benchmark and evaluation framework for assessing the linguistic quality of legal texts generated by large language models, addressing gaps in current factual accuracy-focused metrics.
Contribution
It develops a regression-based evaluation model for legal text quality, creates a specialized legal question set, and analyzes 49 LLMs, revealing key insights about model performance and limitations.
Findings
Model quality plateaus at 14 billion parameters.
Engineering choices like quantization have minimal impact.
Reasoning models outperform base architectures.
Abstract
As large language models (LLMs) are increasingly used in legal applications, current evaluation benchmarks tend to focus mainly on factual accuracy while largely neglecting important linguistic quality aspects such as clarity, coherence, and terminology. To address this gap, we propose three steps: First, we develop a regression model to evaluate the quality of legal texts based on clarity, coherence, and terminology. Second, we create a specialized set of legal questions. Third, we analyze 49 LLMs using this evaluation framework. Our analysis identifies three key findings: First, model quality levels off at 14 billion parameters, with only a marginal improvement noted at 72 billion parameters. Second, engineering choices such as quantization and context length have a negligible impact, as indicated by statistical significance thresholds above 0.016. Third, reasoning models…
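The plateau finding can be illustrated with a minimal sketch. It is not the authors' method: the function name `plateau_point`, the tolerance, and every score below are hypothetical placeholders chosen only to show how a "levels off at size N" claim might be checked against per-size mean quality scores.

```python
def plateau_point(scores_by_size, tolerance):
    """Return the smallest model size after which no larger size
    improves the score by more than `tolerance`.

    scores_by_size: dict mapping parameter count (in billions) to a
    mean quality score. Both are hypothetical inputs here.
    """
    if not scores_by_size:
        raise ValueError("scores_by_size must not be empty")
    sizes = sorted(scores_by_size)
    for i, size in enumerate(sizes):
        base = scores_by_size[size]
        # Gains of every larger model relative to the current one.
        gains = [scores_by_size[s] - base for s in sizes[i + 1:]]
        if all(g <= tolerance for g in gains):
            return size
    return sizes[-1]


# Placeholder numbers, NOT results from the paper.
toy_scores = {1.5: 2.0, 7: 3.0, 14: 3.5, 32: 3.52, 72: 3.55}
print(plateau_point(toy_scores, tolerance=0.1))
```

The tolerance decides what counts as a "marginal" improvement, so the reported plateau depends on that choice as much as on the scores themselves.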
Taxonomy
Topics: Artificial Intelligence in Law
Methods: Focus · Balanced Selection · Sparse Evolutionary Training
