PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice

Yuzhen Shi; Huanghai Liu; Yiran Hu; Gaojie Song; Xinran Xu; Yubo Ma; Tianyi Tang; Li Zhang; Qingjing Chen; Di Feng; Wenbo Lv; Weiheng Wu; Kexin Yang; Sen Yang; Wei Wang; Rongyao Shi; Yuanyang Qiu; Yuemeng Qi; Jingwen Zhang; Xiaoyu Sui; Yifan Chen; Yi Zhang; An Yang; Bowen Yu; Dayiheng Liu; Junyang Lin; Weixing Shen; Bing Zhao; Charles L.A. Clarke; Hu Wei

arXiv:2601.16669 · cs.CL · January 29, 2026

Open Access

TL;DR

PLawBench is a comprehensive, real-world legal benchmark that evaluates LLMs on complex legal tasks using detailed rubrics, exposing current models' limitations in legal reasoning and document generation.

Contribution

Introduces PLawBench, a realistic legal benchmark with fine-grained evaluation rubrics, to better assess LLMs' legal reasoning in practical scenarios.

Findings

01. Current LLMs perform poorly on fine-grained legal reasoning tasks.

02. PLawBench reveals significant gaps in LLMs' ability to handle real-world legal workflows.

03. Expert-aligned evaluation exposes limitations of state-of-the-art models.

Abstract

As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model's ability to identify…

Taxonomy

Topics: Artificial Intelligence in Law · Legal Language and Interpretation · Topic Modeling