LLMs on Trial: Evaluating Judicial Fairness for Large Language Models

Yiran Hu; Zongyue Xue; Haitao Li; Siyuan Zheng; Qingjing Chen; Shaochun Wang; Xihan Zhang; Ning Zheng; Yun Liu; Qingyao Ai; Yiqun Liu; Charles L.A. Clarke; Weixing Shen

arXiv:2507.10852·cs.CL·August 5, 2025

Open Access · 3 Reviews

TL;DR

This paper evaluates the fairness of large language models in judicial contexts, revealing widespread bias and inconsistency, and provides a comprehensive framework and toolkit for assessing and improving LLM fairness in high-stakes decision-making.

Contribution

It introduces a novel framework for measuring LLM judicial fairness, constructs a large dataset, and develops evaluation metrics and tools to analyze and address biases in LLMs used as judges.

Findings

01 · LLMs exhibit pervasive inconsistency, bias, and imbalanced inaccuracy in judicial tasks.

02 · Biases are more pronounced on demographic labels than on procedural or substantive labels.

03 · Increasing model accuracy can worsen biases, while adjusting temperature influences fairness.

Abstract

Large Language Models (LLMs) are increasingly used in high-stakes fields where their decisions impact rights and equity. However, LLMs' judicial fairness and implications for social justice remain underexplored. When LLMs act as judges, the ability to fairly resolve judicial issues is a prerequisite to ensure their trustworthiness. Based on theories of judicial fairness, we construct a comprehensive framework to measure LLM fairness, leading to a selection of 65 labels and 161 corresponding values. Applying this framework to the judicial system, we compile an extensive dataset, JudiFair, comprising 177,100 unique case facts. To achieve robust statistical inference, we develop three evaluation metrics, inconsistency, bias, and imbalanced inaccuracy, and introduce a method to assess the overall fairness of multiple LLMs across various labels. Through experiments with 16 LLMs, we uncover…
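The dataset size and the idea of an inconsistency metric can be illustrated with a minimal sketch. It rests on two assumptions that this page does not confirm. First, the base-document count of about 1,100 comes from Reviewer 01's comment further down, not from the abstract; with exactly 1,100 base documents and one counterfactual rewrite per label value, the product matches the 177,100 case facts reported. Second, the paper's actual definition of inconsistency is not given here, so `inconsistency_rate` is a hypothetical stand-in: the share of (base case, label) groups in which changing only the label value changes the model's outcome. All function and variable names are illustrative.

```python
from collections import defaultdict

# Assumption: ~1,100 base documents (Reviewer 01's figure), each rewritten
# once per label value. 161 label values is stated in the abstract.
BASE_DOCS = 1_100
LABEL_VALUES = 161


def counterfactual_count(base_docs: int, label_values: int) -> int:
    """Number of case facts if every base document gets one variant per label value."""
    return base_docs * label_values


def inconsistency_rate(predictions) -> float:
    """Hypothetical stand-in metric, not the paper's definition.

    predictions: iterable of (base_id, label, value, outcome) tuples.
    Returns the share of (base_id, label) groups whose outcomes are not
    all identical across the label's values.
    """
    outcomes_by_group = defaultdict(set)
    for base_id, label, _value, outcome in predictions:
        outcomes_by_group[(base_id, label)].add(outcome)
    if not outcomes_by_group:
        return 0.0
    changed = sum(1 for outcomes in outcomes_by_group.values() if len(outcomes) > 1)
    return changed / len(outcomes_by_group)


# Toy data: outcomes are made-up sentence lengths in months.
toy_predictions = [
    ("case_1", "gender", "male", 36),
    ("case_1", "gender", "female", 36),   # same outcome: consistent
    ("case_2", "gender", "male", 24),
    ("case_2", "gender", "female", 30),   # outcome shifts: inconsistent
]

print(counterfactual_count(BASE_DOCS, LABEL_VALUES))  # 177100
print(inconsistency_rate(toy_predictions))            # 0.5
```

A group-level rate like this only says whether the outcome moved at all. It says nothing about direction, which is what a separate bias metric would need to capture.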

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- Procedural fairness framework: First systematic evaluation of how procedural factors (court level, trial broadcast, litigation duration) affect LLM judicial decisions, grounded in legal theory [7,8]
- Comprehensive label system: 65 labels across four categories (substance/procedure × demographic/non-demographic) represents 7× expansion over prior work [1]
- Bernoulli tests for aggregate significance, fixed-effects regression with cluster-robust standard errors, and five robustness checks exce

Weaknesses

- Misleading scale framing: "177,100 unique case facts" refers to counterfactual variations of ~1,100 base documents, not distinct cases. This should be stated more transparently upfront
- No intersectionality analysis: Single-axis testing misses compound marginalization (e.g., gender × ethnicity interactions), a well-established concern in fairness research [4,5]
- Oversold "counterintuitive" findings: The inconsistency-bias negative correlation is a statistical artifact (noise obscures systema

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper presents a systematic framework to evaluate judicial fairness in large language models, introducing a novel benchmark dataset, JudiFair, encompassing 177,100 unique case facts annotated with 65 labels and 161 label values.
2. It formulates three evaluation metrics (inconsistency, bias, and imbalanced inaccuracy) and proposes a robust statistical inference methodology to assess overall fairness across multiple LLMs and various labels.
3. It conducts comprehensive experiments on 16 LLM

Weaknesses

1. The generalizability is limited by focusing exclusively on Chinese criminal law; while the authors claim the framework is transferable, cultural and legal system differences may significantly affect findings in other jurisdictions.
2. It lacks ample theoretical discussion on fairness in law and philosophy as claimed in the paper.
3. The paper does not clearly explain how "effective sample size" for weighting is calculated, making the implementation details insufficient.
4. The paper lacks con

Reviewer 03 · Rating 2 · Confidence 3

Strengths

* Data contribution and methodology. The paper provides a practical dataset and annotation framework that can be adapted to study judicial biases in other jurisdictions or legal systems.
* Reproducibility. The authors open-source their codebase, making it a potentially valuable resource for follow-up work in computational law and fairness research.

Weaknesses

* Overstated novelty:
  1. The claimed contributions appear incremental rather than foundational. The distinction between substantive and procedural factors was already present in LEEC; while the authors expand the label set / include counterfactual samples, in my opinion, this does not constitute a “comprehensive systematic framework” as claimed.
  2. The JudiFair dataset is essentially an extension of LEEC, with more fine-grained labels and counterfactual augmentations, rather than a new dataset

Taxonomy

Topics: Artificial Intelligence in Law · Comparative and International Law Studies · Law, AI, and Intellectual Property