LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal

Michał Karp; Anna Kubaszewska; Magdalena Król; Robert Król; Aleksander Smywiński-Pohl; Mateusz Szymański; Witold Wydmański

arXiv:2511.04205 · cs.CL · November 7, 2025

Open Access

TL;DR

This paper empirically evaluates whether current large language models can pass a legal qualifying exam and act as judges. It finds that they fall short in practical judgment and legal reasoning, and therefore cannot yet replace human legal experts.

Contribution

The paper provides the first comprehensive assessment of LLMs on a real-world legal examination and highlights their limitations in legal reasoning and judgment.

Findings

01. LLMs achieved high scores on the knowledge test but failed the practical written task.

02. The models' evaluations often diverged from those of the official legal examiners.

03. Current LLMs cannot replace human judges in legal adjudication.

Abstract

This study provides an empirical assessment of whether current large language models (LLMs) can pass the official qualifying examination for membership in Poland's National Appeal Chamber (Krajowa Izba Odwoławcza). The authors examine two related ideas: using LLMs as actual exam candidates and applying the 'LLM-as-a-judge' approach, in which model-generated answers are automatically evaluated by other models. The paper describes the structure of the exam, which includes a multiple-choice knowledge test on public procurement law and a written judgment, and presents the hybrid information recovery and extraction pipeline built to support the models. Several LLMs (including GPT-4.1, Claude 4 Sonnet and Bielik-11B-v2.6) were tested in closed-book and various Retrieval-Augmented Generation settings. The results show that although the models achieved satisfactory scores in the knowledge…

Taxonomy

Topics: Artificial Intelligence in Law · Ethics and Social Impacts of AI · Legal Language and Interpretation