From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks

Andreas Stephan; Dawei Zhu; Matthias Aßenmacher; Xiaoyu Shen; Benjamin Roth

arXiv:2409.04168 · cs.CL · May 14, 2025


Open Access

TL;DR

This paper investigates how well large language models serve as judges on mathematical reasoning tasks. It finds that they are good at identifying the better of the candidate models but of limited use for improving overall task performance.

Contribution

It provides a detailed analysis of LLM judges on mathematical tasks and demonstrates that simple features can predict their judgments with high accuracy.

Findings

01

Easy samples are easier to judge correctly.

02

Judgment performance correlates with model quality.

03

LLM judges often favor higher-quality models even when those models' answers are incorrect.

Abstract

To reduce the need for human annotations, large language models (LLMs) have been proposed as judges of the quality of other candidate models. The performance of LLM judges is typically evaluated by measuring the correlation with human judgments on generative tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that easy samples are easy to judge, and difficult samples are difficult to judge. Our analysis uncovers a strong correlation between judgment performance and the candidate model task performance, indicating that judges tend to favor higher-quality models even if their answer is incorrect. As a consequence, we test whether we can…
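Because the correctness of a math answer can be checked against a reference, a judge's choice can be scored directly instead of being compared with human preferences. The following is a minimal sketch of that idea, not the paper's evaluation code: the `JudgedPair` record, the exact-match `is_correct` check, and the restriction to pairs where exactly one answer is correct are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JudgedPair:
    """Hypothetical record: two candidate answers, the reference, and the judge's pick."""
    answer_a: str
    answer_b: str
    reference: str
    judge_choice: str  # "A" or "B"


def is_correct(answer: str, reference: str) -> bool:
    # Assumption: exact string match after trimming whitespace.
    # Real math benchmarks usually need answer normalization.
    return answer.strip() == reference.strip()


def judge_accuracy(pairs: List[JudgedPair]) -> Optional[float]:
    """Fraction of decidable pairs where the judge picked the correct answer.

    A pair is treated as decidable only if exactly one answer is correct
    (an assumption made for this sketch). Returns None if no pair qualifies.
    """
    decidable = [
        p for p in pairs
        if is_correct(p.answer_a, p.reference) != is_correct(p.answer_b, p.reference)
    ]
    if not decidable:
        return None
    hits = sum(
        1 for p in decidable
        if (p.judge_choice == "A") == is_correct(p.answer_a, p.reference)
    )
    return hits / len(decidable)


# Made-up toy data, for illustration only.
example_pairs = [
    JudgedPair("42", "41", "42", "A"),  # judge picks the correct answer
    JudgedPair("7", "8", "8", "A"),     # judge picks the incorrect answer
    JudgedPair("3", "3", "3", "B"),     # both correct: not decidable, skipped
    JudgedPair("10", "12", "12", "B"),  # judge picks the correct answer
]

if __name__ == "__main__":
    print(judge_accuracy(example_pairs))
```

Pairs where both answers are correct, or both are incorrect, are skipped here because no choice between them can be scored as right or wrong; a different protocol could handle them differently.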


Taxonomy

Topics: Legal Education and Practice Innovations