Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning

Lan Zhang; Marco Valentino; Andre Freitas

arXiv:2506.10903·cs.CL·June 13, 2025

Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning

Lan Zhang, Marco Valentino, Andre Freitas

PDF

Open Access

TL;DR

This paper introduces an epistemic ensemble of LLM judges that systematically evaluates autoformalization in formal mathematics across multiple nuanced criteria, outperforming coarse-grained methods and aligning more closely with human judgment.

Contribution

It proposes a novel, multi-criteria LLM ensemble framework for automatic evaluation of formal mathematical reasoning, enhancing accuracy and interpretability.

Findings

01

The ensemble correlates better with human assessments than coarse models.

02

It effectively evaluates logical preservation, mathematical consistency, and formal validity.

03

The method offers a scalable and transparent evaluation approach.

Abstract

Autoformalization plays a crucial role in formal mathematical reasoning by enabling the automatic translation of natural language statements into formal languages. While recent advances using large language models (LLMs) have shown promising results, methods for automatically evaluating autoformalization remain underexplored. As one moves to more complex domains (e.g., advanced mathematics), human evaluation requires significant time and domain expertise, especially as the complexity of the underlying statements and background knowledge increases. LLM-as-a-judge presents a promising approach for automating such evaluation. However, existing methods typically employ coarse-grained and generic evaluation criteria, which limit their effectiveness for advanced formal mathematical reasoning, where quality hinges on nuanced, multi-granular dimensions. In this work, we take a step toward…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMathematics, Computing, and Information Processing · Machine Learning in Materials Science · Model Reduction and Neural Networks

MethodsSparse Evolutionary Training