Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning
Lan Zhang, Marco Valentino, Andre Freitas

TL;DR
This paper introduces an epistemic ensemble of LLM judges that systematically evaluates autoformalization in formal mathematics across multiple nuanced criteria, outperforming coarse-grained methods and aligning more closely with human judgment.
Contribution
It proposes a novel, multi-criteria LLM ensemble framework for automatic evaluation of formal mathematical reasoning, enhancing accuracy and interpretability.
Findings
The ensemble correlates better with human assessments than coarse models.
It effectively evaluates logical preservation, mathematical consistency, and formal validity.
The method offers a scalable and transparent evaluation approach.
Abstract
Autoformalization plays a crucial role in formal mathematical reasoning by enabling the automatic translation of natural language statements into formal languages. While recent advances using large language models (LLMs) have shown promising results, methods for automatically evaluating autoformalization remain underexplored. As one moves to more complex domains (e.g., advanced mathematics), human evaluation requires significant time and domain expertise, especially as the complexity of the underlying statements and background knowledge increases. LLM-as-a-judge presents a promising approach for automating such evaluation. However, existing methods typically employ coarse-grained and generic evaluation criteria, which limit their effectiveness for advanced formal mathematical reasoning, where quality hinges on nuanced, multi-granular dimensions. In this work, we take a step toward…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMathematics, Computing, and Information Processing · Machine Learning in Materials Science · Model Reduction and Neural Networks
MethodsSparse Evolutionary Training
