FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

Zeynel A. Ulu\c{s}an; Burak S. Akbudak; Can S. Erer; G\"ozde G\"ul \c{S}ahin

arXiv:2605.10141·cs.AI·May 12, 2026

FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

Zeynel A. Ulu\c{s}an, Burak S. Akbudak, Can S. Erer, G\"ozde G\"ul \c{S}ahin

PDF

TL;DR

This paper introduces FormalRewardBench, a new benchmark for evaluating reward models in formal theorem proving, highlighting the performance gap between general LLMs and specialized provers.

Contribution

It presents the first benchmark for reward model evaluation in formal theorem proving, with curated error injection strategies and comprehensive model assessments.

Findings

01

Frontier LLMs achieve 59.8% accuracy in proof evaluation.

02

Specialized theorem provers perform worse, at 24.4%.

03

Proof evaluation ability does not transfer well from theorem proving skills.

Abstract

Recent neural theorem provers use reinforcement learning with verifiable rewards (RLVR), where proof assistants provide binary correctness signals. While verifiable rewards are cheap and scalable without reward hacking issues, they suffer from sparse credit assignment: models receive no learning signal from difficult problems where partial progress goes unrewarded. This motivates learned reward models that can evaluate proof quality beyond binary verification. However, comparing reward models is challenging since it typically requires expensive RL training ablations. To address this, we introduce \textbf{FormalRewardBench}, the first benchmark for evaluating reward models in formal theorem proving with Lean 4. Our benchmark consists of 250 preference pairs where correct proofs are paired with incorrect variants generated through five expert curated error injection strategies: forced…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.