Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math

Guijin Son; Donghun Yang; Hitesh Laxmichand Patel; Hyunwoo Ko; Amit Agarwal; Sunghee Ahn; Kyong-Ha Lee; Youngjae Yu

arXiv:2602.06291 · cs.CL · February 9, 2026

TL;DR

This paper introduces Consequence-Based Utility, an oracle-free evaluator for research-level math solutions. It scores each candidate solution by how useful it is as an in-context exemplar for solving related, verifiable questions, and it ranks candidates more accurately than reward models and LLM judges.

Contribution

It proposes a novel consequence-based evaluation approach that outperforms reward models and LLM judges in ranking research-level math solutions without requiring an oracle.

Findings

01 Outperforms reward models and LLM judges in ranking quality

02 Improves accuracy metrics significantly on GPT-OSS-120B and GPT-OSS-20B

03 Maintains strong separation between correct and incorrect solutions even when solvers fail

Abstract

Recent progress in reasoning models suggests that generating plausible attempts for research-level mathematics may be within reach, but verification remains a bottleneck, consuming scarce expert time. We hypothesize that a meaningful solution should contain enough method-level information that, when applied to a neighborhood of related questions, it yields better downstream performance than incorrect solutions. Building on this idea, we propose Consequence-Based Utility, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar in solving related yet verifiable questions. Our approach is evaluated on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Notably, Consequence-Based Utility consistently outperforms reward models, generative reward models, and…
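
The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Solver` interface, the exact-match answer check, the accuracy-based score, and the toy solver standing in for an LLM call are all assumptions made for illustration, since this excerpt does not specify the prompting, neighbor construction, or aggregation used in the paper.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a neighbor question paired with its verifiable answer.
Neighbor = Tuple[str, str]
# Hypothetical solver interface: (exemplar_solution, question) -> answer string.
Solver = Callable[[str, str], str]


def consequence_utility(candidate: str, neighbors: List[Neighbor], solver: Solver) -> float:
    """Fraction of neighbor questions solved with `candidate` as the in-context exemplar.

    Assumption: utility is plain accuracy under exact-match checking.
    """
    if not neighbors:
        raise ValueError("need at least one verifiable neighbor question")
    correct = sum(solver(candidate, q).strip() == a.strip() for q, a in neighbors)
    return correct / len(neighbors)


def rank_candidates(
    candidates: Dict[str, str], neighbors: List[Neighbor], solver: Solver
) -> List[Tuple[str, float]]:
    """Score every candidate and sort from highest to lowest utility."""
    scored = [
        (name, consequence_utility(text, neighbors, solver))
        for name, text in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_solver(exemplar: str, question: str) -> str:
    """Stand-in for an LLM call (assumption): copies whichever formula the exemplar states.

    The toy task is "sum of the first n odd numbers", whose true answer is n^2.
    """
    n = int(question)
    return str(n * n) if "n^2" in exemplar else str(n * (n + 1))


# Neighbor questions with known answers: sum of the first n odd numbers.
neighbors = [("3", "9"), ("5", "25"), ("10", "100")]
candidates = {
    "A": "The sum of the first n odd numbers is n^2.",
    "B": "The sum of the first n odd numbers is n(n+1).",
}

ranking = rank_candidates(candidates, neighbors, toy_solver)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

No ground-truth label for the original problem is consulted: the candidates are separated only by how well the solver does on the verifiable neighbors, which is the sense in which the evaluator is oracle-free. In practice `toy_solver` would be replaced by a call to a language model with the candidate solution placed in its context.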

Code & Models

Datasets

amphora/ExpertMath
dataset· 35 dl

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Explainable Artificial Intelligence (XAI) · Scientific Computing and Data Management