Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math
Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Hyunwoo Ko, Amit Agarwal, Sunghee Ahn, Kyong-Ha Lee, Youngjae Yu

TL;DR
This paper introduces Consequence-Based Utility, an oracle-free method for evaluating research-level math solutions. It scores each candidate by how much it helps a model solve related, verifiable questions when used as an in-context exemplar, and it ranks solutions more accurately than existing reward models and LLM judges.
Contribution
The paper proposes a novel consequence-based evaluation approach that ranks research-level math solutions without requiring an oracle and outperforms reward models and LLM judges at this task.
Findings
Outperforms reward models and LLM judges in ranking quality
Improves accuracy metrics significantly on GPT-OSS-120B and GPT-OSS-20B
Maintains strong separation between correct and incorrect solutions even when solvers fail
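One common way to quantify "separation" between correct and incorrect solutions is AUROC: the probability that a randomly chosen correct solution receives a higher evaluator score than a randomly chosen incorrect one. The listing does not say which separation metric the paper reports, so the sketch below is an illustrative assumption, not the authors' evaluation code. The example scores are made up.

```python
from itertools import product


def auroc(correct_scores: list[float], incorrect_scores: list[float]) -> float:
    """Probability that a correct solution outscores an incorrect one.

    Ties count as half a win, which makes this equal to the Mann-Whitney U
    statistic divided by (number of correct * number of incorrect).
    """
    if not correct_scores or not incorrect_scores:
        raise ValueError("need at least one correct and one incorrect score")
    wins = 0.0
    for c, i in product(correct_scores, incorrect_scores):
        if c > i:
            wins += 1.0
        elif c == i:
            wins += 0.5
    return wins / (len(correct_scores) * len(incorrect_scores))


# Made-up evaluator scores for one problem: 1 correct vs 9 incorrect solutions.
print(auroc([0.8], [0.2, 0.5, 0.8, 0.1, 0.3, 0.4, 0.6, 0.7, 0.0]))
```

A value of 1.0 means every correct solution outscores every incorrect one, and 0.5 means the evaluator separates them no better than chance.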
Abstract
Recent progress in reasoning models suggests that generating plausible attempts for research-level mathematics may be within reach, but verification remains a bottleneck, consuming scarce expert time. We hypothesize that a meaningful solution should contain enough method-level information that, when applied to a neighborhood of related questions, it yields better downstream performance than an incorrect solution. Building on this idea, we propose Consequence-Based Utility, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar in solving related yet verifiable questions. Our approach is evaluated on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Notably, Consequence-Based Utility consistently outperforms reward models, generative reward models, and…
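The scoring idea in the abstract can be sketched as follows: place each candidate solution in the prompt as a worked exemplar, ask a solver to answer related questions whose answers can be checked, and use the resulting accuracy as that candidate's utility. This is a minimal sketch under stated assumptions. The prompt template, the `Solver` interface, exact-string answer matching, and all names (`NeighborQuestion`, `build_prompt`, `consequence_utility`, `rank_candidates`, `toy_solver`) are hypothetical; the listing does not describe how the paper builds the question neighborhood or checks answers. The toy solver stands in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class NeighborQuestion:
    prompt: str
    answer: str  # verifiable reference answer (assumed to be a short string)


# Hypothetical interface: a solver maps a prompt to a final-answer string.
# In practice this would wrap a call to a language model.
Solver = Callable[[str], str]


def build_prompt(exemplar: str, question: str) -> str:
    # Hypothetical template: the candidate solution is shown as a worked example.
    return f"Worked example:\n{exemplar}\n\nNow solve:\n{question}\nAnswer:"


def consequence_utility(candidate: str, neighbors: list[NeighborQuestion],
                        solver: Solver, samples: int = 1) -> float:
    """Fraction of neighbor questions answered correctly (exact match, an
    assumption) when `candidate` is used as the in-context exemplar."""
    total = correct = 0
    for q in neighbors:
        prompt = build_prompt(candidate, q.prompt)
        for _ in range(samples):
            total += 1
            if solver(prompt).strip() == q.answer.strip():
                correct += 1
    return correct / total if total else 0.0


def rank_candidates(candidates: list[str], neighbors: list[NeighborQuestion],
                    solver: Solver, samples: int = 1):
    """Return candidate indices sorted by utility (highest first) and the scores."""
    scores = [consequence_utility(c, neighbors, solver, samples) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy usage: the stand-in solver "succeeds" only if the exemplar contains the
# made-up key idea "key lemma", mimicking a solution whose method transfers.
answers = {"Q1": "4", "Q2": "9"}


def toy_solver(prompt: str) -> str:
    question = prompt.rsplit("Now solve:\n", 1)[1].split("\n", 1)[0]
    return answers[question] if "key lemma" in prompt else "0"


neighbors = [NeighborQuestion("Q1", "4"), NeighborQuestion("Q2", "9")]
order, scores = rank_candidates(
    ["Proof that applies the key lemma", "Proof with a flawed case split"],
    neighbors, toy_solver)
print(order, scores)
```

Because utility is measured only on questions with checkable answers, no reference solution to the original research-level problem is needed, which is what makes the approach oracle-free.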
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Explainable Artificial Intelligence (XAI) · Scientific Computing and Data Management
