Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline
Yichen Huang, Lin F. Yang

TL;DR
The paper introduces a model-agnostic verification-and-refinement pipeline, built from carefully designed prompts, that lets Gemini 2.5 Pro, Grok-4, or GPT-5 solve 5 of the 6 IMO 2025 problems, compared with baseline accuracies between 21.4% and 38.1%.
Contribution
It presents a model-agnostic, prompt-based verification-and-refinement pipeline that raises reasoning accuracy on Olympiad-level math problems well above what the same models achieve at baseline.
Findings
With the pipeline, each of Gemini 2.5 Pro, Grok-4, and GPT-5 solved 5 of the 6 IMO 2025 problems (reported as approximately 85.7% accuracy).
Baseline accuracies were much lower: 31.6% (Gemini 2.5 Pro), 21.4% (Grok-4), and 38.1% (GPT-5).
The pipeline works with existing models through prompting and requires no retraining.
Abstract
The International Mathematical Olympiad (IMO) is widely regarded as the world championship of high-school mathematics. IMO problems are renowned for their difficulty and novelty, demanding deep insight, creativity, and rigor. Although large language models perform well on many mathematical benchmarks, they often struggle with Olympiad-level problems. Using carefully designed prompts, we construct a model-agnostic, verification-and-refinement pipeline. We demonstrate its effectiveness on the recent IMO 2025, avoiding data contamination for models released before the competition. Equipped with any of the three leading models -- Gemini 2.5 Pro, Grok-4, or GPT-5 -- our pipeline correctly solved 5 out of the 6 problems (85.7% accuracy). This is in sharp contrast to their baseline accuracies: 31.6% (Gemini 2.5 Pro), 21.4% (Grok-4), and 38.1% (GPT-5), obtained by selecting the best of…
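The abstract does not spell out the pipeline's steps, so the following is only a generic sketch of what a verification-and-refinement loop can look like, not the authors' implementation. The names `Verdict`, `verify_and_refine`, `generate`, `verify`, `refine`, and `max_rounds` are all hypothetical; in practice the three callables would wrap prompts sent to whichever model is plugged in, which is what makes such a loop model-agnostic. The toy stand-ins at the bottom exist only so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Result of one verification pass (hypothetical structure)."""
    accepted: bool
    feedback: str = ""


def verify_and_refine(
    problem: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Verdict],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 5,
) -> Optional[str]:
    """Generate a candidate, then alternate verification and refinement.

    Returns the first candidate the verifier accepts, or None if none is
    accepted after `max_rounds` refinements. All callables are assumptions:
    they stand in for prompts to an arbitrary language model.
    """
    solution = generate(problem)
    for round_index in range(max_rounds + 1):
        verdict = verify(problem, solution)
        if verdict.accepted:
            return solution
        if round_index < max_rounds:
            # Feed the verifier's critique back into the next attempt.
            solution = refine(problem, solution, verdict.feedback)
    return None


# Toy stand-ins so the sketch runs without a model (illustrative only).
def toy_generate(problem: str) -> str:
    return "5"  # deliberately wrong first attempt


def toy_verify(problem: str, solution: str) -> Verdict:
    if solution == "4":
        return Verdict(accepted=True)
    return Verdict(accepted=False, feedback="expected 4")


def toy_refine(problem: str, solution: str, feedback: str) -> str:
    return feedback.split()[-1]  # adopt the value named in the feedback


result = verify_and_refine("2 + 2", toy_generate, toy_verify, toy_refine)
```

Because the loop only depends on the three callables, swapping one model for another means swapping the functions that wrap its prompts, with no change to the control flow.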
