CombiGraph-Vis: A Curated Multimodal Olympiad Benchmark for Discrete Mathematical Reasoning
Hamed Mahdavi (Pennsylvania State University), Pouria Mahdavinia (Pennsylvania State University), Alireza Farhadi (Amirkabir University of Technology), Pegah Mohammadipour (Pennsylvania State University), Samira Malek (Pennsylvania State University)

TL;DR
This paper evaluates large language models' ability to grade mathematical proofs and solutions, introduces agentic workflows for improved grading accuracy, and demonstrates enhanced agreement with human assessments.
Contribution
It presents novel workflows for automated proof and solution grading, addressing calibration gaps and improving consistency with human judgments.
Findings
Models reliably detect incorrect solutions
Proposed workflows improve grading agreement with humans
Enhanced handling of partial credit in automated grading
Abstract
State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMathematics, Computing, and Information Processing · Advanced Graph Neural Networks · Intelligent Tutoring Systems and Adaptive Learning
