CombiGraph-Vis: A Curated Multimodal Olympiad Benchmark for Discrete Mathematical Reasoning

Hamed Mahdavi (Pennsylvania State University); Pouria Mahdavinia (Pennsylvania State University); Alireza Farhadi (Amirkabir University of Technology); Pegah Mohammadipour (Pennsylvania State University); Samira Malek (Pennsylvania State University); Majid Daliri (New York University); Pedram Mohammadipour (Amirkabir University of Technology); Alireza Hashemi (City University of New York); Amir Khasahmadi (Autodesk); Vasant Honavar (Pennsylvania State University)

arXiv:2510.27094·cs.AI·November 3, 2025

Open Access

TL;DR

This paper evaluates large language models' ability to grade mathematical proofs and solutions, introduces agentic workflows for improved grading accuracy, and demonstrates enhanced agreement with human assessments.

Contribution

It presents novel workflows for automated proof and solution grading, addressing calibration gaps and improving consistency with human judgments.

Findings

01. Models reliably detect incorrect solutions
02. Proposed workflows improve grading agreement with humans
03. Enhanced handling of partial credit in automated grading

Abstract

State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a…

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Advanced Graph Neural Networks · Intelligent Tutoring Systems and Adaptive Learning