Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

Himanshu Gupta; Shreyas Verma; Ujjwala Anantheswaran; Kevin Scaria; Mihir Parmar; Swaroop Mishra; Chitta Baral

arXiv:2410.14702·cs.AI·May 12, 2026·2 cites

TL;DR

PolyMATH is a challenging multi-modal reasoning benchmark with 5,000 images across 10 categories, revealing current MLLMs' struggles with spatial and high-level reasoning tasks.

Contribution

This paper introduces PolyMATH, a new benchmark for evaluating the multi-modal reasoning of MLLMs; its results expose their current limitations and indicate where improvement is needed.

Findings

01

The best model, Claude-3.5 Sonnet, reaches only about 41% accuracy, followed by GPT-4o at about 36% and Gemini-1.5 Pro at about 27%, indicating high difficulty.

02

Models struggle with spatial relations and high-level reasoning tasks.

03

Replacing images with textual descriptions yields only 4% performance improvement.
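
The findings above are stated as accuracy figures, overall and by category. As a minimal sketch of how such numbers could be tallied, the snippet below computes per-category and overall accuracy from a list of result records. The record fields (`category`, `prediction`, `answer`), the case-insensitive exact-match scoring, and the toy data are all assumptions made for illustration; they are not the benchmark's actual schema, scoring rule, or results.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: fraction correct} for an iterable of result dicts.

    Each record is assumed (hypothetically) to carry three string fields:
    'category', 'prediction', and 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        # Assumed scoring rule: case-insensitive exact match on the answer label.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


def overall_accuracy(records):
    """Return the fraction of all records whose prediction matches the answer."""
    records = list(records)
    hits = sum(
        r["prediction"].strip().upper() == r["answer"].strip().upper()
        for r in records
    )
    return hits / len(records)


# Made-up toy results for illustration only; not benchmark data.
toy_results = [
    {"category": "spatial_reasoning", "prediction": "A", "answer": "B"},
    {"category": "spatial_reasoning", "prediction": "c", "answer": "C"},
    {"category": "pattern_recognition", "prediction": "D", "answer": "D"},
]

per_category = accuracy_by_category(toy_results)
overall = overall_accuracy(toy_results)
```

Reporting the per-category breakdown alongside the overall figure is what makes a weakness in a single category, such as spatial reasoning, visible rather than averaged away.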

Abstract

Multi-modal Large Language Models (MLLMs) exhibit impressive problem-solving abilities in various domains, but their visual comprehension and abstract reasoning skills remain under-evaluated. To this end, we present PolyMATH, a challenging benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs. PolyMATH comprises 5,000 manually collected high-quality images of cognitive textual and visual challenges across 10 distinct categories, including pattern recognition, spatial reasoning, and relative reasoning. We conducted a comprehensive and quantitative evaluation of 15 MLLMs using four diverse prompting strategies, including Chain-of-Thought and Step-Back. The best scores achieved on PolyMATH are ~41%, ~36%, and ~27%, obtained by Claude-3.5 Sonnet, GPT-4o, and Gemini-1.5 Pro respectively, highlighting the logical and visual complexity of these questions. A further…

Code & Models

Repositories

polymathbenchmark/PolyMATH
github

Datasets

him1411/polymath
dataset· 651 dl