The CompMath-MCQ Dataset: Are LLMs Ready for Higher-Level Math?
Bianca Raimondi, Francesco Pivi, Davide Evangelista, Maurizio Gabbrielli

TL;DR
This paper introduces CompMath-MCQ, a new benchmark dataset with graduate-level math questions designed to evaluate the reasoning capabilities of Large Language Models in advanced mathematical topics.
Contribution
The paper presents a novel, carefully curated multiple-choice dataset for assessing LLMs on complex mathematical reasoning beyond elementary problems.
Findings
State-of-the-art LLMs struggle with advanced mathematical reasoning.
The dataset enables objective and reproducible evaluation.
Questions are newly created to prevent data leakage.
Abstract
The evaluation of Large Language Models (LLMs) on mathematical reasoning has largely focused on elementary problems, competition-style questions, or formal theorem proving, leaving graduate-level and computational mathematics relatively underexplored. We introduce CompMath-MCQ, a new benchmark dataset for assessing LLMs on advanced mathematical reasoning in a multiple-choice setting. The dataset consists of 1,500 original questions authored by professors of graduate-level courses, covering topics including Linear Algebra, Numerical Optimization, Vector Calculus, Probability, and Python-based scientific computing. Each question offers three answer options, exactly one of which is correct. To ensure the absence of data leakage, all questions are newly created and not sourced from existing materials. The validity of questions is verified through a procedure based on…
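Because each question has three options and exactly one correct answer, a model's score can be read against a chance baseline of 1/3. The following is a minimal sketch of that scoring, not the authors' evaluation code: the `MCQ` record, its field names, the `accuracy` helper, and the toy questions are all assumptions made for illustration and are not taken from CompMath-MCQ.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MCQ:
    """One three-option question. Field names are hypothetical."""
    question: str
    options: Tuple[str, str, str]
    answer: int  # index (0, 1, or 2) of the single correct option


def accuracy(items: Sequence[MCQ], predictions: Sequence[int]) -> float:
    """Fraction of questions whose predicted index matches the answer index."""
    if len(items) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    if not items:
        raise ValueError("empty evaluation set")
    correct = sum(item.answer == pred for item, pred in zip(items, predictions))
    return correct / len(items)


# With three options and one correct answer, uniform random guessing
# scores 1/3 in expectation.
CHANCE_ACCURACY = 1 / 3

# Toy items invented for this sketch; they are NOT from the dataset.
toy_items = [
    MCQ("What is the rank of the 2x2 identity matrix?", ("0", "1", "2"), 2),
    MCQ("What is the derivative of f(x) = x^2 at x = 1?", ("1", "2", "4"), 1),
    MCQ("What is numpy.zeros(3).shape?", ("(3,)", "(1, 3)", "3"), 0),
]

# Hypothetical model outputs: the second prediction is wrong.
toy_predictions = [2, 0, 0]
toy_accuracy = accuracy(toy_items, toy_predictions)

print(f"accuracy = {toy_accuracy:.3f}, chance = {CHANCE_ACCURACY:.3f}")
```

Storing the answer as an index rather than a letter keeps the comparison independent of how options are labelled when they are shown to a model.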
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Machine Learning in Materials Science · Mathematics Education and Programs
