HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class

James V. Roggeveen; Erik Y. Wang; Will Flintoft; Peter Donets; Lucy S. Nathwani; Nickholas Gutierrez; David Ettel; Anton Marius Graf; Siddharth Dandavate; Arjun Nageswaran; Raglan Ward; Ava Williamson; Anne Mykland; Kacper K. Migacz; Yijun Wang; Egemen Bostan; Duy Thuc Nguyen; Zhe He; Marc L. Descoteaux; Felix Yeung; Shida Liu; Jorge García Ponce; Luke Zhu; Yuyang Chen; Ekaterina S. Ivshina; Miguel Fernandez; Minjae Kim; Kennan Gumbs; Matthew Scott Tan; Russell Yang; Mai Hoang; David Brown; Isabella A. Silveira; Lavon Sykes; Ahmed Roman; William Fredenberg; Yiming Chen; Lucas Martin; Yixing Tang; Kelly Werker Smith; Hongyu Liao; Logan G. Wilson; Alexander Dazhen Cai; Andrea Elizabeth Biju; Michael P. Brenner

arXiv:2505.11774·cs.LG·May 20, 2025

TL;DR

HARDMath2 is a new benchmark dataset of 211 applied mathematics problems created by students and instructors, designed to evaluate and improve the mathematical reasoning capabilities of large language models in approximation-based problems.

Contribution

This paper introduces HARDMath2, a collaboratively developed dataset of graduate-level applied math problems, and demonstrates its effectiveness in highlighting current LLM limitations and fostering student understanding.

Findings

01. Leading models struggle with many HARDMath2 problems.

02. Student interaction enhances problem difficulty and understanding.

03. Benchmark reveals gaps in current LLM mathematical reasoning.

Abstract

Large language models (LLMs) have shown remarkable progress in mathematical problem-solving, but evaluation has largely focused on problems that have exact analytical solutions or involve formal proofs, often overlooking approximation-based problems ubiquitous in applied science and engineering. To fill this gap, we build on prior work and present HARDMath2, a dataset of 211 original problems covering the core topics in an introductory graduate applied math class, including boundary-layer analysis, WKB methods, asymptotic solutions of nonlinear partial differential equations, and the asymptotics of oscillatory integrals. This dataset was designed and verified by the students and instructors of a core graduate applied mathematics course at Harvard. We build the dataset through a novel collaborative environment that challenges students to write and refine difficult problems consistent…

Code & Models

Repositories

JamesRoggeveen/hardmath2_eval (Official)

Taxonomy

Topics: Model Reduction and Neural Networks · Mathematics, Computing, and Information Processing · Machine Learning in Materials Science