HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics

Jingxuan Fan; Sarah Martinson; Erik Y. Wang; Kaylie Hausknecht; Jonah Brenner; Danxian Liu; Nianli Peng; Corey Wang; Michael P. Brenner

arXiv:2410.09988 · cs.LG · December 17, 2024

TL;DR

HARDMath is a new benchmark dataset designed to evaluate large language models on challenging graduate-level applied mathematics problems requiring reasoning, approximation, and judgment, revealing current models' limitations.

Contribution

We introduce HARDMath, a novel dataset of complex applied mathematics problems inspired by graduate courses, with solutions validated against numerical ground truths, to evaluate and improve LLM performance.

Findings

01

LLMs perform poorly on HARDMath: even GPT-4 reaches only 43.8% overall accuracy with few-shot Chain-of-Thought prompting.

02

Current models struggle more with these advanced applied mathematics problems than with the problems in existing benchmarks.

03

Error analysis highlights specific reasoning and approximation challenges for LLMs.

Abstract

Advanced applied mathematics problems are underrepresented in existing Large Language Model (LLM) benchmark datasets. To address this, we introduce HARDMath, a dataset inspired by a graduate course on asymptotic methods, featuring challenging applied mathematics problems that require analytical approximation techniques. These problems demand a combination of mathematical reasoning, computational tools, and subjective judgment, making them difficult for LLMs. Our framework auto-generates a large number of problems with solutions validated against numerical ground truths. We evaluate both open- and closed-source LLMs on HARDMath-mini, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts. Even leading closed-source models like GPT-4 achieve only 43.8% overall accuracy with few-shot Chain-of-Thought prompting, and all models…
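To make "validated against numerical ground truths" concrete, here is a minimal sketch of the idea on a toy problem. It is not taken from the paper or its repository: the equation, the function names (`exact_roots`, `asymptotic_roots`, `validate`), and the tolerance are all illustrative assumptions. The sketch approximates the roots of eps*x^2 + x - 1 = 0 for small eps by dominant balance, then checks each approximation against the root computed numerically.

```python
import math


def exact_roots(eps):
    """Numerical ground truth: both roots of eps*x**2 + x - 1 = 0 (eps > 0)."""
    disc = math.sqrt(1.0 + 4.0 * eps)
    # Large-magnitude root from the quadratic formula.
    singular = (-1.0 - disc) / (2.0 * eps)
    # Small root from the product of roots (-1/eps), avoiding cancellation.
    regular = 2.0 / (1.0 + disc)
    return regular, singular


def asymptotic_roots(eps):
    """Leading-order dominant-balance approximations for small eps."""
    regular = 1.0            # balance x - 1 = 0, neglecting eps*x**2
    singular = -1.0 / eps    # balance eps*x**2 + x = 0, neglecting the constant
    return regular, singular


def relative_error(approx, truth):
    return abs(approx - truth) / abs(truth)


def validate(eps, tol):
    """Accept the analytical answer only if every root is within tol."""
    pairs = zip(asymptotic_roots(eps), exact_roots(eps))
    return all(relative_error(a, t) <= tol for a, t in pairs)


if __name__ == "__main__":
    for eps in (0.001, 0.01, 1.0):
        print(eps, validate(eps, tol=0.05))
```

In this toy setup the leading-order approximations pass the check when eps is small and fail once eps is of order one, which is the kind of regime-dependent judgment the abstract describes.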

Code & Models

Repositories

sarahmart/hardmath (Official)

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning

Methods: Attention Is All You Need · Dropout · Layer Normalization · Adam · Dense Connections · Residual Connection · Position-Wise Feed-Forward Layer · Linear Layer · Byte Pair Encoding · Absolute Position Encodings