HARP: A challenging human-annotated math reasoning benchmark

Albert S. Yue; Lovish Madaan; Ted Moskovitz; DJ Strouse; Aaditya K. Singh

arXiv:2412.08819·cs.LG·December 13, 2024

TL;DR

HARP is a new challenging math reasoning benchmark with 5,409 human-annotated problems from US math competitions, designed to evaluate and push the limits of large language models' math reasoning abilities.

Contribution

The paper introduces HARP, a comprehensive, human-annotated math reasoning dataset with multiple difficulty levels and solutions, addressing the saturation of existing benchmarks.

Findings

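01. Frontier models perform relatively poorly on the hardest bracket of 197 problems (average accuracy 41.1% for o1-mini and 9.6% for Gemini 1.5 Pro).

02. Models scale their inference-time compute with problem difficulty.

03. HARP's multiple-choice options (for 4,110 problems) and human-written solutions (an average of two per problem) open new research avenues in math reasoning.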

Abstract

Math reasoning is becoming an ever-increasing area of focus as we scale large language models. However, even the previously toughest evals like MATH are now close to saturated by frontier models (90.0% for o1-mini and 86.5% for Gemini 1.5 Pro). We introduce HARP, Human Annotated Reasoning Problems (for Math), consisting of 5,409 problems from the US national math competitions (A(J)HSME, AMC, AIME, USA(J)MO). Of these, 4,780 have answers that are automatically checkable (with libraries such as SymPy). These problems span six difficulty levels, with frontier models performing relatively poorly on the hardest bracket of 197 problems (average accuracy 41.1% for o1-mini, and 9.6% for Gemini 1.5 Pro). Our dataset also features multiple choices (for 4,110 problems) and an average of two human-written, ground-truth solutions per problem, offering new avenues of research that we explore…
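To illustrate what "automatically checkable" can mean in practice, here is a minimal sketch of a SymPy-based equivalence check. It is not the paper's actual checker. The helper name `answers_match` is hypothetical, and the sketch assumes both answers arrive as plain strings that SymPy can parse.

```python
from sympy import SympifyError, simplify, sympify


def answers_match(predicted: str, reference: str) -> bool:
    """Hypothetical helper: True if two answer strings denote the same value.

    Assumes both strings are SymPy-parsable expressions. Note that sympify
    evaluates its input, so it should not be fed untrusted text.
    """
    try:
        pred_expr = sympify(predicted)
        ref_expr = sympify(reference)
    except (SympifyError, TypeError):
        # Unparsable input: fall back to a whitespace-insensitive string match.
        return predicted.strip() == reference.strip()
    # Two expressions are treated as equal when their difference simplifies to zero.
    return simplify(pred_expr - ref_expr) == 0


print(answers_match("1/2", "2/4"))
print(answers_match("sqrt(8)", "2*sqrt(2)"))
print(answers_match("3", "4"))
```

A check of this kind accepts differently written but equivalent answers, which a plain string comparison would reject. Answers that are not closed-form expressions, such as proofs, would fall outside what it can grade.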

Code & Models

Repositories

aadityasingh/harp (Official)

Datasets

Intelligent-Internet/II-Thought-RL-v0
dataset · 299 dl

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning

Methods: Focus