Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization

Mucong Ding; Chenghao Deng; Jocelyn Choo; Zichu Wu; Aakriti Agrawal; Avi Schwarzschild; Tianyi Zhou; Tom Goldstein; John Langford; Anima Anandkumar; Furong Huang

arXiv:2409.18433 · cs.LG · June 10, 2025 · 3 citations

TL;DR

Easy2Hard-Bench introduces a standardized collection of benchmark datasets with difficulty labels across various domains, enabling systematic profiling of LLM performance and generalization from easy to hard problems.

Contribution

The paper presents a new benchmark collection with fine-grained difficulty annotations and applies established ranking systems to evaluate LLMs across difficulty levels.

Findings

1. State-of-the-art LLMs show varied performance across difficulty levels.
2. Datasets contain a higher proportion of challenging problems than previous collections.
3. Analysis reveals insights into LLM generalization capabilities.

Abstract

While generalization over tasks from easy to hard is crucial for profiling large language models (LLMs), datasets with fine-grained difficulty annotations for each problem across a broad range of complexity are still lacking. To address this limitation, we present Easy2Hard-Bench, a consistently formatted collection of 6 benchmark datasets spanning various domains, such as mathematics and programming problems, chess puzzles, and reasoning questions. Each problem within these datasets is annotated with a numerical difficulty score. To systematically estimate problem difficulties, we collect abundant performance data on attempts at each problem by humans in the real world or by LLMs on prominent leaderboards. Leveraging the rich performance data, we apply well-established difficulty ranking systems, such as Item Response Theory (IRT) and Glicko-2 models, to uniformly assign numerical…
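The abstract names Item Response Theory as one of the ranking systems used to turn attempt data into difficulty scores. As a rough illustration only, and not the authors' implementation, the sketch below fits the simplest IRT variant, a one-parameter logistic (Rasch) model, where the probability that solver s answers item i correctly is sigmoid(ability_s - difficulty_i). The function name `fit_rasch`, the toy response data, and the learning rate, iteration count, and L2 penalty are all assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(responses, n_solvers, n_items, n_iters=2000, lr=0.1, l2=0.01):
    """Fit a 1PL (Rasch) IRT model by gradient ascent on the log-likelihood.

    responses: list of (solver_index, item_index, correct), correct in {0, 1}.
    Returns (abilities, difficulties); a larger difficulty means a harder item.
    The small L2 penalty (an assumption of this sketch) keeps estimates finite
    and fixes the otherwise arbitrary origin of the scale.
    """
    theta = [0.0] * n_solvers  # solver abilities
    b = [0.0] * n_items        # item difficulties
    for _ in range(n_iters):
        g_theta = [-l2 * t for t in theta]
        g_b = [-l2 * d for d in b]
        for s, i, y in responses:
            # Residual between the observed outcome and the predicted probability.
            err = y - sigmoid(theta[s] - b[i])
            g_theta[s] += err
            g_b[i] -= err
        theta = [t + lr * g for t, g in zip(theta, g_theta)]
        b = [d + lr * g for d, g in zip(b, g_b)]
    return theta, b


# Hypothetical toy data: 3 solvers x 3 items.
# Item 0 is solved by everyone, item 1 by two solvers, item 2 by one.
toy = [
    (0, 0, 1), (0, 1, 1), (0, 2, 1),
    (1, 0, 1), (1, 1, 1), (1, 2, 0),
    (2, 0, 1), (2, 1, 0), (2, 2, 0),
]
abilities, difficulties = fit_rasch(toy, n_solvers=3, n_items=3)
print(difficulties)
```

On this toy data the estimated difficulties increase from item 0 to item 2, mirroring how often each item was solved. Real attempt logs are far sparser and noisier, which is where richer models (for example, IRT with more parameters, or rating systems such as Glicko-2) become relevant.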

Code & Models

Datasets

furonghuang-lab/Easy2Hard-Bench
dataset · 613 downloads

Taxonomy

Topics: Natural Language Processing Techniques