Revisiting Generalization Across Difficulty Levels: It's Not So Easy

Yeganeh Kordi; Nihal V. Nayak; Max Zuo; Ilana Nguyen; Stephen H. Bach

arXiv:2511.21692·cs.CL·November 27, 2025

TL;DR

This paper systematically evaluates how well large language models generalize across task difficulty levels. It finds that training on only easy or only hard data often fails to produce consistent improvements, which underscores the need for data that covers diverse difficulty levels.

Contribution

The paper ranks dataset examples by difficulty using the outputs of many LLMs and Item Response Theory (IRT), without relying on human opinions of difficulty. This ranking supports a more objective, finer-grained analysis of LLM generalization across difficulty levels.

Findings

01. Cross-difficulty generalization is often limited.

02. Training on easy or hard data alone does not guarantee improvements.

03. Diverse difficulty data is crucial for effective LLM training and evaluation.

Abstract

We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs' generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that…
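
To make the IRT-based ranking idea concrete, here is a minimal sketch of how item difficulty can be estimated from a binary matrix of model responses. It assumes a one-parameter (Rasch) model fit by plain gradient ascent; the abstract does not say which IRT variant or fitting procedure the authors use, so the function name `fit_rasch`, the hyperparameters, and the toy response matrix are all hypothetical and for illustration only.

```python
import numpy as np


def fit_rasch(responses, n_iters=500, lr=0.5):
    """Fit a 1PL (Rasch) IRT model by gradient ascent (illustrative sketch).

    responses: array of shape (n_models, n_items); 1 = correct, 0 = incorrect.
    Returns (ability, difficulty), one value per model and per item.
    """
    responses = np.asarray(responses, dtype=float)
    n_models, n_items = responses.shape
    ability = np.zeros(n_models)
    difficulty = np.zeros(n_items)

    for _ in range(n_iters):
        # Rasch model: P(correct) = sigmoid(ability - difficulty)
        logits = ability[:, None] - difficulty[None, :]
        p_correct = 1.0 / (1.0 + np.exp(-logits))
        residual = responses - p_correct

        # Gradient of the Bernoulli log-likelihood, averaged for a stable step.
        ability += lr * residual.mean(axis=1)
        difficulty -= lr * residual.mean(axis=0)

        # Only ability - difficulty is identified, so shift both by the same
        # amount to keep the mean difficulty at zero.
        shift = difficulty.mean()
        difficulty -= shift
        ability -= shift

    return ability, difficulty


# Hypothetical toy data: 6 models (rows) answering 4 items (columns).
# Items are answered correctly by 5, 4, 2, and 1 models respectively.
toy_responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
])

ability, difficulty = fit_rasch(toy_responses)
easiest_to_hardest = np.argsort(difficulty)
print(easiest_to_hardest)
```

On this toy matrix the ranking runs from item 0 (easiest) to item 3 (hardest), matching the number of models that answer each item correctly. In a Rasch model the item's total correct count determines its estimated difficulty; richer IRT models with discrimination parameters can rank items differently.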

Taxonomy

Topics: Text Readability and Simplification · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning