Revisiting Generalization Across Difficulty Levels: It's Not So Easy
Yeganeh Kordi, Nihal V. Nayak, Max Zuo, Ilana Nguyen, Stephen H. Bach

TL;DR
This paper systematically evaluates how large language models generalize across task difficulty levels. It finds that training on only easy or only hard data often fails to yield consistent improvements, which underscores the need for data spanning a range of difficulty levels.
Contribution
It introduces a difficulty ranking for dataset examples that is derived from the outputs of many LLMs via Item Response Theory (IRT) rather than from human judgments, enabling a finer-grained analysis of LLM generalization across difficulty levels.
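As a rough illustration of how IRT can turn model outputs into a difficulty ranking, the sketch below fits a one-parameter (Rasch) model to a binary correctness matrix and then splits the items into equal-sized difficulty groups. This summary does not say which IRT variant or fitting procedure the authors use, so the Rasch model, the gradient-based fit, the weak L2 penalty, and the names fit_rasch and difficulty_bins are all assumptions made for illustration, not the paper's method.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(responses, n_iters=500, lr=0.1, l2=0.01):
    """Fit a Rasch (1PL) model: P(correct) = sigmoid(ability - difficulty).

    responses[m][i] is 1 if model m answered item i correctly, else 0.
    Hypothetical helper: scaled gradient ascent on the log-likelihood,
    with a weak L2 penalty (an assumption) to keep estimates finite.
    """
    n_models, n_items = len(responses), len(responses[0])
    ability = [0.0] * n_models
    difficulty = [0.0] * n_items
    for _ in range(n_iters):
        grad_a = [0.0] * n_models
        grad_d = [0.0] * n_items
        for m in range(n_models):
            for i in range(n_items):
                residual = responses[m][i] - sigmoid(ability[m] - difficulty[i])
                grad_a[m] += residual   # d logL / d ability[m]
                grad_d[i] -= residual   # d logL / d difficulty[i]
        ability = [a + lr * (g / n_items - l2 * a)
                   for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * (g / n_models - l2 * d)
                      for d, g in zip(difficulty, grad_d)]
        # Only ability - difficulty is identified, so shift both scales
        # by the same amount to keep the mean difficulty at zero.
        shift = sum(difficulty) / n_items
        ability = [a - shift for a in ability]
        difficulty = [d - shift for d in difficulty]
    return ability, difficulty


def difficulty_bins(difficulty, n_bins):
    """Split item indices into n_bins groups, ordered easiest to hardest."""
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    size = len(order) / n_bins
    return [order[round(b * size):round((b + 1) * size)]
            for b in range(n_bins)]


# Toy data (made up): 4 models x 3 items; item 0 is solved most often.
toy_responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
toy_ability, toy_difficulty = fit_rasch(toy_responses)
print(difficulty_bins(toy_difficulty, n_bins=3))
```

Because the ranking depends only on which models answered which items correctly, no human difficulty labels enter the estimate. A real analysis would more likely use an established IRT package and many more models than this toy matrix.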
Findings
Cross-difficulty generalization is often limited.
Training on easy or hard data alone does not guarantee improvements.
Data spanning a range of difficulty levels is crucial for effective LLM training and evaluation.
Abstract
We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs' generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that…
Taxonomy
Topics: Text Readability and Simplification · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning
