Data Difficulty and the Generalization–Extrapolation Tradeoff in LLM Fine-Tuning

Siyuan Liu (IIIS, Tsinghua University), Tinghong Chen (College of AI, Tsinghua University, Shanghai Qi Zhi Institute), Xinghan Li (IIIS, Tsinghua University), Yifei Wang (Amazon AGI SF Lab), Jingzhao Zhang (IIIS, Tsinghua University, Shanghai Qi Zhi Institute)

arXiv:2605.12906·cs.LG·May 14, 2026

TL;DR

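This paper studies how data difficulty affects supervised fine-tuning (SFT) of large language models. It finds no universally optimal difficulty level: for a fixed data budget there is an optimal difficulty, and that optimum shifts toward harder data as the budget grows. The authors attribute this to the interplay between the generalization gap and the extrapolation gap.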

Contribution

A systematic empirical and theoretical analysis of how data difficulty and dataset size jointly determine fine-tuning performance in LLMs, supported by controlled synthetic experiments.

Findings

1. The optimal data difficulty depends on dataset size.
2. Harder data becomes more beneficial as the data budget increases.
3. The interplay between the generalization gap and the extrapolation gap explains this shift.

Abstract

Data selection during supervised fine-tuning (SFT) can critically change the behavior of large language models (LLMs). Although existing work has studied the effect of selecting data based on heuristics such as perplexity, difficulty, or length, the reported findings are often inconsistent or context-dependent. In this work, we systematically study the role of data difficulty in fine-tuning from both empirical and theoretical perspectives, and find that there is no universally optimal difficulty level; rather, its effectiveness depends on the dataset size. We show that for a fixed data budget, there exists an optimal data difficulty for SFT, and that this optimal difficulty shifts toward harder data as the data budget increases. To explain this phenomenon, we conduct controlled synthetic experiments that reveal a simple underlying mechanism: the interplay between the (in-distribution)…
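As a rough illustration of what difficulty-based selection under a fixed data budget could look like, the sketch below picks the examples whose difficulty score lies closest to a chosen target. It is not the authors' procedure: the function name `select_by_difficulty`, the `target` parameter, and the difficulty scores (imagined here as something like one minus a base model's pass rate) are all assumptions made for this example.

```python
from typing import List, Sequence


def select_by_difficulty(
    difficulties: Sequence[float],
    budget: int,
    target: float,
) -> List[int]:
    """Return indices of the `budget` examples closest to `target` difficulty.

    Hypothetical helper for illustration only; the scoring scheme and the
    nearest-to-target rule are assumptions, not the paper's method.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Rank example indices by distance to the target difficulty.
    # sorted() is stable, so ties are broken by original index.
    ranked = sorted(
        range(len(difficulties)),
        key=lambda i: abs(difficulties[i] - target),
    )
    # Keep at most `budget` examples and return them in dataset order.
    return sorted(ranked[:budget])


# Hypothetical difficulty scores in [0, 1]; higher means harder.
scores = [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]

# Same budget, two different target difficulties.
easy_subset = select_by_difficulty(scores, budget=3, target=0.2)
hard_subset = select_by_difficulty(scores, budget=3, target=0.8)
```

Under the paper's stated finding, one would expect the better choice of `target` to move toward harder examples as `budget` grows; the sketch only shows the selection step and makes no claim about where that optimum lies.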
