What properties of reasoning supervision are associated with improved downstream model quality?
Mikołaj Langner, Dzmitry Pihulski, Jan Eliasz, Michał Rajkowski, Przemysław Kazienko, Maciej Piasecki, Jan Kocoń, Teddy Ferdinan

TL;DR
This paper explores intrinsic data metrics that can predict the usefulness of reasoning datasets for training large language models, reducing the need for costly trial-and-error validation.
Contribution
It introduces a set of quantitative measures that correlate with downstream performance and reveals scale-dependent differences in data utility predictors.
Findings
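Intrinsic data metrics show strong, statistically significant correlations with downstream model performance.
For smaller models, alignment-focused metrics are the useful predictors of data utility.
For larger models, high redundancy helps: they use verbose traces to solve complex tasks.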
Abstract
Validating training data for reasoning models typically requires expensive trial-and-error fine-tuning cycles. In this work, we investigate whether the utility of a reasoning dataset can be reliably predicted prior to training using intrinsic data metrics. We propose a suite of quantitative measures and evaluate their predictive power by fine-tuning 8B and 11B models on semantically distinct variants of a Polish reasoning dataset. Our analysis reveals that these intrinsic metrics demonstrate strong and significant correlations with downstream model performance. Crucially, we find that the predictors of utility are scale-dependent: smaller models rely on alignment-focused metrics to ensure precision, whereas larger models benefit from high redundancy, utilizing verbose traces to solve complex tasks. These findings establish a scale-aware framework for validating reasoning data, enabling…
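The evaluation idea in the abstract, checking whether an intrinsic metric computed before training tracks downstream performance across dataset variants, can be illustrated with a rank correlation. The sketch below is a minimal illustration, not the authors' method: the choice of Spearman correlation, the function names, and all numbers are assumptions introduced here, and the values are placeholders rather than results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical placeholder values, NOT results from the paper:
# one intrinsic metric value and one benchmark score per dataset variant.
metric_per_variant = [0.12, 0.35, 0.41, 0.58, 0.77]
score_per_variant = [41.0, 44.5, 43.9, 49.2, 52.3]

rho = spearman(metric_per_variant, score_per_variant)
print(round(rho, 3))  # 0.9 for these placeholder values
```

Running the same computation separately for each model size would be one way to surface the scale dependence the abstract describes, since a metric that ranks variants well for one model may rank them poorly for another.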
