Beyond Random Sampling: Instance Quality-Based Data Partitioning via Item Response Theory

Lucas Cardoso; Vitor Santos; José Ribeiro Filho; Ricardo Prudêncio; Regiane Kawasaki; Ronnie Alves

arXiv:2508.10628·cs.LG·August 15, 2025

TL;DR

This paper proposes a data partitioning method for ML model validation that uses Item Response Theory (IRT) parameters to account for the quality of each instance. The resulting partitions expose heterogeneity in the data and help clarify the models' bias-variance tradeoff.

Contribution

The paper uses IRT parameters to characterize instances and guide dataset partitioning at the model validation stage, then evaluates how these partitions affect the performance of several ML models on four tabular datasets.

Findings

01

IRT reveals data heterogeneity and informative subgroups.

02

IRT-balanced partitions help clarify the models' bias-variance tradeoff.

03

High-guessing instances can impair model accuracy significantly.
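
The third finding can be illustrated with a minimal sketch that separates instances by their guessing parameter, so a model can be trained with and without the high-guessing group and the two results compared. The function name `split_by_guessing` and the 0.5 threshold are hypothetical and chosen only for illustration; the summary above does not state which cutoff the authors used, and the guessing values are assumed to have been estimated already.

```python
import numpy as np

def split_by_guessing(guessing, threshold=0.5):
    """Return (low, high) index arrays based on the IRT guessing parameter.

    `threshold` is a hypothetical cutoff, not a value taken from the paper.
    """
    guessing = np.asarray(guessing, dtype=float)
    high = np.flatnonzero(guessing > threshold)   # instances often "guessed" correctly
    low = np.flatnonzero(guessing <= threshold)   # remaining instances
    return low, high

# Example with made-up guessing values for four instances.
low_idx, high_idx = split_by_guessing([0.1, 0.7, 0.5, 0.9])
```

Returning index arrays rather than filtered data keeps the helper independent of how features and labels are stored.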

Abstract

Robust validation of Machine Learning (ML) models is essential, but traditional data partitioning approaches often ignore the intrinsic quality of each instance. This study proposes using Item Response Theory (IRT) parameters to characterize instances and guide dataset partitioning at the model validation stage. The impact of IRT-informed partitioning strategies on the performance of several ML models was evaluated on four tabular datasets. The results show that IRT reveals an inherent heterogeneity among instances and highlights informative subgroups of instances within the same dataset. Based on IRT, balanced partitions were created that consistently help to better understand the models' bias-variance tradeoff. In addition, the guessing parameter proved to be a determining factor: training with high-guessing instances can…
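
As a rough illustration of the idea in the abstract, the sketch below shows the three-parameter logistic (3PL) item characteristic curve, which is the standard IRT model that includes a guessing parameter, and one possible way to build a difficulty-balanced split. It assumes the per-instance difficulty values have already been estimated elsewhere. The function names, the quantile binning, and the default values are assumptions made for this example and are not taken from the paper, whose actual partitioning procedure may differ.

```python
import numpy as np

def p_correct_3pl(theta, a, b, c):
    """3PL curve: probability that a respondent (here, a model) with ability
    `theta` answers an item (here, an instance) correctly, given
    discrimination `a`, difficulty `b`, and guessing `c`."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def balanced_split(difficulty, test_fraction=0.2, n_bins=4, seed=0):
    """Hypothetical helper: split instance indices so that train and test
    contain a similar mix of difficulty levels."""
    rng = np.random.default_rng(seed)
    difficulty = np.asarray(difficulty, dtype=float)
    # Interior quantile edges give bins with roughly equal instance counts.
    edges = np.quantile(difficulty, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(difficulty, edges)
    train, test = [], []
    for k in range(n_bins):
        # Shuffle within each difficulty bin, then carve off the test share.
        idx = rng.permutation(np.flatnonzero(bins == k))
        n_test = int(round(test_fraction * len(idx)))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return np.sort(np.array(train, dtype=int)), np.sort(np.array(test, dtype=int))

# Example with made-up difficulty estimates for 40 instances.
difficulty = np.linspace(-2.0, 2.0, 40)
train_idx, test_idx = balanced_split(difficulty)
```

Stratifying on binned difficulty is only one way to balance a partition; the same pattern could be applied to discrimination or guessing, or to a combination of the three parameters.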
