On the (Mis)Use of Machine Learning with Panel Data
Augusto Cerqua, Marco Letta, Gabriele Pinto

TL;DR
This paper systematically examines data leakage in machine learning with panel data, showing how neglecting the cross-sectional and longitudinal structure of the data can inflate out-of-sample performance metrics and overstate real-world applicability, and provides practical guidelines for correct implementation.
Contribution
It offers the first systematic assessment of data leakage issues in machine learning on panel data and proposes empirical guidelines for practitioners.
Findings
Data leakage causes inflated performance metrics.
Neglecting data structure leads to overestimated model usefulness.
Guidelines improve model validity in panel data applications.
Abstract
We provide the first systematic assessment of data leakage issues in the use of machine learning on panel data. Our organizing framework clarifies why neglecting the cross-sectional and longitudinal structure of these data leads to hard-to-detect data leakage, inflated out-of-sample performance, and an inadvertent overestimation of the real-world usefulness and applicability of machine learning models. We then offer empirical guidelines for practitioners to ensure the correct implementation of supervised machine learning in panel data environments. An empirical application, using data from over 3,000 U.S. counties spanning 2000-2019 and focused on income prediction, illustrates the practical relevance of these points across nearly 500 models for both classification and regression tasks.
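The abstract does not spell out the guidelines themselves, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes a hypothetical toy panel of (unit, year) pairs and contrasts a naive random split with two structure-aware alternatives: a temporal split (train strictly before a cutoff year) and a unit-level split (all observations of a unit stay on one side). All function names here are illustrative.

```python
import random

# Hypothetical toy panel: 5 units observed yearly over 2000-2009.
panel = [(unit, year) for unit in range(5) for year in range(2000, 2010)]

def random_split(rows, test_frac=0.2, seed=0):
    """Naive split that ignores panel structure (the leakage-prone baseline)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def temporal_split(rows, cutoff_year):
    """Forecasting-style split: train strictly before the cutoff, test from it on."""
    train = [r for r in rows if r[1] < cutoff_year]
    test = [r for r in rows if r[1] >= cutoff_year]
    return train, test

def unit_split(rows, test_units):
    """Cross-sectional split: every observation of a unit lands on one side only."""
    train = [r for r in rows if r[0] not in test_units]
    test = [r for r in rows if r[0] in test_units]
    return train, test

def has_temporal_leak(train, test):
    """True if any training row is dated at or after the earliest test row."""
    return max(year for _, year in train) >= min(year for _, year in test)

def has_unit_overlap(train, test):
    """True if at least one unit appears in both the training and the test set."""
    return bool({unit for unit, _ in train} & {unit for unit, _ in test})

if __name__ == "__main__":
    for name, (train, test) in {
        "random": random_split(panel),
        "temporal": temporal_split(panel, cutoff_year=2008),
        "unit": unit_split(panel, test_units={4}),
    }.items():
        print(name, has_temporal_leak(train, test), has_unit_overlap(train, test))
```

Which split is appropriate depends on the prediction task: a temporal split mimics forecasting future periods, while a unit-level split mimics predicting for units never seen in training. The random split satisfies neither, which is one way the inflated out-of-sample performance described in the abstract can arise.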
Taxonomy
Topics: Spatial and Panel Data Analysis
Methods: ALIGN
