On the (Mis)Use of Machine Learning with Panel Data

Augusto Cerqua; Marco Letta; Gabriele Pinto

arXiv:2411.09218·econ.EM·May 6, 2025

Open Access

TL;DR

This paper systematically examines data leakage in machine learning with panel data, showing how neglecting the cross-sectional and longitudinal structure of the data inflates performance metrics and overstates real-world applicability. It also provides practical guidelines for correct implementation.

Contribution

It offers the first comprehensive assessment of data leakage issues in panel data machine learning and proposes empirical guidelines for practitioners.

Findings

1. Data leakage causes inflated performance metrics.
2. Neglecting data structure leads to overestimated model usefulness.
3. Guidelines improve model validity in panel data applications.

Abstract

We provide the first systematic assessment of data leakage issues in the use of machine learning on panel data. Our organizing framework clarifies why neglecting the cross-sectional and longitudinal structure of these data leads to hard-to-detect data leakage, inflated out-of-sample performance, and an inadvertent overestimation of the real-world usefulness and applicability of machine learning models. We then offer empirical guidelines for practitioners to ensure the correct implementation of supervised machine learning in panel data environments. An empirical application, using data from over 3,000 U.S. counties spanning 2000-2019 and focused on income prediction, illustrates the practical relevance of these points across nearly 500 models for both classification and regression tasks.
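
To make the leakage mechanism concrete, here is a minimal sketch in Python that compares a row-level random split with a split at a time cutoff on a toy unit-by-year panel. The function names, the three toy units, the 2014 cutoff, and the leakage check are illustrative assumptions and are not taken from the paper; the sketch covers only the longitudinal dimension, not the cross-sectional one.

```python
import random

def make_panel(units, years):
    """Build a toy panel: one (unit, year) row per combination."""
    return [(u, y) for u in units for y in years]

def random_split(rows, test_share, seed=0):
    """Row-level random split that ignores the panel structure."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]

def temporal_split(rows, cutoff_year):
    """Train on years up to and including cutoff_year, test on later years."""
    train = [r for r in rows if r[1] <= cutoff_year]
    test = [r for r in rows if r[1] > cutoff_year]
    return train, test

def has_temporal_leakage(train, test):
    """True if any training row is dated at or after the earliest test year."""
    if not train or not test:
        return False
    first_test_year = min(year for _, year in test)
    return any(year >= first_test_year for _, year in train)

# Hypothetical toy panel: 3 units observed yearly from 2000 to 2019.
panel = make_panel(units=["A", "B", "C"], years=range(2000, 2020))
rand_train, rand_test = random_split(panel, test_share=0.25)
time_train, time_test = temporal_split(panel, cutoff_year=2014)

print("random split leaks:", has_temporal_leakage(rand_train, rand_test))
print("temporal split leaks:", has_temporal_leakage(time_train, time_test))
```

By construction, the time-based split never places a training row at or after the first test year, whereas a random split over rows almost always does. In a real application the cutoff would be chosen to match the forecasting horizon of interest.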

Taxonomy

Topics: Spatial and Panel Data Analysis

Methods: ALIGN