Evaluating the Impact of Data Augmentation on Predictive Model Performance

Valdemar Švábenský, Conrad Borchers, Elizabeth B. Cloude, Atsushi Shimada

arXiv:2412.02108 · cs.LG · December 4, 2024

TL;DR

This study systematically compares data augmentation techniques for learning analytics. Sampling methods such as SMOTE-ENN improve predictive performance (AUC) and reduce training time, while some other techniques decrease performance.

Contribution

The paper provides empirical evidence that, in learning analytics, sampling-based augmentation techniques are more reliable and computationally efficient than deep generation methods.

Findings

01

SMOTE-ENN sampling improves the average AUC by 0.01 and approximately halves training time compared to the baseline models.

02

Adding noise to SMOTE-ENN yields a small but statistically significant performance boost.

03

Some augmentation techniques can decrease predictive performance or increase variability.
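
The findings above are reported in terms of AUC. As a reminder of what that number measures, here is a minimal sketch in plain Python: AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not taken from the paper.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.

    Hypothetical helper for illustration; quadratic in the number of
    examples, so only suitable for small inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (made up): two negatives, two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

On this scale, an improvement of 0.01 means roughly one additional correctly ordered positive/negative pair per hundred.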

Abstract

In supervised machine learning (SML) research, large training datasets are essential for valid results. However, obtaining primary data in learning analytics (LA) is challenging. Data augmentation can address this by expanding and diversifying data, though its use in LA remains underexplored. This paper systematically compares data augmentation techniques and their impact on prediction performance in a typical LA task: prediction of academic outcomes. Augmentation is demonstrated on four SML models, which we successfully replicated from a previous LAK study based on AUC values. Among 21 augmentation techniques, SMOTE-ENN sampling performed the best, improving the average AUC by 0.01 and approximately halving the training time compared to the baseline models. In addition, we compared 99 combinations of chaining 21 techniques, and found minor, although statistically significant,…
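
To make the best-performing technique concrete, the following is a minimal, standard-library-only sketch of the idea behind SMOTE-ENN for a binary task: SMOTE first adds synthetic minority examples by interpolating between minority neighbors, then Edited Nearest Neighbours (ENN) removes examples whose label disagrees with their neighbors. It is not the authors' implementation. The function names, the toy data, the choices of `k`, and the majority-vote variant of ENN are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def smote(X, y, minority, k=2, rng=None):
    """Oversample `minority` until both classes have equal size (binary case).

    Each synthetic point lies on the segment between a random minority
    point and one of its k nearest minority neighbors.
    """
    rng = rng or random.Random(0)
    mino = [x for x, label in zip(X, y) if label == minority]
    if len(mino) < 2:
        raise ValueError("SMOTE needs at least two minority examples")
    n_new = (len(y) - len(mino)) - len(mino)  # majority count minus minority count
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        i = rng.randrange(len(mino))
        base = mino[i]
        # Rank the other minority points by distance to the base point.
        others = sorted((j for j in range(len(mino)) if j != i),
                        key=lambda j: math.dist(base, mino[j]))
        partner = mino[rng.choice(others[:k])]
        gap = rng.random()  # interpolation factor in [0, 1)
        X_out.append(tuple(b + gap * (p - b) for b, p in zip(base, partner)))
        y_out.append(minority)
    return X_out, y_out


def enn(X, y, k=3):
    """Keep only points whose label matches the majority of their k nearest neighbors."""
    keep_X, keep_y = [], []
    for i, (x, label) in enumerate(zip(X, y)):
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(x, X[j]))
        votes = Counter(y[j] for j in others[:k])
        if votes.most_common(1)[0][0] == label:
            keep_X.append(x)
            keep_y.append(label)
    return keep_X, keep_y


# Toy data (made up): class 0 clusters near the origin, class 1 near (5, 5).
# The last class-0 point sits inside the class-1 region and acts as label noise.
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.4, 0.0), (0.0, 0.4),
     (0.2, 0.5), (5.1, 5.0),
     (5.0, 5.0), (5.2, 5.1), (4.9, 5.2)]
y = [0] * 8 + [1] * 3

X_bal, y_bal = smote(X, y, minority=1)   # step 1: balance the classes
X_clean, y_clean = enn(X_bal, y_bal)     # step 2: remove points that disagree with neighbors
print(Counter(y), Counter(y_bal), Counter(y_clean))
```

On this toy set, SMOTE brings class 1 up from 3 to 8 examples, and ENN then drops the single class-0 point that sits inside the class-1 cluster. For real work, a maintained implementation such as `SMOTEENN` from the imbalanced-learn library (`imblearn.combine`) would likely be a better choice than this sketch; note that its default cleaning rule may differ from the majority vote used here.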
