Evaluating the Impact of Data Augmentation on Predictive Model Performance
Valdemar Švábenský, Conrad Borchers, Elizabeth B. Cloude, Atsushi Shimada

TL;DR
This study systematically evaluates data augmentation techniques in learning analytics. Sampling methods such as SMOTE-ENN improve predictive performance (AUC) and reduce training time, while some other techniques can decrease performance.
Contribution
It provides empirical evidence that sampling-based augmentation techniques are more reliable and computationally efficient than deep generation methods in learning analytics.
Findings
SMOTE-ENN sampling improves the average AUC by 0.01 and approximately halves training time compared to the baseline models.
Adding noise to SMOTE-ENN yields a small but statistically significant performance boost.
Some augmentation techniques can decrease predictive performance or increase variability.
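To make the SMOTE-ENN finding concrete, here is a minimal pure-Python sketch of the two stages for a binary task: SMOTE oversamples the minority class by interpolating between nearby minority points, and Edited Nearest Neighbours (ENN) then removes samples that disagree with their neighbourhood. This is an illustration under stated assumptions, not the paper's implementation: the function names `smote` and `enn`, the toy data, the choice of k=3, and the majority-vote removal rule are all assumptions made for this example.

```python
import math
import random


def smote(X, y, minority, k=3, seed=0):
    """Oversample the `minority` label until both classes have equal size.

    Each synthetic point lies on the segment between a random minority
    sample and one of its k nearest minority neighbours.
    Assumes a binary task with at least two minority samples.
    """
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(X) - len(minority_pts)) - len(minority_pts)
    X_out, y_out = list(X), list(y)
    for _ in range(max(0, n_needed)):
        i = rng.randrange(len(minority_pts))
        base = minority_pts[i]
        others = minority_pts[:i] + minority_pts[i + 1:]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        X_out.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
        y_out.append(minority)
    return X_out, y_out


def enn(X, y, k=3):
    """Keep only samples whose label matches most of their k nearest neighbours.

    Simplified majority-vote variant (an assumption of this sketch);
    library implementations offer stricter selection rules.
    """
    keep_X, keep_y = [], []
    for i, (x, label) in enumerate(zip(X, y)):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(x, X[j]))[:k]
        agree = sum(1 for j in nearest if y[j] == label)
        if 2 * agree > len(nearest):
            keep_X.append(x)
            keep_y.append(label)
    return keep_X, keep_y


# Toy, made-up data: six majority (0) and three minority (1) samples.
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4), (0.4, 0.1),
     (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_bal, y_bal = smote(X, y, minority=1)   # stage 1: balance the classes
X_clean, y_clean = enn(X_bal, y_bal)     # stage 2: drop noisy samples
print(len(X), len(X_bal), len(X_clean))
```

Because ENN can only remove samples, the cleaned set is never larger than the oversampled one. In practice a maintained implementation is preferable to this sketch; for example, the imbalanced-learn package provides a `SMOTEENN` class in `imblearn.combine`.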
Abstract
In supervised machine learning (SML) research, large training datasets are essential for valid results. However, obtaining primary data in learning analytics (LA) is challenging. Data augmentation can address this by expanding and diversifying data, though its use in LA remains underexplored. This paper systematically compares data augmentation techniques and their impact on prediction performance in a typical LA task: prediction of academic outcomes. Augmentation is demonstrated on four SML models, which we successfully replicated from a previous LAK study based on AUC values. Among 21 augmentation techniques, SMOTE-ENN sampling performed the best, improving the average AUC by 0.01 and approximately halving the training time compared to the baseline models. In addition, we compared 99 combinations of chaining 21 techniques, and found minor, although statistically significant,…
