Predictive and explanatory models might miss informative features in educational data
Nicholas T. Young, Marcos D. Caballero

TL;DR
This simulation study examines how logistic regression, penalized regression, and random forest handle features with little variation in educational data mining. It finds that the algorithms treat features with the same odds ratio differently depending on feature and outcome imbalance, and it recommends penalized regression methods for analyzing such data.
Contribution
It systematically examines how outcome and feature imbalance affect the way models treat features in educational data mining, and proposes penalized regression as a partial remedy.
Findings
Algorithms treat features with the same odds ratio differently depending on feature imbalance and outcome imbalance.
Penalized methods such as Firth and Log-F reduce the gap between the built-in odds ratio and the estimated one.
Because of data imbalance, models may miss informative features (false negatives).
Abstract
We encounter variables with little variation often in educational data mining (EDM) due to the demographics of higher education and the questions we ask. Yet, little work has examined how to analyze such data. Therefore, we conducted a simulation study using logistic regression, penalized regression, and random forest. We systematically varied the fraction of positive outcomes, feature imbalances, and odds ratios. We find the algorithms treat features with the same odds ratios differently based on the features' imbalance and the outcome imbalance. While none of the algorithms fully solved how to handle imbalanced data, penalized approaches such as Firth and Log-F reduced the difference between the built-in odds ratio and value determined by the algorithm. Our results suggest that EDM studies might contain false negatives when determining which variables are related to an outcome. We…
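The kind of simulation the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `simulate` and `fit_odds_ratio`, the parameter values, and the single-feature setup are all assumptions made here, and `base_rate` is the outcome probability when the feature is 0 rather than the overall fraction of positive outcomes. The sketch draws one imbalanced binary feature with a built-in odds ratio, then recovers that odds ratio with plain logistic regression and with Firth's penalized variant (the standard modified-score iteration).

```python
import numpy as np


def simulate(n, feature_frac, base_rate, odds_ratio, rng):
    """Draw one binary feature and a binary outcome with a built-in odds ratio."""
    x = rng.binomial(1, feature_frac, size=n)        # feature imbalance
    intercept = np.log(base_rate / (1 - base_rate))  # log-odds of outcome when x == 0
    logit = intercept + np.log(odds_ratio) * x
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
    return x, y


def fit_odds_ratio(x, y, firth=False, n_iter=100, tol=1e-10):
    """Estimate the feature's odds ratio by Fisher scoring.

    With firth=True the score is modified by the hat-matrix diagonal,
    which is Firth's bias-reducing penalty for logistic regression.
    """
    X = np.column_stack([np.ones(len(x)), x]).astype(float)
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        w = p * (1 - p)
        info = X.T @ (X * w[:, None])                # Fisher information X'WX
        resid = y - p
        if firth:
            # Diagonal of W^(1/2) X (X'WX)^(-1) X' W^(1/2)
            h = w * np.einsum("ij,jk,ik->i", X, np.linalg.inv(info), X)
            resid = resid + h * (0.5 - p)
        step = np.linalg.solve(info, X.T @ resid)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return float(np.exp(beta[1]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Illustrative settings only: 5% of cases have the feature, built-in odds ratio of 2.
    x, y = simulate(n=2000, feature_frac=0.05, base_rate=0.2, odds_ratio=2.0, rng=rng)
    print("plain logistic:", fit_odds_ratio(x, y))
    print("Firth penalized:", fit_odds_ratio(x, y, firth=True))
```

Repeating the draw over many seeds and over a grid of `feature_frac`, `base_rate`, and `odds_ratio` values would give a rough picture of how far each estimate drifts from the built-in value. Note that the plain fit can fail to converge when a cell of the feature-by-outcome table is empty, which is more likely as the feature becomes rarer; in this one-feature case the Firth estimate is the same as adding 0.5 to each cell of that table, so it stays finite.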
Taxonomy
Topics: Imbalanced Data Classification Techniques · Online Learning and Analytics · Statistical Methods in Epidemiology
