Synthetic Data in Education: Empirical Insights from Traditional Resampling and Deep Generative Models

Tapiwa Amion Chinodakufa; Ashfaq Ali Shafin; Khandaker Mamun Ahmed

arXiv:2604.21031·cs.LG·April 24, 2026

TL;DR

This paper benchmarks traditional resampling methods against deep generative models for synthesizing educational data. It finds a trade-off between utility and privacy and recommends Variational Autoencoders (VAEs) as a balanced solution.

Contribution

The paper presents what the authors describe as the first systematic benchmark comparing resampling methods with deep generative models for educational data synthesis, and it offers practical guidance on the trade-off between privacy and utility.

Findings

01. Resampling methods achieve near-perfect utility (TSTR: 0.997) but provide no privacy protection.

02. Deep learning models such as VAEs offer strong privacy at the cost of lower utility.

03. Variational Autoencoders balance utility (83.3%) with privacy protection.
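The privacy side of these findings can be illustrated with a Distance to Closest Record (DCR) check. The following is a minimal sketch, not the paper's implementation: the toy records, the `distance_to_closest_record` helper, and the noise-perturbed rows standing in for a generative model are all assumptions made for illustration. It shows why a copying method such as bootstrap scores a DCR of zero: every synthetic row is an exact duplicate of a real one. (SMOTE interpolates between records rather than copying them, so it would not produce exact zeros in the same way.)

```python
import math
import random

def distance_to_closest_record(synthetic, real):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]

rng = random.Random(0)

# Hypothetical toy records: (study_hours, exam_score).
real = [(rng.uniform(0, 10), rng.uniform(40, 100)) for _ in range(200)]

# Bootstrap: draw real rows with replacement, so each synthetic row is a copy.
bootstrap = [rng.choice(real) for _ in range(200)]

# Assumed stand-in for a generative model: real rows perturbed with Gaussian noise.
perturbed = [(h + rng.gauss(0, 0.5), s + rng.gauss(0, 2.0)) for h, s in real]

dcr_bootstrap = distance_to_closest_record(bootstrap, real)
dcr_perturbed = distance_to_closest_record(perturbed, real)

print("max DCR, bootstrap:", max(dcr_bootstrap))
print("mean DCR, perturbed:", sum(dcr_perturbed) / len(dcr_perturbed))
```

A DCR of zero means a synthetic record can be matched exactly to a real student, which is the sense in which copying-based resampling gives no privacy protection. Larger distances suggest, but do not prove, lower re-identification risk.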

Abstract

Synthetic data generation offers promise for addressing data scarcity and privacy concerns in educational technology, yet practitioners lack empirical guidance for selecting between traditional resampling techniques and modern deep learning approaches. This study presents the first systematic benchmark comparing these paradigms using a 10,000-record student performance dataset. We evaluate three resampling methods (SMOTE, Bootstrap, Random Oversampling) against three deep learning models (Autoencoder, Variational Autoencoder, Copula-GAN) across multiple dimensions: distributional fidelity (Kolmogorov-Smirnov distance, Jensen-Shannon divergence), machine learning utility such as Train-on-Synthetic-Test-on-Real scores (TSTR), and privacy preservation (Distance to Closest Record). Our findings reveal a fundamental trade-off: resampling methods achieve near-perfect utility (TSTR: 0.997) but…
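The fidelity and utility metrics named in the abstract can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the authors' evaluation code: the toy dataset, the jittered "synthetic" rows, the 10-bin histogram, and the nearest-centroid classifier used for TSTR are all hypothetical choices, and the abstract does not say which downstream model the paper used.

```python
import math
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

def shared_histograms(a, b, bins=10):
    """Bin both samples on one common grid so the histograms are comparable."""
    lo, hi = min(a + b), max(a + b)
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def hist(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(values) for c in counts]

    return hist(a), hist(b)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def centroids(rows):
    """Per-class feature means for rows of the form (features, label)."""
    groups = {}
    for features, label in rows:
        groups.setdefault(label, []).append(features)
    return {label: tuple(sum(col) / len(col) for col in zip(*feats))
            for label, feats in groups.items()}

def tstr_accuracy(synthetic, real):
    """Train a nearest-centroid classifier on synthetic rows, score it on real rows."""
    model = centroids(synthetic)
    hits = sum(
        min(model, key=lambda c: math.dist(model[c], features)) == label
        for features, label in real
    )
    return hits / len(real)

rng = random.Random(0)

def draw(label):
    # Hypothetical features: (study_hours, exam_score); label 1 = pass.
    mean_hours, mean_score = (7.0, 80.0) if label else (3.0, 55.0)
    return (rng.gauss(mean_hours, 1.0), rng.gauss(mean_score, 8.0)), label

real = [draw(i % 2) for i in range(300)]
# Assumed stand-in generator: jitter each real row instead of sampling a fitted model.
synthetic = [((h + rng.gauss(0, 0.5), s + rng.gauss(0, 3.0)), y) for (h, s), y in real]

real_scores = [f[1] for f, _ in real]
synth_scores = [f[1] for f, _ in synthetic]
ks = ks_statistic(real_scores, synth_scores)
jsd = js_divergence(*shared_histograms(real_scores, synth_scores))
tstr = tstr_accuracy(synthetic, real)
print(f"KS={ks:.3f}  JSD={jsd:.3f}  TSTR={tstr:.3f}")
```

Lower KS and JS values indicate closer distributional fidelity, while a TSTR score near the accuracy obtained by training on real data indicates that the synthetic data preserves the signal a downstream model needs. In practice, library routines such as `scipy.stats.ks_2samp` would normally replace the hand-written KS function.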
