Stable and Privacy-Preserving Synthetic Educational Data with Empirical Marginals: A Copula-Based Approach
Gabriel Diaz Ramos, Lorenzo Luzi, Debshila Basu Mallick, and Richard Baraniuk

TL;DR
This paper introduces NPGC, a copula-based method for generating privacy-preserving synthetic educational data that maintains marginal distributions and dependencies more reliably than existing deep learning approaches.
Contribution
The paper presents NPGC, a non-parametric, empirical approach that preserves data marginals and dependencies, supports heterogeneous variables, and integrates differential privacy for educational data synthesis.
Findings
NPGC remains stable over multiple regeneration cycles.
NPGC achieves competitive downstream performance.
NPGC is computationally more efficient than deep learning baselines.
Abstract
To advance Educational Data Mining (EDM) under strict privacy regulations, researchers must develop methods that enable data-driven analysis while protecting sensitive student information. Synthetic data generation is one such approach: it enables the release of statistically generated samples instead of real student records. However, existing deep learning and parametric generators often distort marginal distributions and degrade under iterative regeneration, leading to distribution drift and a progressive loss of distributional support that compromise reliability. In response, we introduce the Non-Parametric Gaussian Copula (NPGC), a plug-and-play synthesis method that replaces deep learning and parametric optimization with empirical statistical anchoring, preserving the observed marginal distributions while modeling dependencies through a copula framework. NPGC…
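The general idea of pairing empirical marginals with a Gaussian copula can be illustrated with a minimal sketch. This is not the authors' NPGC implementation: the function names `fit_copula` and `sample_copula` are hypothetical, the code assumes purely numeric columns with no missing values, and it omits the differential privacy mechanism and any handling of categorical variables that the paper describes.

```python
import numpy as np
from scipy.stats import norm, rankdata


def fit_copula(X):
    """Fit a Gaussian copula to an (n, d) numeric array.

    Hypothetical helper: returns the column-wise sorted data (used later as
    empirical quantile tables) and the correlation matrix of the latent
    normal scores.
    """
    n, _ = X.shape
    # Rank-based pseudo-observations, scaled into the open interval (0, 1).
    U = rankdata(X, method="average", axis=0) / (n + 1)
    # Map to latent standard-normal scores and estimate their correlation.
    Z = norm.ppf(U)
    corr = np.corrcoef(Z, rowvar=False)
    return np.sort(X, axis=0), corr


def sample_copula(sorted_cols, corr, m, rng):
    """Draw m synthetic rows whose values come from the observed marginals."""
    n, d = sorted_cols.shape
    # Sample correlated latent normals, then map them to uniforms.
    Z = rng.multivariate_normal(np.zeros(d), corr, size=m)
    U = norm.cdf(Z)
    # Empirical inverse CDF: each uniform selects an observed value, so no
    # synthetic value falls outside the support of the original column.
    idx = np.minimum((U * n).astype(int), n - 1)
    return np.take_along_axis(sorted_cols, idx, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    score = rng.normal(size=500)                                # continuous
    grade = np.round(0.8 * score + 0.6 * rng.normal(size=500))  # discrete
    X = np.column_stack([score, grade])
    cols, corr = fit_copula(X)
    synthetic = sample_copula(cols, corr, 1000, rng)
    print(synthetic[:5])
```

Because every synthetic value is looked up in the sorted observed column, the marginal support is preserved by construction, while the dependence between columns is carried only by the latent correlation matrix. In this simplified form the synthetic rows reuse real observed values column by column, so it provides no formal privacy guarantee on its own.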
