Privacy-Preserving Synthetic Educational Data Generation

Jill-Jênn Vie (SODA); Tomas Rigaux (SODA); Sein Minn (CEDAR)

arXiv:2207.03202·cs.CY·July 9, 2022

TL;DR

This paper introduces a privacy-preserving generative model for creating synthetic educational data, enabling research while protecting participant privacy, and provides an evaluation framework for comparing such models.

Contribution

It presents a novel generative model for educational data that ensures privacy and an evaluation framework to assess synthetic data quality and privacy guarantees.

Findings

1. Naive pseudonymization can lead to re-identification risks.
2. Proposed techniques effectively preserve privacy in synthetic data.
3. Evaluations show the method's utility on large educational datasets.

Abstract

Institutions collect massive learning traces, but they may not disclose them because of privacy concerns. Synthetic data generation opens new opportunities for research in education. In this paper, we present a generative model for educational data that can preserve the privacy of participants, and an evaluation framework for comparing synthetic data generators. We show how naive pseudonymization can lead to re-identification threats and suggest techniques to guarantee privacy. We evaluate our method on existing massive open educational datasets.
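
To make the re-identification threat concrete, here is a minimal sketch of a linkage attack on naively pseudonymized learning traces. It is an illustration only, not the paper's method or its threat model: the toy data, the record layout (student, item, correct), and the helpers `pseudonymize` and `reidentify` are all hypothetical, and it assumes an attacker who already knows a few answers given by the target student.

```python
import hashlib

def pseudonymize(student_id: str, salt: str = "secret") -> str:
    """Naive pseudonymization: replace the identifier with a salted hash."""
    return hashlib.sha256((salt + student_id).encode()).hexdigest()[:8]

# Hypothetical learning traces: (student, item, correct)
raw_traces = [
    ("alice", "q1", 1), ("alice", "q2", 0), ("alice", "q3", 1),
    ("bob", "q1", 1), ("bob", "q2", 1), ("bob", "q3", 1),
    ("carol", "q1", 0), ("carol", "q2", 0), ("carol", "q3", 1),
]

# The released dataset hides names but keeps every answer unchanged.
released = [(pseudonymize(s), item, correct) for s, item, correct in raw_traces]

def reidentify(released, known):
    """Return pseudonyms whose traces contain every known (item, correct) pair."""
    by_pseudonym = {}
    for pid, item, correct in released:
        by_pseudonym.setdefault(pid, set()).add((item, correct))
    return [pid for pid, obs in by_pseudonym.items() if known <= obs]

# Assumed side knowledge about Alice: right on q1, wrong on q2.
known_about_alice = {("q1", 1), ("q2", 0)}
candidates = reidentify(released, known_about_alice)
print(candidates)
```

In this toy dataset only one pseudonym is consistent with the attacker's side knowledge, so every other row under that pseudonym is exposed even though the identifier itself was never released.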

Code & Models

Repositories

akulen/privgen
PyTorch · Official
