Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation

Mohammad Khalil; Sam Urmian; Ronas Shakya; Qinyi Liu

arXiv:2501.01793·cs.LG·May 21, 2026

TL;DR

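This paper investigates whether the GAN model CTGAN and three LLMs (GPT2, DistilGPT2, and DialoGPT) can generate high-quality synthetic tabular student data, offering an alternative to real student records whose use in learning analytics is limited by privacy concerns and data protection regulations.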

Contribution

It shows that CTGAN and three LLMs (GPT2, DistilGPT2, and DialoGPT) have strong potential to produce synthetic tabular student datasets that resemble real student data for learning analytics applications.

Findings

01 Synthetic data closely resembles real student data in statistical properties.

02 LLMs outperform traditional GANs in certain utility metrics.

03 Synthetic datasets improve model training while preserving privacy.

Abstract

In this study, we explore the growing potential of AI and deep learning technologies, particularly Generative Adversarial Networks (GANs) and Large Language Models (LLMs), for generating synthetic tabular data. Access to quality student data is critical for advancing learning analytics, but privacy concerns and stricter data protection regulations worldwide limit its availability and usage. Synthetic data offers a promising alternative. We investigate whether synthetic data can be leveraged to create artificial students for serving learning analytics models. Using the popular GAN model CTGAN and three LLMs (GPT2, DistilGPT2, and DialoGPT), we generate synthetic tabular student data. Our results demonstrate the strong potential of these methods to produce high-quality synthetic datasets that resemble real student data. To validate our findings, we apply a comprehensive set of utility…

Code & Models

Repositories

mohdkhalil/repository-supplementary-for-lak-25-paper--creating-artificial-students-that-never-existed
none · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training