# Imputing Knowledge Tracing Data with Subject-Based Training via LSTM Variational Autoencoders Frameworks

**Authors:** Jia Tracy Shen, Dongwon Lee

arXiv: 2302.12910 · 2023-02-28

## TL;DR

This paper introduces a subject-based data imputation method using LSTM-augmented variational autoencoders to improve knowledge tracing models, and reports that imputing with the generated data boosts the original model's performance by about 50%.

## Contribution

It proposes a novel subject-based training approach combined with LSTM-VAE and LVAE frameworks for effective data imputation in knowledge tracing.

## Key findings

- Data generated by LSTM-VAE and LSTM-LVAE boosts the original model's performance by about 50%.
- Subject-based training retains complete student sequences for better learning.
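- With the proposed frameworks, the original model needs only 10% more student data to surpass its original performance when the prediction model is small, and 50% more when it is large.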

## Abstract

The issue of missing data poses a great challenge to boosting the performance and application of deep learning models in the *Knowledge Tracing* (KT) problem. However, there has been a lack of understanding of the issue in the literature. In this work, to address this challenge, we adopt a subject-based training method that splits and imputes data by student IDs, instead of splitting by row number, which we call non-subject-based training. The benefit of subject-based training is that it retains the complete sequence for each student and hence achieves efficient training. Further, we leverage two existing deep generative frameworks, namely Variational Autoencoders (VAE) and Longitudinal Variational Autoencoders (LVAE), and build LSTM kernels into them to form LSTM-VAE and LSTM-LVAE (noted as VAE and LVAE for simplicity) models to generate quality data. In LVAE, a Gaussian Process (GP) model is trained to disentangle the correlation between the subject (i.e., student) descriptor information (e.g., age, gender) and the latent space. Finally, the paper compares the model performance between training on the original data and training on data imputed with generated data from the non-subject-based model VAE-NS and the subject-based training models (i.e., VAE and LVAE). We demonstrate that the generated data from LSTM-VAE and LSTM-LVAE can boost the original model performance by about 50%. Moreover, with our proposed frameworks, the original model needs just 10% more student data to surpass the original performance if the prediction model is small, and 50% more data if the prediction model is large.
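
The difference between subject-based and non-subject-based splitting can be illustrated with a minimal sketch. This is not the authors' code: the field names (`student_id`, `step`, `correct`), the toy data, the shuffling, and the split sizes are all assumptions made for illustration. The abstract only states that subject-based training splits by student IDs, while non-subject-based training splits by row number.

```python
import random
from collections import defaultdict


def subject_based_split(rows, n_train_students, seed=0):
    """Split by student ID so every student's full sequence lands on one side.

    The field name `student_id` and the random shuffle are illustrative assumptions.
    """
    by_student = defaultdict(list)
    for row in rows:
        by_student[row["student_id"]].append(row)

    student_ids = sorted(by_student)
    random.Random(seed).shuffle(student_ids)

    train_ids = student_ids[:n_train_students]
    test_ids = student_ids[n_train_students:]
    train = [r for sid in train_ids for r in by_student[sid]]
    test = [r for sid in test_ids for r in by_student[sid]]
    return train, test


def row_based_split(rows, n_train_rows):
    """Non-subject-based split: cut at a row index, ignoring student boundaries."""
    return rows[:n_train_rows], rows[n_train_rows:]


def students_in(rows):
    """Return the set of student IDs that appear in a list of rows."""
    return {r["student_id"] for r in rows}


# Toy interaction log (hypothetical): 5 students, 4 time steps each, ordered by student.
rows = [
    {"student_id": s, "step": t, "correct": (s + t) % 2}
    for s in range(5)
    for t in range(4)
]

train, test = subject_based_split(rows, n_train_students=3)
r_train, r_test = row_based_split(rows, n_train_rows=14)

# Students whose sequence is cut in two by each splitting strategy.
print("subject-based overlap:", students_in(train) & students_in(test))
print("row-based overlap:", students_in(r_train) & students_in(r_test))
```

In this toy log the row-based cut falls in the middle of one student's sequence, so that student appears on both sides with a truncated history, whereas the subject-based split keeps every sequence whole.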

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.12910/full.md

## Figures

14 figures with captions in the complete paper: https://tomesphere.com/paper/2302.12910/full.md

## References

19 references — full list in the complete paper: https://tomesphere.com/paper/2302.12910/full.md

---
Source: https://tomesphere.com/paper/2302.12910