Synthetic Data Generation with LLM for Improved Depression Prediction
Andrea Kang, Jun Yu Chen, Zoe Lee-Youngzie, Shuhao Fu

TL;DR
This paper presents a novel pipeline using Large Language Models to generate synthetic clinical interview data, improving depression prediction accuracy while addressing data privacy and scarcity issues.
Contribution
The study introduces a chain-of-thought prompting method with LLMs to create synthetic, privacy-preserving data that balances depression severity distribution for better model training.
Findings
Synthetic data achieved high fidelity and privacy metrics
Balanced depression severity distribution improved prediction performance
Method effectively addresses data scarcity and privacy concerns
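As an illustration of what balancing the severity distribution could involve, the sketch below counts how many synthetic samples each severity class would need to match the largest class. The class names and the "match the largest class" target are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter


def synthetic_quota(severity_labels):
    """Return how many synthetic samples each class needs to match the largest class.

    The balancing target (size of the most frequent class) is an assumption
    for illustration; the paper may use a different scheme.
    """
    counts = Counter(severity_labels)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


# Hypothetical severity labels for a small, imbalanced dataset.
labels = ["minimal"] * 5 + ["mild"] * 3 + ["severe"] * 1
quota = synthetic_quota(labels)
print(quota)
```

Classes with a quota of zero need no synthetic samples; the remaining quotas say how many generated records would even out the distribution.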
Abstract
Automatic detection of depression is a rapidly growing field of research at the intersection of psychology and machine learning. However, this surge of interest brings growing concerns about data privacy and data scarcity, given the sensitivity of the topic. In this paper, we propose a pipeline that uses Large Language Models (LLMs) to generate synthetic data for improving the performance of depression prediction models. Starting from unstructured, naturalistic text data from recorded transcripts of clinical interviews, we use an open-source LLM to generate synthetic data through chain-of-thought prompting. The pipeline involves two key steps: the first generates a synopsis and sentiment analysis from the original transcript and depression score, while the second generates the synthetic synopsis/sentiment analysis based on the summaries generated in the…
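The two-step structure described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the prompt wording, the function names, and the `llm` callable (any function mapping a prompt string to a completion string) are all hypothetical, and the sketch assumes the second step consumes the output of the first.

```python
from typing import Callable


def build_step1_prompt(transcript: str, depression_score: int) -> str:
    # Step 1 (hypothetical wording): summarize the real interview, given its score.
    return (
        "Read the clinical interview transcript and its depression score.\n"
        "Think step by step, then write a synopsis and a sentiment analysis.\n"
        f"Depression score: {depression_score}\n"
        f"Transcript:\n{transcript}\n"
    )


def build_step2_prompt(summary: str) -> str:
    # Step 2 (hypothetical wording): produce a synthetic counterpart of the summary.
    return (
        "Below is a synopsis and sentiment analysis of a clinical interview.\n"
        "Think step by step, then write a new synthetic synopsis and sentiment\n"
        "analysis that does not reveal details of the original participant.\n"
        f"Summary:\n{summary}\n"
    )


def run_pipeline(transcript: str, depression_score: int,
                 llm: Callable[[str], str]) -> str:
    """Chain the two prompts; `llm` is a placeholder for any text-generation call."""
    summary = llm(build_step1_prompt(transcript, depression_score))
    return llm(build_step2_prompt(summary))


def placeholder_llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs without any dependency.
    return f"[model output for a prompt of {len(prompt)} characters]"


synthetic = run_pipeline("Interviewer: How have you been sleeping?", 12, placeholder_llm)
```

Keeping the model behind a plain callable means the same chaining logic works with any open-source LLM wrapper, and the original transcript only ever enters the first prompt.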
