Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation

Adam Jakobsen; Sushant Gautam; Hugo Lewi Hammer; Susanne Olofsdotter; Miriam S Johanson; Pål Halvorsen; Vajira Thambawita

arXiv:2603.25186 · cs.LG · March 27, 2026

TL;DR

This paper introduces a zero-shot, knowledge-guided framework using large language models to generate privacy-preserving synthetic psychiatric data, outperforming traditional models in fidelity and privacy risk when real data sharing is restricted.

Contribution

The study presents a novel knowledge-guided LLM approach for synthetic psychiatric data generation that enhances data fidelity and privacy compared to existing deep learning models.

Findings

01. Knowledge-guided LLM achieves competitive pairwise structure fidelity.

02. Clinical retrieval improves univariate and pairwise data fidelity.

03. Real data-free LLM shows low privacy risk similar to state-of-the-art models.

Abstract

AI systems in healthcare research have shown potential to increase patient throughput and assist clinicians, yet progress is constrained by limited access to real patient data. To address this issue, we present a zero-shot, knowledge-guided framework for psychiatric tabular data in which large language models (LLMs) are steered via Retrieval-Augmented Generation using the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the International Classification of Diseases (ICD-10). We conducted experiments using different combinations of knowledge bases to generate privacy-preserving synthetic data. The resulting models were benchmarked against two state-of-the-art deep learning models for synthetic tabular data generation, namely CTGAN and TVAE, both of which rely on real data and therefore entail potential privacy risks. Evaluation was performed on six anxiety-related…
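The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the passages are hypothetical placeholders rather than text from DSM-5 or ICD-10, the token-overlap ranking is a simple stand-in for whatever retriever the paper uses, and the column names and function names (`retrieve`, `build_prompt`) are assumptions made for illustration. The `sources` argument mirrors the idea of experimenting with different combinations of knowledge bases. No LLM is called; the sketch only assembles a prompt that contains retrieved clinical context and no real patient records.

```python
import re
from collections import Counter

# Hypothetical placeholder passages; these are NOT quotes from DSM-5 or ICD-10.
KNOWLEDGE_BASE = [
    {"source": "DSM-5", "text": "placeholder passage about generalized anxiety and excessive worry"},
    {"source": "ICD-10", "text": "placeholder passage about panic attacks and palpitations"},
    {"source": "DSM-5", "text": "placeholder passage about social anxiety and fear of scrutiny"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, passages, k=2, sources=None):
    """Return up to k passages ranked by token overlap with the query.

    `sources` optionally restricts retrieval to a subset of knowledge bases.
    Passages with zero overlap are dropped.
    """
    query_counts = Counter(tokenize(query))
    pool = [p for p in passages if sources is None or p["source"] in sources]
    scored = []
    for index, passage in enumerate(pool):
        overlap = query_counts & Counter(tokenize(passage["text"]))
        scored.append((sum(overlap.values()), index, passage))
    # Highest overlap first; ties keep the original order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [passage for score, _, passage in scored[:k] if score > 0]


def build_prompt(columns, passages, n_rows=5):
    """Assemble a zero-shot prompt from retrieved context (no real data included)."""
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in passages)
    return (
        "Use only the clinical context below; no real patient records are provided.\n"
        f"Context:\n{context}\n\n"
        f"Generate {n_rows} synthetic rows as CSV with header: {','.join(columns)}\n"
    )


if __name__ == "__main__":
    # Hypothetical column names, chosen only for this example.
    hits = retrieve("panic attacks", KNOWLEDGE_BASE, k=1)
    print(build_prompt(["age", "panic_score"], hits, n_rows=3))
```

Restricting `sources` to a single knowledge base, for example `retrieve(query, KNOWLEDGE_BASE, sources={"DSM-5"})`, gives one way to compare prompts built from different knowledge-base combinations.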

Taxonomy

Topics: Machine Learning in Healthcare · Mental Health via Writing · Digital Mental Health Interventions