LLM-based Privacy Data Augmentation Guided by Knowledge Distillation with a Distribution Tutor for Medical Text Classification
Yiping Song, Juhua Zhang, Zhiliang Tian, Yuxin Yang, Minlie Huang, Dongsheng Li

TL;DR
This paper introduces a novel privacy-preserving data augmentation method for medical text classification using large language models, knowledge distillation, and a distribution tutor to generate pseudo samples with privacy guarantees.
Contribution
It proposes a DP-based data augmentation framework utilizing a knowledge distillation model and a distribution tutor to enhance privacy and data quality in private domain text classification.
Findings
The method effectively generates private pseudo samples with strong privacy guarantees.
Empirical results show improved classification performance with privacy protection.
Theoretical analysis confirms the privacy bounds of the proposed approach.
Abstract
As sufficient data are not always publicly accessible for model training, researchers exploit limited data with advanced learning algorithms or expand the dataset via data augmentation (DA). Conducting DA in private domains requires privacy protection approaches (i.e., anonymization and perturbation), but those methods cannot provide protection guarantees. Differential privacy (DP) learning methods theoretically bound the protection but are not skilled at generating pseudo text samples with large models. In this paper, we transfer the DP-based pseudo sample generation task to a DP-based generated sample discrimination task, where we propose a DP-based DA method with an LLM and a DP-based discriminator for text classification on private domains. We construct a knowledge distillation model as the DP-based discriminator: teacher models, accessing private data, teach students how to select…
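The abstract describes teacher models that see private data and guide the selection of LLM-generated samples under DP. The truncated text does not specify the mechanism, so the following is only a minimal sketch under stated assumptions: a PATE-style noisy vote in which each hypothetical teacher casts a binary accept/reject vote on a candidate, and Laplace noise is added to the accept count before thresholding. The names `laplace_noise`, `noisy_select`, `teacher_votes`, `epsilon`, and `threshold` are illustrative and do not come from the paper.

```python
import random


def laplace_noise(rng, scale):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_select(teacher_votes, epsilon, threshold, rng):
    """Return indices of candidates whose noisy accept count reaches `threshold`.

    teacher_votes[i][j] is 1 if hypothetical teacher j accepts candidate i, else 0.
    Assumption: teachers are trained on disjoint private partitions, so one
    private record changes at most one vote and the count has sensitivity 1.
    """
    scale = 1.0 / epsilon  # Laplace scale for a sensitivity-1 count, per query
    selected = []
    for i, votes in enumerate(teacher_votes):
        noisy_count = sum(votes) + laplace_noise(rng, scale)
        if noisy_count >= threshold:
            selected.append(i)
    return selected
```

Here `epsilon` is the budget spent on a single candidate query; the sketch does not account for composition across many queries, which a full privacy analysis such as the one the paper reports would need to handle.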
Taxonomy
Topics: Privacy-Preserving Technologies in Data · AI in cancer detection · Imbalanced Data Classification Techniques
Methods: Knowledge Distillation
