LLM-based Privacy Data Augmentation Guided by Knowledge Distillation with a Distribution Tutor for Medical Text Classification

Yiping Song; Juhua Zhang; Zhiliang Tian; Yuxin Yang; Minlie Huang; Dongsheng Li

arXiv:2402.16515 · cs.CL · February 27, 2024 · 1 citation

Open Access

TL;DR

This paper introduces a novel privacy-preserving data augmentation method for medical text classification using large language models, knowledge distillation, and a distribution tutor to generate pseudo samples with privacy guarantees.

Contribution

It proposes a differential privacy (DP)-based data augmentation framework that uses a knowledge distillation model and a distribution tutor to improve privacy and data quality in private-domain text classification.

Findings

01

The method effectively generates private pseudo samples with strong privacy guarantees.

02

Empirical results show improved classification performance with privacy protection.

03

Theoretical analysis confirms the privacy bounds of the proposed approach.

Abstract

As sufficient data are not always publicly accessible for model training, researchers exploit limited data with advanced learning algorithms or expand the dataset via data augmentation (DA). Conducting DA in a private domain requires privacy protection approaches (i.e., anonymization and perturbation), but those methods cannot provide protection guarantees. Differential privacy (DP) learning methods theoretically bound the protection but are not well suited to generating pseudo text samples with large models. In this paper, we recast the DP-based pseudo-sample generation task as a DP-based discrimination task over generated samples, and we propose a DP-based DA method with an LLM and a DP-based discriminator for text classification on private domains. We construct a knowledge distillation model as the DP-based discriminator: teacher models, which access the private data, teach students how to select…
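
The abstract is cut off before it describes how the teachers guide selection, so the following is only a generic illustration, not the paper's algorithm. It sketches one common way a teacher ensemble can vote privately on which LLM-generated candidate to keep: Laplace noise is added to the per-candidate vote counts and the noisy maximum is returned. The function names, the vote format, and the choice of Laplace noise are all assumptions made for this sketch.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent Exp(1) draws is Laplace(0, 1),
    so rescaling gives the desired spread.
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_select(teacher_votes, num_candidates, epsilon, rng):
    """Return the index of the candidate with the highest noisy vote count.

    teacher_votes: one candidate index per teacher (hypothetical format).
    epsilon: per-query privacy budget; must be positive.

    Assumption: each private record influences at most one teacher, so
    changing a record moves at most two counts by one each. Under that
    assumption, Laplace noise of scale 2/epsilon on every count makes
    this noisy-max query epsilon-DP.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    counts = Counter(teacher_votes)
    scale = 2.0 / epsilon
    noisy = [
        counts.get(i, 0) + laplace_noise(scale, rng)
        for i in range(num_candidates)
    ]
    return max(range(num_candidates), key=noisy.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    # Six hypothetical teachers vote among three LLM-generated candidates.
    votes = [2, 2, 2, 0, 1, 2]
    print(noisy_select(votes, num_candidates=3, epsilon=1.0, rng=rng))
```

A smaller `epsilon` injects more noise, so the selected candidate agrees with the plain majority less often; a very large `epsilon` effectively recovers the majority vote. Only the noisy winner is released, which is what lets a student learn from the teachers without seeing the private data directly.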

Taxonomy

Topics: Privacy-Preserving Technologies in Data · AI in cancer detection · Imbalanced Data Classification Techniques

Methods: Knowledge Distillation