Safe Training with Sensitive In-domain Data: Leveraging Data Fragmentation To Mitigate Linkage Attacks

Mariia Ignashina; Julia Ive

arXiv:2404.19486·cs.CL·May 1, 2024

TL;DR

This paper proposes a method to enhance privacy in text generation models by training on fragmented, domain-specific data to prevent sensitive information leakage and linkage attacks.

Contribution

It introduces a data fragmentation approach for training language models, reducing re-identification risk while maintaining classification performance.

Findings

1. Fragmented data training achieves comparable results to full data training.
2. Models trained on fragments are less susceptible to linkage attacks.
3. Fine-tuned models effectively predict cardiovascular diagnoses.

Abstract

Current text generation models are trained on real data that may contain sensitive information, such as confidential patient information. Under certain conditions, a model can be triggered to output training data it has memorised, exposing that sensitive data. To mitigate this risk, we propose a safer alternative in which fragmented data, in the form of domain-specific short phrases randomly grouped together, is shared instead of full texts. Text fragments that could re-identify an individual therefore cannot be reproduced by the model in one sequence, which gives significant protection against linkage attacks. We fine-tune several state-of-the-art LLMs using meaningful syntactic chunks to explore their utility. In particular, we fine-tune BERT-based models to predict two cardiovascular diagnoses. Our results demonstrate the capacity of LLMs to benefit from the…
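
The fragmentation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the paper refers to meaningful syntactic chunks, whereas the hypothetical `split_into_fragments` helper below simply splits on punctuation as a stand-in, and the two input texts, the `group_size` parameter, and the function names are all invented for illustration.

```python
import random
import re


def split_into_fragments(text):
    """Naive stand-in for syntactic chunking (an assumption): split on punctuation."""
    parts = re.split(r"[.,;:]", text)
    return [p.strip() for p in parts if p.strip()]


def fragment_and_group(documents, group_size=3, seed=0):
    """Pool the fragments of all documents, shuffle them, and regroup at random."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    pool = [frag for doc in documents for frag in split_into_fragments(doc)]
    rng.shuffle(pool)
    return [pool[i:i + group_size] for i in range(0, len(pool), group_size)]


# Invented toy records, not real patient data.
documents = [
    "72-year-old male, retired pilot, admitted with chest pain, history of hypertension",
    "45-year-old female, school teacher, presents with palpitations, no prior cardiac history",
]

groups = fragment_and_group(documents, group_size=3, seed=0)
for group in groups:
    print(" | ".join(group))
```

Each printed group mixes short phrases drawn from the shared pool, so a training sequence no longer presents one record's details in their original order. Random grouping alone does not guarantee that two fragments from the same record never land in the same group; a real system would need to check that explicitly.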

Taxonomy

Topics: Adversarial Robustness in Machine Learning