Comparing Reconstruction Attacks on Pretrained Versus Full Fine-tuned Large Language Model Embeddings on Homo Sapiens Splice Sites Genomic Data
Reem Al-Saidi, Erman Ayday, Ziad Kobti

TL;DR
This paper compares the vulnerability of pretrained and fine-tuned large language model embeddings to reconstruction attacks on genomic data, revealing that fine-tuning generally enhances privacy protection across multiple architectures.
Contribution
It extends Pan et al.'s reconstruction attack work by applying the attack to both pretrained and fine-tuned model embeddings, using DNA-specific tokenization, and analyzing how privacy shifts across genomic positions.
Findings
Fine-tuning increases resistance to reconstruction attacks in XLNet, GPT-2, and BERT.
Specialized DNA tokenization improves model processing of genomic sequences.
Task-specific fine-tuning can serve as a privacy-enhancing mechanism.
Abstract
This study investigates embedding reconstruction attacks in large language models (LLMs) applied to genomic sequences, with a specific focus on how fine-tuning affects vulnerability to these attacks. Building upon Pan et al.'s seminal work demonstrating that embeddings from pretrained language models can leak sensitive information, we conduct a comprehensive analysis using the HS3D genomic dataset to determine whether task-specific optimization strengthens or weakens privacy protections. Our research extends Pan et al.'s work in three significant dimensions. First, we apply their reconstruction attack pipeline to pretrained and fine-tuned model embeddings, addressing a critical gap in their methodology that did not specify embedding types. Second, we implement specialized tokenization mechanisms tailored specifically for DNA sequences, enhancing the model's ability to process genomic…
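To make the idea of DNA-specific tokenization concrete, here is a minimal sketch in Python. It assumes overlapping k-mer tokenization, a common choice for genomic language models. The abstract does not say which scheme the authors used, so the function name `kmer_tokenize` and the parameters `k` and `stride` are hypothetical and for illustration only.

```python
from typing import List


def kmer_tokenize(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a DNA sequence into k-mers (hypothetical helper, not the paper's code).

    Assumption: tokens are fixed-length substrings of length k, taken every
    `stride` bases. With stride=1 the k-mers overlap; with stride=k they do not.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()  # normalize case so 'acgt' and 'ACGT' tokenize alike
    # Stop at len(seq) - k so every token has exactly k bases.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    print(kmer_tokenize("ACGTAG", k=3))
```

A sequence shorter than `k` yields an empty list under this sketch. A real pipeline would also need a policy for ambiguous bases such as `N`, which is not handled here.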
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Genomics and Rare Diseases · Adversarial Robustness in Machine Learning
