Comparing Reconstruction Attacks on Pretrained Versus Full Fine-tuned Large Language Model Embeddings on Homo Sapiens Splice Sites Genomic Data

Reem Al-Saidi; Erman Ayday; Ziad Kobti

arXiv:2511.07481·cs.LG·November 12, 2025

Open Access

TL;DR

This paper compares the vulnerability of pretrained and fine-tuned large language model embeddings to reconstruction attacks on genomic data, revealing that fine-tuning generally enhances privacy protection across multiple architectures.

Contribution

It extends Pan et al.'s embedding reconstruction attack by applying it to both pretrained and fine-tuned model embeddings, using DNA-specific tokenization, and analyzing how privacy leakage shifts across genomic positions.
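To make "DNA-specific tokenization" concrete, here is a minimal sketch of overlapping k-mer tokenization, one common way to tokenize DNA. This summary does not say which scheme the paper uses, so the k-mer approach, the function names `kmer_tokenize` and `build_vocab`, and the default `k=3` are all assumptions for illustration.

```python
from typing import Dict, List


def kmer_tokenize(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a DNA string into k-mers (assumed scheme, not the paper's)."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # Slide a window of width k over the sequence, moving `stride` bases each step.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Assign each distinct token an integer id in first-seen order."""
    vocab: Dict[str, int] = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


tokens = kmer_tokenize("ACGTAG", k=3)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
print(tokens, ids)
```

With `stride=1` the k-mers overlap, so each nucleotide appears in up to `k` tokens; setting `stride=k` gives non-overlapping tokens instead.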

Findings

01. Fine-tuning increases resistance to reconstruction attacks in XLNet, GPT-2, and BERT.

02. Specialized DNA tokenization improves model processing of genomic sequences.

03. Task-specific fine-tuning can serve as a privacy-enhancing mechanism.

Abstract

This study investigates embedding reconstruction attacks in large language models (LLMs) applied to genomic sequences, with a specific focus on how fine-tuning affects vulnerability to these attacks. Building upon Pan et al.'s seminal work demonstrating that embeddings from pretrained language models can leak sensitive information, we conduct a comprehensive analysis using the HS3D genomic dataset to determine whether task-specific optimization strengthens or weakens privacy protections. Our research extends Pan et al.'s work in three significant dimensions. First, we apply their reconstruction attack pipeline to pretrained and fine-tuned model embeddings, addressing a critical gap in their methodology that did not specify embedding types. Second, we implement specialized tokenization mechanisms tailored specifically for DNA sequences, enhancing the model's ability to process genomic…
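The general shape of a per-position embedding reconstruction attack can be sketched as follows: the attacker sees only embeddings, trains a classifier on sequences it controls, and then tries to recover the nucleotide at one target position. This is a toy illustration, not the paper's pipeline. The embedding here is a random lookup-and-sum stand-in for a language model (`toy_embed`), the attack model is a nearest-centroid classifier substituted for whatever classifier the authors use, and every name and parameter is hypothetical.

```python
import random
from typing import Dict, List

ALPHABET = "ACGT"


def toy_embed(seq: str, table: List[List[List[float]]]) -> List[float]:
    """Stand-in embedding: sum one random vector per (position, nucleotide)."""
    dim = len(table[0][0])
    out = [0.0] * dim
    for pos, nuc in enumerate(seq):
        vec = table[pos][ALPHABET.index(nuc)]
        for d in range(dim):
            out[d] += vec[d]
    return out


def fit_centroids(embs: List[List[float]], labels: List[str]) -> Dict[str, List[float]]:
    """Attack model: mean embedding per nucleotide at the target position."""
    dim = len(embs[0])
    sums = {n: [0.0] * dim for n in ALPHABET}
    counts = {n: 0 for n in ALPHABET}
    for emb, lab in zip(embs, labels):
        counts[lab] += 1
        for d in range(dim):
            sums[lab][d] += emb[d]
    return {n: [s / max(counts[n], 1) for s in sums[n]] for n in ALPHABET}


def predict(centroids: Dict[str, List[float]], emb: List[float]) -> str:
    """Guess the nucleotide whose centroid is closest to the embedding."""
    def sq_dist(c: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(emb, c))
    return min(ALPHABET, key=lambda n: sq_dist(centroids[n]))


def run_demo(seed: int = 0, n_train: int = 2000, n_test: int = 500,
             length: int = 8, dim: int = 32, position: int = 3) -> float:
    rng = random.Random(seed)
    # One random vector per (position, nucleotide); plays the role of the model.
    table = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in ALPHABET]
             for _ in range(length)]

    def sample(n: int) -> List[str]:
        return ["".join(rng.choice(ALPHABET) for _ in range(length)) for _ in range(n)]

    train, test = sample(n_train), sample(n_test)
    centroids = fit_centroids([toy_embed(s, table) for s in train],
                              [s[position] for s in train])
    hits = sum(predict(centroids, toy_embed(s, table)) == s[position] for s in test)
    return hits / n_test


accuracy = run_demo()
print(f"per-position reconstruction accuracy: {accuracy:.2f} (chance = 0.25)")
```

Comparing pretrained and fine-tuned embeddings would then amount to running the same attack against each embedding function and comparing accuracy against the 0.25 chance level for a four-letter alphabet.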

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Genomics and Rare Diseases · Adversarial Robustness in Machine Learning