How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences
Sofiane Ouaari, Jules Kreuer, Nico Pfeifer

TL;DR
This paper investigates the privacy of DNA embeddings generated by foundation models. It finds that most are vulnerable to inversion attacks that can reconstruct the original sequences almost perfectly, which highlights the need for privacy-aware design.
Contribution
It provides the first comprehensive evaluation of DNA model inversion risks, demonstrating vulnerabilities and proposing considerations for privacy-preserving genomic embeddings.
Findings
Per-token embeddings enable near-perfect sequence reconstruction.
Mean-pooled embeddings' reconstruction quality decreases with sequence length.
Evo 2 and NTv2 are the most vulnerable; DNABERT-2 shows greater resilience.
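The contrast between per-token and mean-pooled embeddings can be illustrated with a toy sketch. It is not the paper's attack: the paper feeds embeddings from real foundation models to a decoder, whereas this example assumes a hypothetical context-free lookup table (`make_token_table`) and uses a nearest-neighbour search in place of a decoder. All names, the dimension `DIM`, and the example sequence are illustrative assumptions. Under those assumptions, per-token vectors identify each nucleotide directly, while a mean-pooled vector collapses the sequence into one average that cannot distinguish reorderings.

```python
import random

ALPHABET = "ACGT"
DIM = 8  # assumed toy embedding dimension, not taken from any real model


def make_token_table(seed=0):
    """Hypothetical context-free embedding: one random vector per nucleotide."""
    rng = random.Random(seed)
    return {nt: [rng.gauss(0.0, 1.0) for _ in range(DIM)] for nt in ALPHABET}


def embed_per_token(seq, table):
    """Return one vector per position (the per-token setting)."""
    return [table[nt] for nt in seq]


def mean_pool(token_vectors):
    """Average the per-token vectors into a single sequence-level vector."""
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(DIM)]


def invert_per_token(token_vectors, table):
    """Toy stand-in for a decoder: pick the nearest nucleotide at each position."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return "".join(
        min(ALPHABET, key=lambda nt: sq_dist(vec, table[nt]))
        for vec in token_vectors
    )


table = make_token_table()
seq = "ACGTTGCA"

# Per-token vectors keep position-level information, so inversion is exact here.
recovered = invert_per_token(embed_per_token(seq, table), table)

# Mean pooling of context-free vectors ignores order: a reordered sequence
# with the same nucleotide counts yields (numerically) the same pooled vector.
shuffled = "".join(sorted(seq))
pooled_a = mean_pool(embed_per_token(seq, table))
pooled_b = mean_pool(embed_per_token(shuffled, table))
max_gap = max(abs(a - b) for a, b in zip(pooled_a, pooled_b))
```

Real DNA foundation models produce contextual embeddings, so their mean-pooled vectors retain some order information; the sketch only shows why a single averaged vector carries less recoverable detail than one vector per position.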
Abstract
DNA foundation models have become transformative tools in bioinformatics and healthcare applications. Trained on vast genomic datasets, these models can be used to generate sequence embeddings, dense vector representations that capture complex genomic information. These embeddings are increasingly being shared via Embeddings-as-a-Service (EaaS) frameworks to facilitate downstream tasks, while supposedly protecting the privacy of the underlying raw sequences. However, as this practice becomes more prevalent, the security of these representations is being called into question. This study evaluates the resilience of DNA foundation models to model inversion attacks, whereby adversaries attempt to reconstruct sensitive training data from model outputs. In our study, the model's output for reconstructing the DNA sequence is a zero-shot embedding, which is then fed to a decoder. We evaluated…
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Genome Rearrangement Algorithms · Genomics and Rare Diseases
