Quantifying Memorization and Privacy Risks in Genomic Language Models

Alexander Nemecek; Wenbiao Li; Xiaoqian Jiang; Jaideep Vaidya; Erman Ayday

arXiv:2603.08913 · cs.LG · March 11, 2026

TL;DR

This paper develops a comprehensive framework to quantify memorization and privacy risks in genomic language models, revealing that these models can memorize sensitive sequences and emphasizing the importance of multi-method privacy auditing.

Contribution

The paper introduces a multi-vector evaluation framework that combines perplexity, canary extraction, and membership inference to assess memorization risks in genomic language models.
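To make the three signals concrete, here is a minimal sketch of how perplexity, a threshold-based membership test, and a canary exposure score could be computed together. It is not the authors' implementation: `KmerModel` is a hypothetical toy order-k Markov model standing in for a real genomic language model, the sequences are random synthetic data, the threshold of 2.5 is hand-picked for this toy setup, and the exposure formula (log2 of the candidate count minus log2 of the canary's rank) is assumed here as one common way to score canaries.

```python
import math
import random
from collections import defaultdict

ALPHABET = "ACGT"


class KmerModel:
    """Toy order-k Markov model over nucleotides (a stand-in for a real GLM)."""

    def __init__(self, k=6, alpha=0.1):
        self.k = k
        self.alpha = alpha  # additive smoothing constant
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        # Count which base follows each length-k context in the training data.
        for seq in sequences:
            for i in range(self.k, len(seq)):
                self.counts[seq[i - self.k:i]][seq[i]] += 1
        return self

    def log_prob(self, context, base):
        ctx = self.counts.get(context, {})
        total = sum(ctx.values())
        smoothed = ctx.get(base, 0) + self.alpha
        return math.log(smoothed / (total + self.alpha * len(ALPHABET)))

    def perplexity(self, seq):
        # exp of the mean negative log-likelihood over predicted positions.
        positions = range(self.k, len(seq))
        nll = -sum(self.log_prob(seq[i - self.k:i], seq[i]) for i in positions)
        return math.exp(nll / len(positions))


def random_seq(rng, length):
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def infer_membership(model, seq, threshold):
    """Guess 'training member' when perplexity falls below the threshold."""
    return model.perplexity(seq) < threshold


def exposure(model, canary, candidates):
    """log2(candidate space size) - log2(rank of the canary by perplexity)."""
    canary_ppl = model.perplexity(canary)
    rank = 1 + sum(model.perplexity(c) < canary_ppl for c in candidates)
    return math.log2(len(candidates) + 1) - math.log2(rank)


def run_demo(seed=0):
    rng = random.Random(seed)
    members = [random_seq(rng, 60) for _ in range(20)]
    non_members = [random_seq(rng, 60) for _ in range(20)]
    canary = random_seq(rng, 60)  # synthetic sequence planted in training data
    model = KmerModel(k=6).fit(members + [canary])

    threshold = 2.5  # hand-picked for this toy setup, not a tuned value
    correct = sum(infer_membership(model, s, threshold) for s in members)
    correct += sum(not infer_membership(model, s, threshold) for s in non_members)

    candidates = [random_seq(rng, 60) for _ in range(255)]
    return {
        "member_ppl": sum(map(model.perplexity, members)) / len(members),
        "non_member_ppl": sum(map(model.perplexity, non_members)) / len(non_members),
        "mia_accuracy": correct / (len(members) + len(non_members)),
        "canary_exposure": exposure(model, canary, candidates),
    }


if __name__ == "__main__":
    print(run_demo())
```

Because the toy model overfits its small training set, training sequences receive much lower perplexity than held-out ones, which is the gap that both the membership test and the canary ranking exploit. With a real model the gap would typically be far smaller, and the threshold would need calibration on reference data.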

Findings

1. GLMs show measurable memorization of training data.
2. Memorization varies with model architecture and training regime.
3. Multi-vector assessment is essential for accurate privacy risk evaluation.

Abstract

Genomic language models (GLMs) have emerged as powerful tools for learning representations of DNA sequences, enabling advances in variant prediction, regulatory element identification, and cross-task transfer learning. However, as these models are increasingly trained or fine-tuned on sensitive genomic cohorts, they risk memorizing specific sequences from their training data, raising serious concerns around privacy, data leakage, and regulatory compliance. Despite growing awareness of memorization risks in general-purpose language models, little systematic evaluation exists for these risks in the genomic domain, where data exhibit unique properties such as a fixed nucleotide alphabet, strong biological structure, and individual identifiability. We present a comprehensive, multi-vector privacy evaluation framework designed to quantify memorization risks in GLMs. Our approach integrates…

Taxonomy

Topics: Genomics and Rare Diseases · Epigenetics and DNA Methylation · Genomics and Chromatin Dynamics