Embedding Is (Almost) All You Need: Retrieval-Augmented Inference for Generalizable Genomic Prediction Tasks

Nirjhor Datta; Swakkhar Shatabda; M Sohel Rahman

arXiv:2508.04757·q-bio.GN·August 8, 2025

TL;DR

This paper demonstrates that embedding-based methods using pre-trained DNA language models can achieve competitive or superior performance to fine-tuning in genomic prediction tasks, with significantly improved efficiency and generalizability.

Contribution

It introduces a simple embedding-based pipeline as a more efficient and generalizable alternative to fine-tuning for genomic prediction tasks.

Findings

01

Embedding-based methods often outperform fine-tuning when the evaluation data follow a different distribution from the training data.

02

Embedding pipelines reduce inference time by 10x to 20x.

03

Embedding approaches lower carbon emissions significantly.

Abstract

Large pre-trained DNA language models such as DNABERT-2, Nucleotide Transformer, and HyenaDNA have demonstrated strong performance on various genomic benchmarks. However, most applications rely on expensive fine-tuning, which works best when the training and test data share a similar distribution. In this work, we investigate whether task-specific fine-tuning is always necessary. We show that simple embedding-based pipelines that extract fixed representations from these models and feed them into lightweight classifiers can achieve competitive performance. In evaluation settings with different data distributions, embedding-based methods often outperform fine-tuning while reducing inference time by 10x to 20x. Our results suggest that embedding extraction is not only a strong baseline but also a more generalizable and efficient alternative to fine-tuning, especially for deployment in…
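
The pipeline shape described in the abstract (a frozen feature extractor followed by a lightweight classifier) can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration and not the authors' method: `embed` is a hypothetical stand-in that uses normalized 3-mer frequencies where the paper would take fixed representations from a pre-trained DNA language model, the nearest-centroid classifier is just one possible choice of lightweight classifier, and the sequences and labels are toy data.

```python
from collections import Counter
from itertools import product

K = 3  # k-mer size for the stand-in embedder (hypothetical choice)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def embed(seq):
    """Stand-in for a frozen encoder: normalized k-mer frequency vector.

    In the setting the abstract describes, this step would instead return a
    fixed representation taken from a pre-trained DNA language model.
    """
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts[k] for k in KMERS) or 1  # avoid division by zero
    return [counts[k] / total for k in KMERS]


def fit_centroids(seqs, labels):
    """Lightweight classifier: store one mean embedding per class."""
    sums = {}
    n_per_class = Counter(labels)
    for seq, label in zip(seqs, labels):
        vec = embed(seq)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
    return {y: [x / n_per_class[y] for x in acc] for y, acc in sums.items()}


def predict(centroids, seq):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    vec = embed(seq)

    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))

    return min(centroids, key=lambda y: sq_dist(centroids[y]))


# Toy data (made up for illustration): the encoder is never updated,
# only the per-class centroids are computed from the training set.
train_seqs = ["GCGCGCGCGGCC", "CCGGCGCGCGCG", "ATATATTTAAAT", "TTAATATATAAA"]
train_labels = ["gc_rich", "gc_rich", "at_rich", "at_rich"]
centroids = fit_centroids(train_seqs, train_labels)
```

Calling `predict(centroids, "GCGGCCGCGC")` classifies a new sequence without touching the embedder. Because the expensive component stays frozen, adapting to a new task only requires refitting the small classifier, which is the property the abstract contrasts with fine-tuning.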
