Embedding Is (Almost) All You Need: Retrieval-Augmented Inference for Generalizable Genomic Prediction Tasks
Nirjhor Datta, Swakkhar Shatabda, M Sohel Rahman

TL;DR
This paper demonstrates that embedding-based methods built on pre-trained DNA language models can match or outperform fine-tuning on genomic prediction tasks, while being markedly more efficient and generalizing better across data distributions.
Contribution
It introduces a simple embedding-based pipeline as a more efficient and generalizable alternative to fine-tuning for genomic prediction tasks.
Findings
Embedding-based methods often outperform fine-tuning when the evaluation data follow a different distribution from the training data.
Embedding pipelines reduce inference time by 10x to 20x relative to fine-tuning.
Embedding approaches lower carbon emissions significantly.
Abstract
Large pre-trained DNA language models such as DNABERT-2, Nucleotide Transformer, and HyenaDNA have demonstrated strong performance on various genomic benchmarks. However, most applications rely on expensive fine-tuning, which works best when the training and test data share a similar distribution. In this work, we investigate whether task-specific fine-tuning is always necessary. We show that simple embedding-based pipelines that extract fixed representations from these models and feed them into lightweight classifiers can achieve competitive performance. In evaluation settings with different data distributions, embedding-based methods often outperform fine-tuning while reducing inference time by 10x to 20x. Our results suggest that embedding extraction is not only a strong baseline but also a more generalizable and efficient alternative to fine-tuning, especially for deployment in…
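The pipeline shape described in the abstract (a frozen encoder producing fixed representations, followed by a lightweight classifier) can be illustrated with a minimal, standard-library-only sketch. Everything concrete here is an assumption made for illustration and not the authors' setup: `embed` is a hypothetical stand-in that uses normalized 3-mer frequencies instead of a real model such as DNABERT-2, the classifier is a simple nearest-centroid rule, and the sequences are synthetic GC-rich versus AT-rich strings.

```python
import math
import random
from itertools import product

K = 3  # k-mer size for the stand-in encoder (illustrative choice)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def embed(seq):
    """Hypothetical frozen encoder: L2-normalized k-mer frequency vector.

    In a real pipeline this would be replaced by embeddings extracted
    from a pre-trained DNA language model whose weights are not updated.
    """
    vec = [0.0] * len(KMERS)
    for i in range(len(seq) - K + 1):
        j = INDEX.get(seq[i:i + K])
        if j is not None:  # skip k-mers containing non-ACGT symbols
            vec[j] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def fit_centroids(embeddings, labels):
    """Lightweight classifier (assumed here): one mean embedding per class."""
    sums, counts = {}, {}
    for emb, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(emb))
        for i, v in enumerate(emb):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, embedding):
    """Return the class whose centroid has the largest dot product with the input."""
    return max(
        centroids,
        key=lambda label: sum(a * b for a, b in zip(centroids[label], embedding)),
    )


def random_seq(rng, length, gc):
    """Synthetic DNA with a target GC fraction (toy data, not a benchmark)."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


rng = random.Random(0)
train = [(random_seq(rng, 200, 0.8), 1) for _ in range(20)]
train += [(random_seq(rng, 200, 0.2), 0) for _ in range(20)]
test = [(random_seq(rng, 200, 0.8), 1) for _ in range(10)]
test += [(random_seq(rng, 200, 0.2), 0) for _ in range(10)]

# Embeddings are computed once; only the small classifier is "trained".
centroids = fit_centroids([embed(s) for s, _ in train], [y for _, y in train])
accuracy = sum(predict(centroids, embed(s)) == y for s, y in test) / len(test)
print(f"toy accuracy: {accuracy:.2f}")
```

The point of the sketch is the division of labor: the encoder is never updated, so embeddings can be computed once and cached, and only the small classifier depends on the task. Swapping the stand-in `embed` for real model embeddings would leave the rest of the code unchanged.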
