Distilling Genomic Models for Efficient mRNA Representation Learning via Embedding Matching
Rasched Haidari, Sam Martin, Maxime Allard

TL;DR
This paper introduces a method for distilling a large genomic foundation model into a much smaller model specialized for mRNA sequences, cutting model size 200-fold while staying competitive on mRNA tasks.
Contribution
The authors develop an embedding-level distillation framework that reduces model size 200-fold while achieving state-of-the-art performance among comparably sized models on mRNA tasks.
Findings
On mRNA-bench, the distilled model achieves state-of-the-art results among models of comparable size and competes with larger architectures.
Embedding-based distillation is more stable and effective than logit-based methods.
The approach enables scalable and efficient mRNA sequence modeling in genomics.
Abstract
Large genomic foundation models have recently achieved remarkable results and in-vivo translation capabilities. However, these models quickly grow to more than a few billion parameters and are expensive to run when compute is limited. To overcome this challenge, we present a distillation framework for transferring mRNA representations from a state-of-the-art genomic foundation model into a much smaller model specialized for mRNA sequences, reducing the size 200-fold. Embedding-level distillation worked better than logit-based methods, which we found unstable. Benchmarking on mRNA-bench demonstrates that the distilled model achieves state-of-the-art performance among models of comparable size and competes with larger architectures for mRNA-related tasks. Our results highlight embedding-based distillation of mRNA sequences as an effective training strategy for biological foundation…
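
To make the idea of embedding-level distillation concrete, here is a minimal sketch of an embedding-matching loss in plain Python. The abstract does not specify the pooling, the projection, or the distance used, so everything here is an assumption for illustration: mean pooling over tokens, a linear projection from the student width to the teacher width, and mean squared error as the matching objective. The function names (`mean_pool`, `project`, `embedding_matching_loss`) are hypothetical and do not come from the paper.

```python
def mean_pool(token_embeddings):
    """Average per-token vectors into a single sequence-level vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def project(vec, weight):
    """Apply a linear map; weight has shape [teacher_dim][student_dim].

    Assumption: the student is narrower than the teacher, so its
    embedding is projected up before the two are compared.
    """
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def embedding_matching_loss(student_tokens, teacher_tokens, weight):
    """Mean squared error between the projected student embedding
    and the (frozen) teacher embedding of the same sequence."""
    student_vec = project(mean_pool(student_tokens), weight)
    teacher_vec = mean_pool(teacher_tokens)
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / len(teacher_vec)


# Toy example: a 2-dimensional student and a 3-dimensional teacher.
student_tokens = [[1.0, 3.0], [3.0, 5.0]]   # two tokens, width 2
teacher_tokens = [[2.0, 4.0, 9.0]]          # one token, width 3
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
loss = embedding_matching_loss(student_tokens, teacher_tokens, weight)
```

In a real training loop, the student weights and the projection would be updated by gradient descent on this loss while the teacher stays fixed. A logit-based alternative would instead match the teacher's output distribution over tokens; the abstract reports that this variant was unstable in the authors' experiments.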
