HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling

Hexiong Yang; Mingrui Chen; Huaibo Huang; Junxian Duan; Jie Cao; Zhen Zhou; Ran He

arXiv:2505.20836 · cs.LG · May 28, 2025

TL;DR

The paper introduces a Hybrid Architecture Distillation (HAD) method that uses distillation and reconstruction tasks to pre-train genomic sequence models more efficiently, surpassing larger teacher models in performance.

Contribution

The novel HAD approach combines distillation and reconstruction for genomic modeling, enabling smaller models to outperform larger teachers.

Findings

1. HAD achieves state-of-the-art results on genomic benchmarks.

2. HAD surpasses the performance of a 500x larger teacher model on some tasks.

3. Visualization shows HAD captures intrinsic genomic sequence patterns.

Abstract

Inspired by the great success of Masked Language Modeling (MLM) in the natural language domain, the paradigm of self-supervised pre-training and fine-tuning has also achieved remarkable progress in the field of DNA sequence modeling. However, previous methods often relied on massive pre-training data or large-scale base models with huge parameters, imposing a significant computational burden. To address this, many works attempted to use more compact models to achieve similar outcomes but still fell short by a considerable margin. In this work, we propose a Hybrid Architecture Distillation (HAD) approach, leveraging both distillation and reconstruction tasks for more efficient and effective pre-training. Specifically, we employ the NTv2-500M as the teacher model and devise a grouping masking strategy to align the feature embeddings of visible tokens while concurrently reconstructing the…
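The abstract describes a pre-training objective with two parts: aligning student and teacher feature embeddings on visible tokens, and a reconstruction task. The following is a minimal sketch of how such a combined loss could be computed, not the authors' implementation. Everything specific here is an assumption: the function name `hybrid_loss`, the weight `alpha`, the use of mean squared error for alignment, cross-entropy over the masked positions for reconstruction, and the hand-picked mask (the paper's grouping masking strategy is not specified in this excerpt).

```python
import math
import random


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_entropy(logits, target):
    """Cross-entropy of one target index under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def hybrid_loss(student_feats, teacher_feats, recon_logits, tokens, masked, alpha=1.0):
    """Hypothetical combined objective (assumed form, not from the paper).

    distill: feature alignment averaged over visible positions.
    recon:   token reconstruction averaged over masked positions.
    alpha:   assumed weight balancing the two terms.
    """
    visible = [i for i in range(len(tokens)) if i not in masked]
    distill = sum(mse(student_feats[i], teacher_feats[i]) for i in visible) / max(len(visible), 1)
    recon = sum(cross_entropy(recon_logits[i], tokens[i]) for i in masked) / max(len(masked), 1)
    return distill + alpha * recon, distill, recon


# Toy example: 6 nucleotides encoded as indices (A=0, C=1, G=2, T=3).
random.seed(0)
tokens = [0, 1, 2, 3, 0, 1]
masked = {1, 4}  # assumed mask; stands in for the grouping masking strategy
teacher = [[random.random() for _ in range(4)] for _ in tokens]
student = [list(v) for v in teacher]  # student copies teacher exactly
logits = [[0.0] * 4 for _ in tokens]  # uniform prediction over 4 bases

total, distill, recon = hybrid_loss(student, teacher, logits, tokens, masked)
print(total, distill, recon)
```

In this toy setup the alignment term is zero because the student features equal the teacher features, and the reconstruction term equals ln 4 because the predictions are uniform over four bases. A real pipeline would take the teacher features from a frozen model such as NTv2-500M and the student features from the compact model being trained.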

Taxonomy

Topics: Gene expression and cancer classification · Genomics and Phylogenetic Studies · Algorithms and Data Compression

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Balanced Selection · Label Smoothing · Multi-Head Attention · Layer Normalization