HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling
Hexiong Yang, Mingrui Chen, Huaibo Huang, Junxian Duan, Jie Cao, Zhen Zhou, Ran He

TL;DR
The paper introduces a Hybrid Architecture Distillation (HAD) method that uses distillation and reconstruction tasks to pre-train genomic sequence models more efficiently, surpassing larger teacher models in performance.
Contribution
The novel HAD approach combines distillation and reconstruction for genomic modeling, enabling smaller models to outperform larger teachers.
Findings
HAD achieves state-of-the-art results on genomic benchmarks.
HAD surpasses the performance of a 500x larger teacher model on some tasks.
Visualization shows HAD captures intrinsic genomic sequence patterns.
Abstract
Inspired by the great success of Masked Language Modeling (MLM) in the natural language domain, the paradigm of self-supervised pre-training and fine-tuning has also achieved remarkable progress in the field of DNA sequence modeling. However, previous methods often relied on massive pre-training data or large-scale base models with huge parameters, imposing a significant computational burden. To address this, many works attempted to use more compact models to achieve similar outcomes but still fell short by a considerable margin. In this work, we propose a Hybrid Architecture Distillation (HAD) approach, leveraging both distillation and reconstruction tasks for more efficient and effective pre-training. Specifically, we employ the NTv2-500M as the teacher model and devise a grouping masking strategy to align the feature embeddings of visible tokens while concurrently reconstructing the…
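To make the combined objective concrete, here is a minimal sketch of a distillation-plus-reconstruction loss over a group-masked sequence. It is an illustration under stated assumptions, not the authors' implementation. The abstract is truncated before it names the reconstruction target, so the sketch assumes the masked tokens are reconstructed. It also assumes mean-squared error for feature alignment and contiguous fixed-size groups for masking. The names `group_mask`, `had_style_loss`, `group_size`, `mask_ratio`, and `alpha` are hypothetical.

```python
import math
import random


def group_mask(seq_len, group_size, mask_ratio, rng):
    """Mask whole contiguous groups of tokens (assumed grouping scheme)."""
    n_groups = math.ceil(seq_len / group_size)
    # Keep at least one group visible whenever there is more than one group.
    n_masked = min(n_groups - 1, max(1, round(n_groups * mask_ratio)))
    masked_groups = set(rng.sample(range(n_groups), n_masked))
    return [(i // group_size) in masked_groups for i in range(seq_len)]


def mse(a, b):
    """Mean-squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def had_style_loss(student_feats, teacher_feats, recon_logits, tokens, mask,
                   alpha=1.0):
    """Distill on visible positions, reconstruct on masked positions.

    Assumption: student and teacher features share one dimension; in
    practice a projection layer would usually bridge differing widths.
    """
    visible = [i for i, m in enumerate(mask) if not m]
    hidden = [i for i, m in enumerate(mask) if m]
    distill = (sum(mse(student_feats[i], teacher_feats[i]) for i in visible)
               / len(visible)) if visible else 0.0
    recon = (sum(cross_entropy(recon_logits[i], tokens[i]) for i in hidden)
             / len(hidden)) if hidden else 0.0
    return distill + alpha * recon, distill, recon


# Toy usage: 12 nucleotide tokens (vocabulary of 4), groups of 3, half masked.
rng = random.Random(0)
mask = group_mask(seq_len=12, group_size=3, mask_ratio=0.5, rng=rng)
tokens = [0, 1, 2, 3] * 3
feats = [[0.1, 0.2] for _ in range(12)]      # stand-in feature vectors
logits = [[0.0] * 4 for _ in range(12)]      # uniform predictions
total, distill, recon = had_style_loss(feats, feats, logits, tokens, mask)
```

In this toy call the student features equal the teacher features, so the distillation term is zero and the total reduces to the reconstruction term. The weight `alpha` balances the two terms and is a free choice in this sketch.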
Taxonomy
Topics: Gene expression and cancer classification · Genomics and Phylogenetic Studies · Algorithms and Data Compression
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Balanced Selection · Label Smoothing · Multi-Head Attention · Layer Normalization
