Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction
Xiaojie Xia, Huigang Zhang, Chaoliang Zhong, Jun Sun, Yusuke Oishi

TL;DR
This paper presents a method to efficiently create task-specific hybrid attention models by transferring weights from pretrained full-attention models to linear attention counterparts and iteratively replacing layers, avoiding costly re-training.
Contribution
It introduces a novel weight transfer and greedy layer replacement approach for constructing hybrid attention models from pretrained transformers.
Findings
Builds efficient hybrid models without training them from scratch.
Maintains high performance with linear attention layers.
Applicable to various pretrained backbones and tasks.
Abstract
Transformer architectures deliver state-of-the-art accuracy via dense full attention, but their quadratic time and memory complexity with respect to sequence length limits practical deployment. Linear attention mechanisms offer linear or near-linear scaling yet often incur performance degradation. Hybrid models that integrate full and linear attention layers promise a balance between efficiency and expressiveness, but face two major challenges: training such hybrid models from scratch is computationally expensive, and manually designing the optimal placement of attention types is highly nontrivial. We address both issues by first transferring weights from the pretrained full-attention modules to their linear attention counterparts through blockwise local distillation, and second, introducing a greedy layer replacement strategy that iteratively substitutes full attention blocks with linear…
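The greedy replacement step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated before it states the acceptance or stopping rule, so the sketch assumes a simple one (stop once the best available swap would lower the task score by more than a tolerance). The names greedy_replace, evaluate, and max_drop, and the toy per-layer costs, are all hypothetical.

```python
from typing import Callable, List, Sequence


def greedy_replace(
    num_layers: int,
    evaluate: Callable[[Sequence[str]], float],
    max_drop: float,
) -> List[str]:
    """Greedily swap full-attention layers for linear-attention ones.

    `evaluate(layout)` is a hypothetical callback returning a task score
    (higher is better) for a layout given as "full"/"linear" tags, one per
    layer. The stopping rule (score drop > max_drop) is an assumption.
    """
    layout = ["full"] * num_layers
    baseline = evaluate(layout)
    while True:
        best_idx, best_score = None, float("-inf")
        # Try converting each remaining full-attention layer on its own.
        for i, kind in enumerate(layout):
            if kind != "full":
                continue
            trial = layout.copy()
            trial[i] = "linear"
            score = evaluate(trial)
            if score > best_score:
                best_idx, best_score = i, score
        # Stop when nothing is left to swap or the best swap costs too much.
        if best_idx is None or baseline - best_score > max_drop:
            return layout
        layout[best_idx] = "linear"


# Toy stand-in for a validation run: each layer has a made-up score penalty
# when it is converted to linear attention.
TOY_COST = [0.5, 0.1, 2.0, 0.2]


def toy_evaluate(layout: Sequence[str]) -> float:
    penalty = sum(c for c, kind in zip(TOY_COST, layout) if kind == "linear")
    return 100.0 - penalty


layout_found = greedy_replace(len(TOY_COST), toy_evaluate, max_drop=1.0)
print(layout_found)  # ['linear', 'linear', 'full', 'linear']
```

In this toy run the three cheapest layers are converted and the layer with the largest penalty keeps full attention. In a real setting, evaluate would score the model on task validation data with the distilled linear blocks plugged in at the chosen positions.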
Taxonomy
Topics: Advanced Neural Network Applications · EEG and Brain-Computer Interfaces · Advanced Memory and Neural Computing
