From Sparsity to Simplicity: Enabling Simpler Sequential Replacements via Sparse Attention Distillation
Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang

TL;DR
This paper explores replacing self-attention in transformers with simpler sequential modules by leveraging attention sparsity patterns, yielding models with fewer parameters and lower latency.
Contribution
It introduces a layer-wise distillation framework that uses attention sparsity to effectively replace attention with simpler modules in pretrained vision transformers.
Findings
Sparser attention layers lead to smaller accuracy drops upon replacement.
Attention sparsity-guided distillation reduces the student-teacher gap.
Explicit attention sparsity improves efficiency in attention replacement.
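To make "sparser attention layers" concrete, here is a minimal sketch of one way to score the sparsity of an attention map. The paper's actual sparsity measure is not given in this summary, so the threshold-based metric, the function name `attention_sparsity`, and the random inputs below are assumptions for illustration only.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Numerically stable softmax over the key axis.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_sparsity(attn, threshold=0.01):
    # Hypothetical metric: fraction of attention weights below `threshold`.
    return float((attn < threshold).mean())

rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
q = rng.normal(size=(n_tokens, dim))
k = rng.normal(size=(n_tokens, dim))
logits = q @ k.T / np.sqrt(dim)

diffuse = softmax(logits)        # attention spread over many tokens
peaked = softmax(8.0 * logits)   # sharper logits concentrate mass on few tokens

sparsity_diffuse = attention_sparsity(diffuse)
sparsity_peaked = attention_sparsity(peaked)
print(sparsity_diffuse, sparsity_peaked)
```

Under this assumed metric, a layer whose attention rows concentrate on a few tokens scores closer to 1, which is the kind of layer the findings above describe as easier to replace.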
Abstract
Self-attention serves as the core foundation of large-scale transformer pretraining, but its quadratic token interaction cost makes inference expensive. Replacing attention with simpler sequential modules is appealing, yet naive substitution is often lossy, especially at larger scales. This paper revisits attention replacement through the lens of sparsity. Based on the observation of diverse sparsity patterns across transformer layers, we posit that pretrained transformers decompose complex inter-token dependencies into sequence-to-sequence mappings of varying complexity, where some layer functionalities can be approximated and replaced with much simpler sequential modules without loss. We evaluate this premise using a plug-and-play layer-wise distillation framework to approximate and replace attention functionalities in pretrained vision transformer models.…
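The layer-wise distillation idea in the abstract can be sketched as fitting a simpler module to reproduce a frozen attention layer's output. The sketch below is not the paper's framework: the student is an assumed input-independent token-mixing matrix (a stand-in for a "simpler sequential module"), it is fit in closed form by least squares rather than by gradient-based training, and all weights and data are random placeholders.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Single-head teacher attention for one sequence of shape (tokens, dim).
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

rng = np.random.default_rng(0)
n_seq, n_tokens, dim = 64, 16, 8
wq, wk, wv = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

# Teacher outputs on a batch of placeholder inputs (the "frozen" layer).
inputs = rng.normal(size=(n_seq, n_tokens, dim))
teacher_out = np.stack([self_attention(x, wq, wk, wv) for x in inputs])

# Student: one fixed (tokens x tokens) mixing matrix applied to the values,
# so that mixer @ V approximates attn(x) @ V for every sequence.
values = inputs @ wv
v_cat = values.transpose(1, 0, 2).reshape(n_tokens, -1)       # (tokens, seq*dim)
o_cat = teacher_out.transpose(1, 0, 2).reshape(n_tokens, -1)  # (tokens, seq*dim)

# Solve min_M ||M @ v_cat - o_cat||^2 via the transposed least-squares system.
mixer_t, *_ = np.linalg.lstsq(v_cat.T, o_cat.T, rcond=None)
mixer = mixer_t.T

distill_mse = float(np.mean((mixer @ v_cat - o_cat) ** 2))
baseline_mse = float(np.mean(o_cat ** 2))  # error of an all-zero student
print(distill_mse, baseline_mse)
```

Because the student ignores the input when choosing mixing weights, it can only match the teacher well when the teacher's attention pattern is nearly the same across inputs; this toy setup does not reproduce any result reported in the paper.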
