Sparse Modular Activation for Efficient Sequence Modeling
Liliang Ren, Yang Liu, Shuohang Wang, Yichong Xu, Chenguang Zhu, ChengXiang Zhai

TL;DR
This paper introduces Sparse Modular Activation (SMA), a dynamic sparsity mechanism that makes sequence models more efficient by activating sub-modules only for the sequence elements that need them, yielding linear inference complexity and state-of-the-art results across a range of tasks.
Contribution
The paper proposes SMA, a novel differentiable mechanism for dynamic sparse activation of sub-modules, and designs SeqBoat, a new architecture leveraging SMA for efficient sequence modeling.
Findings
SeqBoat achieves linear inference complexity with state-of-the-art performance.
SMA reduces computation and memory usage during training and inference.
Learned sparse activation patterns reveal task-specific attention requirements.
Abstract
Recent hybrid models combining Linear State Space Models (SSMs) with self-attention mechanisms have demonstrated impressive results across a range of sequence modeling tasks. However, current approaches apply attention modules statically and uniformly to all elements in the input sequences, leading to sub-optimal quality-efficiency trade-offs. To address this limitation, we introduce Sparse Modular Activation (SMA), a general mechanism enabling neural networks to sparsely and dynamically activate sub-modules for sequence elements in a differentiable manner. Through allowing each element to skip non-activated sub-modules, SMA reduces computation and memory consumption of neural networks at both training and inference stages. To validate the effectiveness of SMA on sequence modeling, we design a novel neural architecture, SeqBoat, which employs SMA to sparsely activate a Gated Attention…
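The skipping idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `sparse_activate`, the sigmoid gate, the fixed threshold, and the residual update are all assumptions chosen for illustration, and the hard threshold used here is not differentiable, unlike the mechanism the paper describes. The sketch only shows that an expensive sub-module runs on the activated elements while the remaining elements bypass it.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def sparse_activate(
    xs: Sequence[Vector],
    gate_w: Vector,
    module_fn: Callable[[List[Vector]], List[Vector]],
    threshold: float = 0.5,
) -> Tuple[List[Vector], List[int]]:
    """Run module_fn only on sequence elements whose gate score exceeds threshold.

    Hypothetical helper: gate_w, threshold, and the residual update are
    illustrative assumptions, not details taken from the paper.
    """
    # Per-element gate score: sigmoid of a linear projection of the element.
    scores = [
        1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(gate_w, x))))
        for x in xs
    ]
    # Indices of the elements that activate the sub-module.
    active_idx = [i for i, s in enumerate(scores) if s > threshold]

    # Skipped elements pass through unchanged.
    out = [list(x) for x in xs]
    if active_idx:
        # Gather the activated elements into a smaller batch, process it,
        # then scatter the results back with a residual connection.
        processed = module_fn([list(xs[i]) for i in active_idx])
        for i, y in zip(active_idx, processed):
            out[i] = [a + b for a, b in zip(xs[i], y)]
    return out, active_idx


# Usage: a stand-in "expensive" module that scales its input by 10.
seq = [[2.0, 0.0], [-2.0, 0.0], [3.0, 0.0]]
scaled = lambda batch: [[10.0 * v for v in row] for row in batch]
out, active = sparse_activate(seq, [1.0, 0.0], scaled)
print(active)  # [0, 2]
print(out)     # [[22.0, 0.0], [-2.0, 0.0], [33.0, 0.0]]
```

In this sketch the sub-module sees a batch of two elements instead of three, so its cost scales with the number of activated elements, not with the full sequence length.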
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Domain Adaptation and Few-Shot Learning
Methods: Sparse Modular Activation (SMA)
