Mixture of Hidden-Dimensions Transformer
Yilong Chen, Junyuan Shang, Zhengyu Zhang, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang

TL;DR
This paper introduces MoHD, a sparse, dynamic architecture for Transformers that efficiently scales hidden dimensions by activating relevant sub-dimensions, improving performance and parameter efficiency across NLP tasks.
Contribution
MoHD is a novel sparse conditional activation architecture that leverages hidden dimension sparsity for efficient scaling and improved performance in Transformer models.
Findings
MoHD achieves 1.7% higher performance with 50% fewer parameters.
MoHD surpasses vanilla Transformers on 10 NLP tasks.
MoHD expands hidden dimensions efficiently, with only a minimal increase in computation.
Abstract
Transformer models encounter challenges in scaling hidden dimensions efficiently, as uniformly increasing them inflates computational and memory costs while failing to emphasize the most relevant features for each token. For further understanding, we study hidden dimension sparsity and observe that trained Transformers utilize only a small fraction of token dimensions, revealing an "activation flow" pattern. Notably, there are shared sub-dimensions with sustained activation across multiple consecutive tokens and specialized sub-dimensions uniquely activated for each token. To better model token-relevant sub-dimensions, we propose MoHD (Mixture of Hidden Dimensions), a sparse conditional activation architecture. Particularly, MoHD employs shared sub-dimensions for common token features and a routing mechanism to dynamically activate specialized sub-dimensions. To mitigate potential…
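The abstract describes two kinds of sub-dimensions: shared ones that stay active for every token, and specialized ones that a router activates per token. The sketch below illustrates that idea in plain Python for a single token. It is not the authors' implementation: the function name `route_sub_dimensions`, the linear router, the top-k selection rule, and the parameters `n_shared`, `group_size`, and `top_k` are all assumptions made for illustration, and the part of the method that the truncated abstract does not show is not modeled here.

```python
def route_sub_dimensions(hidden, router_weights, n_shared, group_size, top_k):
    """Mask one token's hidden vector down to its active sub-dimension groups.

    hidden:         list of floats, length = n_groups * group_size
    router_weights: one weight vector (same length as hidden) per specialized
                    group; a hypothetical linear router, assumed for this sketch
    n_shared:       number of leading groups that are always active
    top_k:          number of specialized groups activated per token (assumed)
    """
    n_groups = len(hidden) // group_size
    n_special = n_groups - n_shared
    assert len(router_weights) == n_special

    # Router logit for each specialized group: dot product with the token.
    logits = [
        sum(w * h for w, h in zip(weights, hidden))
        for weights in router_weights
    ]

    # Keep the top_k specialized groups with the highest logits.
    ranked = sorted(range(n_special), key=lambda g: logits[g], reverse=True)
    chosen = sorted(ranked[:top_k])

    # Shared groups come first and are always active.
    active = list(range(n_shared)) + [n_shared + g for g in chosen]
    active_set = set(active)

    # Zero out every dimension whose group was not activated.
    masked = [
        h if (i // group_size) in active_set else 0.0
        for i, h in enumerate(hidden)
    ]
    return masked, active


# Usage: 8 hidden dimensions in 4 groups of 2; group 0 is shared,
# and the router picks 1 of the 3 specialized groups.
hidden = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
router_weights = [
    [0.0] * 8,                 # specialized group 1: logit 0
    [1.0] + [0.0] * 7,         # specialized group 2: logit 1
    [-1.0] + [0.0] * 7,        # specialized group 3: logit -1
]
masked, active = route_sub_dimensions(hidden, router_weights,
                                      n_shared=1, group_size=2, top_k=1)
print(active)  # [0, 2]
print(masked)  # [1.0, 2.0, 0.0, 0.0, 5.0, 6.0, 0.0, 0.0]
```

In this toy setup only half of the hidden dimensions are non-zero for the token, so downstream computation could in principle skip the inactive groups. Hard masking is used here only to make the activation pattern visible.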
Taxonomy
Topics: Sensor Technology and Measurement Systems
