Mixture of Hidden-Dimensions Transformer

Yilong Chen; Junyuan Shang; Zhengyu Zhang; Jiawei Sheng; Tingwen Liu; Shuohuan Wang; Yu Sun; Hua Wu; Haifeng Wang

arXiv:2412.05644 · cs.CL · December 17, 2024

Open Access

TL;DR

This paper introduces MoHD, a sparse, dynamic architecture for Transformers that efficiently scales hidden dimensions by activating relevant sub-dimensions, improving performance and parameter efficiency across NLP tasks.

Contribution

MoHD is a novel sparse conditional activation architecture that leverages hidden dimension sparsity for efficient scaling and improved performance in Transformer models.

Findings

01. MoHD achieves 1.7% higher performance with 50% fewer activated parameters.

02. MoHD surpasses vanilla Transformers on 10 NLP tasks.

03. MoHD expands hidden dimensions efficiently, with minimal increase in computation.

Abstract

Transformer models encounter challenges in scaling hidden dimensions efficiently, as uniformly increasing them inflates computational and memory costs while failing to emphasize the most relevant features for each token. For further understanding, we study hidden dimension sparsity and observe that trained Transformers utilize only a small fraction of token dimensions, revealing an "activation flow" pattern. Notably, there are shared sub-dimensions with sustained activation across multiple consecutive tokens and specialized sub-dimensions uniquely activated for each token. To better model token-relevant sub-dimensions, we propose MoHD (Mixture of Hidden Dimensions), a sparse conditional activation architecture. Particularly, MoHD employs shared sub-dimensions for common token features and a routing mechanism to dynamically activate specialized sub-dimensions. To mitigate potential…
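The split between always-on shared sub-dimensions and routed specialized sub-dimensions can be illustrated with a small sketch. This is not the authors' implementation: the function names (`mohd_mask`, `apply_mask`), the grouping of the hidden vector into equal-sized sub-dimension groups, the top-k selection rule, and the hand-written router scores are all assumptions made for illustration.

```python
def mohd_mask(router_scores, num_shared, top_k):
    """Build a 0/1 activation mask over sub-dimension groups for one token.

    router_scores: one score per specialized group (hypothetical router output).
    num_shared:    number of shared groups, assumed to be always active.
    top_k:         number of specialized groups to activate (assumed rule).
    """
    # Rank specialized groups by score, highest first; ties go to the lower index.
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: (-router_scores[i], i))
    chosen = set(ranked[:top_k])
    specialized = [1 if i in chosen else 0 for i in range(len(router_scores))]
    # Shared groups come first and are active for every token.
    return [1] * num_shared + specialized


def apply_mask(hidden, mask, group_size):
    """Zero out the hidden entries whose sub-dimension group is inactive."""
    assert len(hidden) == len(mask) * group_size
    return [h * mask[j // group_size] for j, h in enumerate(hidden)]


# Toy token: 8 hidden dimensions split into 4 groups of 2.
# Group 0 is shared; groups 1-3 are specialized and compete for one slot.
hidden = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
scores = [0.1, 0.9, 0.3]  # made-up router scores for the 3 specialized groups

mask = mohd_mask(scores, num_shared=1, top_k=1)
sparse_hidden = apply_mask(hidden, mask, group_size=2)
print(mask)           # [1, 0, 1, 0]
print(sparse_hidden)  # [1.0, 2.0, 0.0, 0.0, 5.0, 6.0, 0.0, 0.0]
```

In this toy setup only half of the hidden dimensions are active for the token, while the shared group stays active for every token regardless of the router scores.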

Taxonomy

Topics: Sensor Technology and Measurement Systems