Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion

Filip Szatkowski; Bartosz Wójcik; Mikołaj Piórczyński; Simone Scardapane

arXiv:2310.04361·cs.LG·November 13, 2024·1 cites

TL;DR

This paper introduces D2DMoE, a method that exploits activation sparsity in transformers to convert parts of the network into dynamic-k MoE layers, significantly reducing inference cost while maintaining performance.

Contribution

It proposes regularizing the activation sparsity of the base model and a dynamic-k expert selection rule, which together make the conversion to MoE layers more efficient and yield practical speedups.
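To make "activation sparsity" and its regularization concrete, here is a minimal NumPy sketch. It measures the fraction of zero activations in a ReLU feed-forward layer and adds a sparsity penalty to a loss. The L1 penalty, the function names, and the `weight` value are assumptions chosen for illustration; this page does not state which regularizer the paper actually uses.

```python
import numpy as np

def ffn_hidden(x, w1, b1):
    """Hidden activations of a ReLU feed-forward block: relu(x @ w1 + b1)."""
    return np.maximum(x @ w1 + b1, 0.0)

def activation_sparsity(h):
    """Fraction of hidden activations that are exactly zero."""
    return float(np.mean(h == 0.0))

def l1_sparsity_penalty(h, weight=1e-3):
    """Hypothetical regularizer: weight times the mean absolute activation.

    An L1 penalty is only one common way to encourage sparse activations;
    it is an assumption here, not the paper's confirmed choice.
    """
    return weight * float(np.mean(np.abs(h)))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # 4 tokens, model width 8
w1 = rng.normal(size=(8, 32))    # hidden width 32
b1 = np.full(32, -1.0)           # negative bias pushes more units to zero

h = ffn_hidden(x, w1, b1)
sparsity = activation_sparsity(h)
penalty = l1_sparsity_penalty(h)

task_loss = 0.0                  # placeholder for the real training loss
total_loss = task_loss + penalty
```

The higher the measured sparsity, the more hidden neurons can be skipped for a given token, which is the property a conversion to experts relies on.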

Findings

01. Reduces inference cost by up to 60%
02. Outperforms existing methods on NLP and vision tasks
03. Maintains model performance with significant efficiency gains

Abstract

Transformer models can face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts of the network into equivalent Mixture-of-Experts (MoE) layers. Despite the crucial role played by activation sparsity, its impact on this process remains unexplored. We demonstrate that the efficiency of the conversion can be significantly enhanced by a proper regularization of the activation sparsity of the base model. Moreover, motivated by the high variance of the number of activated neurons for different inputs, we introduce a more effective dynamic-$k$ expert selection rule that adjusts the number of executed experts on a per-token basis. To achieve further savings, we extend this approach to multi-head attention projections. Finally, we develop…
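The difference between a fixed top-k rule and a per-token dynamic-$k$ rule can be sketched as follows. The thresholding criterion used here (keep every expert whose score is at least `tau` times the token's maximum score) and the names `dynamic_k_mask`, `top_k_mask`, and `tau` are assumptions for illustration; the abstract above does not specify the exact selection rule.

```python
import numpy as np

def dynamic_k_mask(scores, tau=0.5):
    """Select a variable number of experts per token.

    scores: (tokens, experts) array of non-negative importance scores.
    An expert is kept if its score is at least tau times the largest
    score for that token (an assumed criterion, for illustration only).
    """
    threshold = tau * scores.max(axis=1, keepdims=True)
    return scores >= threshold

def top_k_mask(scores, k):
    """Baseline: always select exactly k experts per token."""
    idx = np.argsort(-scores, axis=1)[:, :k]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

scores = np.array([
    [0.90, 0.10, 0.05, 0.02],   # one dominant expert
    [0.50, 0.45, 0.40, 0.35],   # importance spread across experts
])

dyn = dynamic_k_mask(scores, tau=0.5)
experts_per_token = dyn.sum(axis=1)      # [1, 4]
fixed = top_k_mask(scores, k=2)          # always 2 experts per token
```

In this toy input the first token needs only one expert while the second needs all four, so a fixed k=2 would overspend on the first token and underserve the second. That variance across inputs is the motivation the abstract gives for adjusting the number of executed experts per token.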

Code & Models

Videos

Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion · SlidesLive

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Neural Networks and Applications · Advanced Neural Network Applications

Methods: Balanced Selection · Attention Is All You Need · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Layer Normalization · Linear Layer