Exploring the Benefit of Activation Sparsity in Pre-training

Zhengyan Zhang; Chaojun Xiao; Qiujieli Qin; Yankai Lin; Zhiyuan Zeng; Xu Han; Zhiyuan Liu; Ruobing Xie; Maosong Sun; Jie Zhou

arXiv:2410.03440·cs.CL·October 7, 2024

TL;DR

This paper studies how activation sparsity evolves during Transformer pre-training, introduces Switchable Sparse-Dense Learning (SSD) to switch between sparse and dense training, and reports lower pre-training cost, performance comparable to dense training, and faster inference.

Contribution

The paper proposes Switchable Sparse-Dense Learning (SSD), which adaptively switches between MoE-based sparse training and conventional dense training during pre-training to reduce training cost while keeping performance comparable to dense training.

Findings

01  SSD reduces pre-training costs compared to dense training.

02  Models trained with SSD can be used as MoE models for faster inference.

03  SSD achieves comparable performance to dense training with up to 2x faster inference.

Abstract

Pre-trained Transformers inherently possess the characteristic of sparse activation, where only a small fraction of the neurons are activated for each token. While sparse activation has been explored through post-training methods, its potential in pre-training remains untapped. In this work, we first study how activation properties change during pre-training. Our examination reveals that Transformers exhibit sparse activation throughout the majority of the pre-training process while the activation correlation keeps evolving as training progresses. Leveraging this observation, we propose Switchable Sparse-Dense Learning (SSD). SSD adaptively switches between the Mixtures-of-Experts (MoE) based sparse training and the conventional dense training during the pre-training process, leveraging the efficiency of sparse training and avoiding the static activation correlation of sparse training.…
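The two ideas in the abstract, measuring how many neurons are active per token and running the same feed-forward layer in either dense or MoE-style sparse mode, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`make_ffn`, `activation_sparsity`, `forward`), the contiguous grouping of neurons into experts, and the centroid-based router are all assumptions made for illustration, and the abstract does not say how SSD decides when to switch, so that decision is left to the caller.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def make_ffn(d_model, d_ff, seed=0):
    """Toy ReLU feed-forward layer: one (input weights, output weights) pair per neuron."""
    rng = random.Random(seed)
    return [([rng.gauss(0, 1) for _ in range(d_model)],
             [rng.gauss(0, 1) for _ in range(d_model)]) for _ in range(d_ff)]

def activation_sparsity(neurons, x):
    """Fraction of hidden neurons whose ReLU activation is zero for token vector x."""
    acts = [max(0.0, dot(w_in, x)) for w_in, _ in neurons]
    return sum(a == 0.0 for a in acts) / len(acts)

def run_neurons(neurons, x):
    """Sum of activation * output weights over the given neurons."""
    out = [0.0] * len(x)
    for w_in, w_out in neurons:
        a = max(0.0, dot(w_in, x))
        if a > 0.0:
            for j, w in enumerate(w_out):
                out[j] += a * w
    return out

def forward(neurons, x, mode="dense", num_experts=4, top_k=1):
    """Run the same weights densely, or as an MoE that evaluates only top_k experts."""
    if mode == "dense":
        return run_neurons(neurons, x)
    assert len(neurons) % num_experts == 0, "d_ff must divide evenly into experts"
    size = len(neurons) // num_experts
    # Assumption: experts are contiguous slices of neurons.
    experts = [neurons[i * size:(i + 1) * size] for i in range(num_experts)]

    def score(expert):
        # Hypothetical router: score an expert by its mean input weight vector.
        centroid = [sum(col) / size for col in zip(*(w_in for w_in, _ in expert))]
        return dot(centroid, x)

    chosen = sorted(experts, key=score, reverse=True)[:top_k]
    out = [0.0] * len(x)
    for expert in chosen:
        for j, v in enumerate(run_neurons(expert, x)):
            out[j] += v
    return out
```

Because both modes share one set of weights, a training loop could call `forward(..., mode="sparse")` for some steps and `forward(..., mode="dense")` for others. With random weights roughly half the neurons are inactive, so this toy layer does not reproduce the level of sparsity the paper reports for trained Transformers.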

Code & Models

Repositories

thunlp/moefication (PyTorch, official)

Taxonomy

Topics: Activation sparsity · Transformer pre-training · Mixture of Experts

Methods: Mixture of Experts · Switchable Sparse-Dense Learning (SSD)