Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

Yixin Song; Haotong Xie; Zhengyan Zhang; Bo Wen; Li Ma; Zeyu Mi; Haibo Chen

arXiv:2406.05955·cs.LG·June 12, 2024·1 cites

Open Access · 1 Repo · 10 Models

TL;DR

Turbo Sparse introduces a novel activation function and training strategy to significantly increase activation sparsity in large language models, enabling faster inference with minimal performance loss and practical speedups on mobile devices.

Contribution

The paper proposes a new dReLU activation function and a high-quality training data mixture ratio, and it exploits sparse activation patterns within the FFN experts of MoE models, to increase activation sparsity and inference efficiency in LLMs.
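This page does not define dReLU, so the following is only a minimal sketch under a stated assumption: that dReLU applies ReLU to both the gate and the up projection of a gated FFN and multiplies the results elementwise. The inputs are random Gaussian pre-activations rather than values from a trained model, and all function names are hypothetical. It compares how many hidden entries are exactly zero under this assumed dReLU and under a SwiGLU-style gate.

```python
import math
import random


def relu(v):
    return max(0.0, v)


def silu(v):
    # SiLU (swish) as used in SwiGLU-style gates.
    return v / (1.0 + math.exp(-v))


def drelu_hidden(gate_pre, up_pre):
    # ASSUMPTION: dReLU = ReLU(gate) * ReLU(up), elementwise.
    return [relu(g) * relu(u) for g, u in zip(gate_pre, up_pre)]


def swiglu_hidden(gate_pre, up_pre):
    # SwiGLU-style hidden state: SiLU(gate) * up, elementwise.
    return [silu(g) * u for g, u in zip(gate_pre, up_pre)]


def sparsity(hidden):
    # Fraction of hidden entries that are exactly zero.
    return sum(1 for h in hidden if h == 0.0) / len(hidden)


random.seed(0)
n = 10_000
# Toy pre-activations; a real model's distributions will differ.
gate_pre = [random.gauss(0.0, 1.0) for _ in range(n)]
up_pre = [random.gauss(0.0, 1.0) for _ in range(n)]

drelu_sparsity = sparsity(drelu_hidden(gate_pre, up_pre))
swiglu_sparsity = sparsity(swiglu_hidden(gate_pre, up_pre))

print(f"assumed dReLU sparsity: {drelu_sparsity:.3f}")
print(f"SwiGLU-style sparsity:  {swiglu_sparsity:.3f}")
```

On these toy inputs an entry survives only when both branches are positive, so most entries are zeroed, whereas the SiLU gate almost never produces an exact zero. The numbers say nothing about the sparsity levels the paper reports for trained models.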

Findings

01. Achieves a 2-5x decoding speedup.

02. Only 2.5B and 4.3B parameters are activated per inference iteration in the sparsified Mistral and Mixtral models, respectively.

03. Inference on mobile phones reaches 11 tokens per second.
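As a rough reading of these figures, the arithmetic below converts the reported throughput into per-token latency and estimates the share of parameters that is active. The total parameter counts (about 7B and about 47B) and the pairing of the two activated counts with those totals are assumptions for illustration; they do not appear on this page.

```python
# Reported on this page.
tokens_per_second = 11.0
activated_params = [2.5e9, 4.3e9]

# ASSUMPTION: approximate total sizes of the two base models, in the same order.
assumed_total_params = [7.0e9, 47.0e9]

# Per-token latency implied by the mobile throughput figure.
ms_per_token = 1000.0 / tokens_per_second

# Fraction of parameters touched per inference iteration under the assumed totals.
active_fractions = [a / t for a, t in zip(activated_params, assumed_total_params)]

print(f"latency: {ms_per_token:.1f} ms per token")
for a, t, f in zip(activated_params, assumed_total_params, active_fractions):
    print(f"{a / 1e9:.1f}B of ~{t / 1e9:.0f}B parameters active ({f:.1%})")
```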

Abstract

Exploiting activation sparsity is a promising approach to significantly accelerating the inference process of large language models (LLMs) without compromising performance. However, activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity. Moreover, inadequate training data can further increase the risk of performance degradation. To address these challenges, we propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. Additionally, we leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. By applying our neuron sparsification…
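To illustrate why activation sparsity can speed up inference, here is a minimal sketch, not the authors' implementation, of an FFN down projection that skips neurons whose activation is exactly zero. The sizes, the 90% zero rate, and the function names are hypothetical choices for the example.

```python
import math
import random


def dense_down_projection(hidden, w_down):
    # Reads every row of w_down, including rows multiplied by zero.
    out = [0.0] * len(w_down[0])
    for h, row in zip(hidden, w_down):
        for j, w in enumerate(row):
            out[j] += h * w
    return out


def sparse_down_projection(hidden, w_down):
    # Skips rows whose neuron is inactive, so their weights are never read.
    out = [0.0] * len(w_down[0])
    rows_read = 0
    for h, row in zip(hidden, w_down):
        if h == 0.0:
            continue
        rows_read += 1
        for j, w in enumerate(row):
            out[j] += h * w
    return out, rows_read


random.seed(0)
d_hidden, d_out = 512, 16  # hypothetical sizes
w_down = [[random.gauss(0.0, 1.0) for _ in range(d_out)] for _ in range(d_hidden)]
# Hypothetical hidden state with roughly 90% exact zeros.
hidden = [random.random() if random.random() < 0.1 else 0.0 for _ in range(d_hidden)]

dense_out = dense_down_projection(hidden, w_down)
sparse_out, rows_read = sparse_down_projection(hidden, w_down)

print(f"rows read: {rows_read} of {d_hidden}")
print("outputs match:", all(math.isclose(a, b, abs_tol=1e-12)
                            for a, b in zip(dense_out, sparse_out)))
```

The output is unchanged, but the sparse version touches only the weight rows of active neurons. This is why higher sparsity means less computation and less memory traffic per token.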

Code & Models

Repositories

sjtu-ipads/powerinfer

Taxonomy

Topics: Blind Source Separation Techniques · Neural Networks and Applications · Advanced Wireless Communication Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · GeGLU · SwiGLU