Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters
Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, and Haibo Chen

TL;DR
Turbo Sparse introduces a novel activation function and training strategy to significantly increase activation sparsity in large language models, enabling faster inference with minimal performance loss and practical speedups on mobile devices.
Contribution
The paper proposes a new dReLU activation function and a training data mixture ratio, and exploits sparse activation patterns within the FFN experts of MoE models, to enhance activation sparsity and inference efficiency in LLMs.
Findings
Achieves 2-5x decoding speedup in large models.
Only 2.5B and 4.3B parameters are activated per inference iteration for the sparsified Mistral and Mixtral models, respectively.
Mobile inference speed reaches 11 tokens/sec.
Abstract
Exploiting activation sparsity is a promising approach to significantly accelerating the inference process of large language models (LLMs) without compromising performance. However, activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity. Moreover, inadequate training data can further increase the risk of performance degradation. To address these challenges, we propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. Additionally, we leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. By applying our neuron sparsification…
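To make the sparsity argument concrete, here is a minimal sketch comparing how many hidden neurons are exactly zero under three gated FFN variants. It assumes dReLU applies ReLU to both the gate and the up projection of a gated FFN, which this summary does not spell out. The helper names (`glu_hidden`, `sparsity`), the dimensions, and the random Gaussian weights are all hypothetical. The numbers it produces illustrate the mechanism only and are not results from the paper.

```python
import math
import random

def relu(v):
    return max(0.0, v)

def silu(v):
    # SiLU (swish): v * sigmoid(v); zero only when v is exactly zero.
    return v / (1.0 + math.exp(-v))

def identity(v):
    return v

def matvec(W, x):
    # Plain matrix-vector product; W is a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def glu_hidden(x, W_gate, W_up, gate_act, up_act):
    # Hidden activations of a gated FFN: gate_act(W_gate x) * up_act(W_up x).
    gate = matvec(W_gate, x)
    up = matvec(W_up, x)
    return [gate_act(g) * up_act(u) for g, u in zip(gate, up)]

def sparsity(h):
    # Fraction of hidden neurons whose activation is exactly zero.
    return sum(1 for v in h if v == 0.0) / len(h)

# Toy setup (assumption): random Gaussian input and weights, not a trained model.
random.seed(0)
d_model, d_ff = 16, 2048
x = [random.gauss(0, 1) for _ in range(d_model)]
W_gate = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
W_up = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]

swiglu_sparsity = sparsity(glu_hidden(x, W_gate, W_up, silu, identity))
relu_glu_sparsity = sparsity(glu_hidden(x, W_gate, W_up, relu, identity))
drelu_sparsity = sparsity(glu_hidden(x, W_gate, W_up, relu, relu))

print(f"SwiGLU-style: {swiglu_sparsity:.2%} zeros")
print(f"ReLU on gate only: {relu_glu_sparsity:.2%} zeros")
print(f"dReLU-style (ReLU on gate and up): {drelu_sparsity:.2%} zeros")
```

In this toy setting a neuron is zeroed whenever either projection is non-positive, so the dReLU-style variant zeroes every neuron the gate-only variant does, plus more. A neuron that outputs exactly zero can be skipped during decoding, which is the kind of saving the abstract refers to. Trained models will show different sparsity levels than random weights.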
Code & Models
- 🤗 Tiiny/TurboSparse-Mixtral · model · 20 dl · ♡ 42
- 🤗 Tiiny/TurboSparse-Mixtral-Instruct · model · 8 dl · ♡ 28
- 🤗 Tiiny/TurboSparse-Mistral-Instruct · model · 17 dl · ♡ 23
- 🤗 sunatte/txt2sql · model
- 🤗 MachoMaheen/devdock4bit · model
- 🤗 Tiiny/SparseQwen2-7B · model · 15 dl · ♡ 5
- 🤗 sicer/arc-agi-legacy · model
- 🤗 JilinHu/llemma_7b_3epoch_r32_e5_RQ1 · model · 1 dl
- 🤗 Xin-Rui/LLAMA-Fac-NEW-A800 · model · ♡ 1
- 🤗 Linksome/lmf · model
Taxonomy
Topics: Blind Source Separation Techniques · Neural Networks and Applications · Advanced Wireless Communication Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · GeGLU · SwiGLU
