TL;DR
BlockFFN is a mixture-of-experts (MoE) architecture designed for chunk-level activation sparsity, which makes large language models easier to accelerate on resource-constrained (end-side) devices.
Contribution
The paper proposes BlockFFN, an MoE architecture with differentiable routing, together with training objectives that promote both token-level and chunk-level sparsity, so that the model stays compatible with acceleration techniques such as speculative decoding.
Findings
BlockFFN achieves over 80% token-level sparsity and 70% chunk-level sparsity.
Its acceleration kernels give up to 3.67× speedup on end-side devices compared to dense models.
It outperforms other MoE baselines.
Abstract
To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU…
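The distinction between token-level and chunk-level sparsity drawn in the abstract can be illustrated with a small sketch. It is not the paper's code: the function names, the 0/1 activation-mask layout (one row per token, one column per expert), and the use of non-overlapping chunks are all assumptions made for illustration. It shows that two activation patterns with identical token-level sparsity can differ sharply once consecutive tokens are grouped.

```python
def token_level_sparsity(mask):
    """Fraction of experts left inactive, averaged over all tokens.

    `mask` is a hypothetical layout: mask[t][e] is 1 if token t
    activates expert e, else 0.
    """
    num_experts = len(mask[0])
    active = sum(sum(row) for row in mask)
    return 1.0 - active / (len(mask) * num_experts)


def chunk_level_sparsity(mask, chunk_size):
    """Fraction of experts that no token in a chunk activates.

    Assumption: chunks are non-overlapping runs of `chunk_size`
    consecutive tokens; an incomplete trailing chunk is ignored.
    """
    num_experts = len(mask[0])
    ratios = []
    for start in range(0, len(mask) - chunk_size + 1, chunk_size):
        chunk = mask[start:start + chunk_size]
        # An expert counts as active if any token in the chunk uses it.
        union = [any(row[e] for row in chunk) for e in range(num_experts)]
        ratios.append(sum(union) / num_experts)
    return 1.0 - sum(ratios) / len(ratios)


# Toy data: 4 tokens, 8 experts, every token activates exactly 2 experts.
scattered = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1],
]
clustered = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
]

print(token_level_sparsity(scattered), chunk_level_sparsity(scattered, 4))
print(token_level_sparsity(clustered), chunk_level_sparsity(clustered, 4))
```

Both toy patterns have a token-level sparsity of 0.75, but over a 4-token chunk the scattered pattern touches every expert (chunk-level sparsity 0.0), while the clustered pattern leaves most experts untouched (0.625). The first case is the one the abstract describes as unfriendly for acceleration.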
