BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

Chenyang Song; Weilin Zhao; Xu Han; Chaojun Xiao; Yingfa Chen; Yuxuan Li; Zhiyuan Liu; Maosong Sun

arXiv:2507.08771·cs.LG·July 31, 2025

TL;DR

BlockFFN is a mixture-of-experts architecture designed for high chunk-level activation sparsity, which makes large language models easier to accelerate on resource-constrained (end-side) devices.

Contribution

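The paper proposes BlockFFN, which combines differentiable routing with training objectives that promote both token-level and chunk-level sparsity, so that the resulting sparsity pattern suits acceleration under low-resource conditions and mainstream techniques such as speculative decoding.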

Findings

01. Achieves over 80% token-level sparsity and 70% chunk-level sparsity.

02. Up to 3.67× speedup on end-side devices compared to dense models.

03. Demonstrates superior performance over other MoE baselines.

Abstract

To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU…
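
The difference between token-level and chunk-level sparsity described in the abstract can be illustrated with a small sketch. It assumes a boolean activation mask with one row per token and one column per expert, and it measures chunk-level sparsity over non-overlapping chunks of consecutive tokens; the function names, the toy masks, and the chunking scheme are illustrative assumptions, not the paper's exact definitions.

```python
def make_mask(active_sets, num_experts):
    """Build a token-by-expert boolean mask from per-token sets of active experts."""
    return [[e in s for e in range(num_experts)] for s in active_sets]


def token_level_sparsity(mask):
    """Average fraction of experts that a single token leaves inactive."""
    num_tokens = len(mask)
    num_experts = len(mask[0])
    inactive = sum(row.count(False) for row in mask)
    return inactive / (num_tokens * num_experts)


def chunk_level_sparsity(mask, chunk_size):
    """Average fraction of experts that no token in a chunk activates.

    Assumption: chunks are non-overlapping runs of `chunk_size` consecutive
    tokens; a trailing partial chunk is ignored.
    """
    num_experts = len(mask[0])
    ratios = []
    for start in range(0, len(mask) - chunk_size + 1, chunk_size):
        chunk = mask[start:start + chunk_size]
        # An expert counts as active for the chunk if any token activates it.
        union = [any(row[e] for row in chunk) for e in range(num_experts)]
        ratios.append(union.count(False) / num_experts)
    return sum(ratios) / len(ratios)


# Toy masks: 4 tokens, 8 experts, each token activates exactly 2 experts.
# "scattered": consecutive tokens pick disjoint experts.
# "shared": consecutive tokens pick the same experts.
scattered = make_mask([{0, 1}, {2, 3}, {4, 5}, {6, 7}], num_experts=8)
shared = make_mask([{0, 1}] * 4, num_experts=8)

print(token_level_sparsity(scattered), chunk_level_sparsity(scattered, 4))  # 0.75 0.0
print(token_level_sparsity(shared), chunk_level_sparsity(shared, 4))        # 0.75 0.75
```

Both toy masks have the same token-level sparsity, yet the scattered one has zero chunk-level sparsity because the union of four consecutive tokens touches every expert. This is the pattern the abstract calls unfriendly for acceleration: a kernel that processes several tokens at once must load every expert any of those tokens activates.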
