TL;DR
DECO introduces a sparse Mixture-of-Experts architecture that achieves dense-transformer performance on end-side devices with significantly reduced computational and storage requirements.
Contribution
The paper proposes DECO, a sparse MoE architecture that combines adaptive ReLU-based routing, a new activation function (NormSiLU), and a simplified expert design to match dense-model performance at lower computational and storage cost.
Findings
DECO activates only 20% of experts while maintaining dense performance.
DECO outperforms existing MoE baselines in experiments.
DECO achieves a 2.93× speedup on Jetson AGX Orin with a specialized kernel.
Abstract
While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed-expert activation ratio and a higher…
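The two mechanisms named in the abstract, ReLU-based routing with learnable expert-wise scaling and NormSiLU, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify the normalization, so RMS normalization over the feature axis is assumed here. Using NormSiLU as the activation inside each expert, the two-matrix expert shape, and all names (`norm_silu`, `relu_route`, `moe_layer`, `expert_scale`) are hypothetical choices made for illustration.

```python
import numpy as np

def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def norm_silu(x, eps=1e-6):
    # ASSUMPTION: RMS normalization over the last axis. The abstract only
    # says inputs are normalized before SiLU, not which norm is used.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return silu(x / rms)

def relu_route(x, w_router, expert_scale):
    # Gate = ReLU(router logits) * per-expert scale (a learnable vector in
    # training, a plain array here). A zero gate means the expert is skipped
    # for that token, so sparsity comes from ReLU rather than a fixed top-k.
    return np.maximum(x @ w_router, 0.0) * expert_scale

def moe_layer(x, w_router, expert_scale, experts, shared):
    # x: (tokens, d); experts: list of (w_in, w_out); shared: (w_in, w_out).
    gates = relu_route(x, w_router, expert_scale)   # (tokens, n_experts)
    # The shared expert runs for every token.
    out = norm_silu(x @ shared[0]) @ shared[1]
    for e, (w_in, w_out) in enumerate(experts):
        active = gates[:, e] > 0                    # tokens routed to expert e
        if active.any():
            h = norm_silu(x[active] @ w_in)         # ASSUMED placement of NormSiLU
            out[active] += gates[active, e:e + 1] * (h @ w_out)
    return out, gates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, hidden, n_experts, tokens = 8, 16, 5, 4
    x = rng.normal(size=(tokens, d))
    w_router = rng.normal(size=(d, n_experts))
    expert_scale = np.ones(n_experts)
    experts = [(rng.normal(size=(d, hidden)), rng.normal(size=(hidden, d)))
               for _ in range(n_experts)]
    shared = (rng.normal(size=(d, hidden)), rng.normal(size=(hidden, d)))
    out, gates = moe_layer(x, w_router, expert_scale, experts, shared)
    print("output shape:", out.shape)
    print("fraction of active (token, expert) pairs:", (gates > 0).mean())
```

The fraction of nonzero gates printed at the end corresponds to the routed-expert activation ratio discussed in the abstract. With random weights it will not reproduce the roughly 20% figure reported in the findings, which depends on training.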
