Ada-K Routing: Boosting the Efficiency of MoE-based LLMs
Tongtian Yue, Longteng Guo, Jie Cheng, Xuange Gao, Jing Liu

TL;DR
This paper introduces Ada-K routing, a dynamic expert activation method for MoE-based LLMs that improves efficiency and performance by adaptively allocating experts per token using learnable modules and reinforcement learning.
Contribution
We propose a novel Ada-K routing strategy with learnable allocators and PPO training, enabling dynamic expert activation in MoE-LLMs, outperforming static Top-K routing in efficiency and accuracy.
Findings
Over 25% reduction in FLOPs compared to Top-K routing.
More than 20% inference speedup while maintaining or improving performance.
Efficient training of large-scale MoE models within 8 hours.
Abstract
In the era of Large Language Models (LLMs), Mixture-of-Experts (MoE) architectures offer a promising approach to managing computational costs while scaling up model parameters. Conventional MoE-based LLMs typically employ static Top-K routing, which activates a fixed and equal number of experts for each token regardless of its significance within the context. In this paper, we propose a novel Ada-K routing strategy that dynamically adjusts the number of activated experts for each token, thereby improving the balance between computational efficiency and model performance. Specifically, our strategy incorporates learnable and lightweight allocator modules that decide customized expert resource allocation tailored to the contextual needs of each token. These allocators are designed to be fully pluggable, making them broadly applicable across all mainstream MoE-based LLMs. We leverage the…
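The contrast between static Top-K routing and per-token adaptive allocation can be illustrated with a minimal sketch. This is not the authors' implementation. The allocator here is assumed to be a single linear layer that outputs a distribution over k = 1..num_experts, and the function names (`top_k_route`, `ada_k_route`), weight shapes, and the greedy-at-inference versus sample-during-training split are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def top_k_route(router_logits, k):
    """Static Top-K: keep the k highest-scoring experts, renormalize their gates."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def ada_k_route(hidden, router_weights, allocator_weights, rng=None):
    """Adaptive sketch: a (hypothetical) linear allocator picks k for this token,
    then the existing router decides which k experts to activate."""
    router_logits = matvec(router_weights, hidden)
    # Assumption: the allocator outputs one logit per candidate k in 1..num_experts.
    k_probs = softmax(matvec(allocator_weights, hidden))
    if rng is None:
        # Greedy choice of k, e.g. at inference time.
        k = 1 + max(range(len(k_probs)), key=lambda i: k_probs[i])
    else:
        # Sampled choice of k, as a policy-gradient method would need in training.
        k = 1 + rng.choices(range(len(k_probs)), weights=k_probs)[0]
    return k, top_k_route(router_logits, k)


if __name__ == "__main__":
    rng = random.Random(0)
    num_experts, dim = 8, 4
    router_w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
    alloc_w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
    for _ in range(5):
        token = [rng.gauss(0, 1) for _ in range(dim)]
        k, routes = ada_k_route(token, router_w, alloc_w)
        print("k =", k, "experts =", [i for i, _ in routes])
```

In this sketch, tokens for which the allocator picks a small k activate fewer experts than a fixed Top-K would, which is where any compute saving would come from. How the allocator is actually trained, and what reward it receives, is not specified in the text above and is left out here.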
Peer Reviews
Decision: ICLR 2025 Poster
MoE is a scalable solution that balances parameter increase with computational cost. Targeting the limitations of prior efforts with a fixed number of experts, this paper works to make the expert number dynamic, which can bring additional efficiency and potential performance gains. The experiments include studies across multiple scales of models to test the effectiveness of the proposed method. An advantage of Ada-K is that it is pluggable, making it applicable across different MoE-based LLMs.
1. The Ada-K routing design targets the post-training of MoE-based LLMs. Since the gate is already trained at that stage, it is questionable whether RL is needed to learn the gate selection again, as this complicates the overall design. For instance, a simpler solution is to distill Top-K selection into binary selection with a separate gate, similar to the MoD design. The paper does not include a comparison with this naive Top-K selection distillation, making it hard to say whether
- Introduces an RL agent that controls the number of activated experts (k) in pre-trained MoE models, optimizing the allocation of computational resources and improving inference efficiency. The method has a low training cost and shows robustness to training data, outperforming existing methods across various downstream tasks.
- Thorough ablation studies demonstrate the relationship between acceleration effects and accuracy, providing practical guidance. It also validates the effectiveness of activa
- Some comparisons with baseline methods are not entirely reasonable. Existing work suggests that freezing the router yields better results when tuning MoE models. Ada-K freezes the router and introduces additional agent parameters for training, while other comparison methods only train the router. The reviewer recommends supplementing the results by (1) conducting full fine-tuning of the model with a frozen router and comparing the effects of introducing only threshold methods versus Ada-K; or
1) Well-motivated. MoE LLMs are known to be effective and promising, but the efficiency of their deployment is limited by the huge number of trainable parameters, so improving their efficiency is worthwhile. 2) Clear writing and comprehensive ablation studies.
1) An important baseline is missing: Mixture of Depth (https://arxiv.org/abs/2404.02258). Due to the layer skipping in that paper, the computation cost for each token is adaptive as well. 2) Due to the imbalanced computation cost across layers, pipeline parallelism becomes more challenging to use during both training and inference. 3) There are many other ways to introduce an adaptive computation budget, e.g. the ACT algorithm in the Universal Transformer (https://arxiv.org/abs/1807.03819
Taxonomy
Topics: Caching and Content Delivery · Mobile Agent-Based Network Management · Service-Oriented Architecture and Web Services
Methods: Mixture of Experts
