TL;DR
This paper introduces MoHGE, a resource-efficient mixture of heterogeneous grouped experts for language modeling, balancing performance, parameter efficiency, and GPU utilization.
Contribution
It proposes novel two-level routing and auxiliary loss mechanisms to improve resource efficiency and load balancing in heterogeneous MoE architectures.
Findings
MoHGE matches MoE performance with 20% fewer parameters.
It achieves balanced GPU utilization during inference.
The approach reduces deployment costs in real-world scenarios.
Abstract
Large Language Models (LLMs) based on Mixture-of-Experts (MoE) are pivotal in industrial applications for their ability to scale performance efficiently. However, standard MoEs enforce uniform expert sizes, creating a rigidity that fails to align computational costs with varying token-level complexity. While heterogeneous expert architectures attempt to address this by diversifying expert sizes, they often suffer from significant system-level challenges, specifically unbalanced GPU utilization and inefficient parameter utilization, which hinder practical deployment. To bridge the gap between theoretical heterogeneity and robust industrial application, we propose Mixture of Heterogeneous Grouped Experts (MoHGE), which introduces a two-level routing mechanism to enable flexible, resource-aware expert combinations. To optimize inference efficiency, we propose a Group-Wise Auxiliary Loss,…
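The abstract does not spell out how the two routing levels operate, so the following is only a minimal sketch of one plausible reading: a first router picks a group of experts, and a second router picks the top-k experts inside that group. The function name two_level_route, the linear routers, the argmax group choice, and the gate renormalization are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def two_level_route(token, group_weights, expert_weights, top_k=1):
    """Hypothetical two-level router (not the paper's exact method).

    token          -- list of d floats
    group_weights  -- one d-dim router vector per group
    expert_weights -- per group, one d-dim router vector per expert;
                      groups may hold different numbers of experts
    Returns (group_index, [(expert_index, gate), ...]).
    """
    # Level 1: score every group and keep the highest-scoring one.
    group_probs = softmax([dot(token, w) for w in group_weights])
    group = max(range(len(group_probs)), key=group_probs.__getitem__)

    # Level 2: score only the experts inside the chosen group.
    expert_probs = softmax([dot(token, w) for w in expert_weights[group]])
    ranked = sorted(range(len(expert_probs)),
                    key=expert_probs.__getitem__, reverse=True)
    chosen = ranked[:top_k]

    # Renormalize the gates of the selected experts so they sum to 1.
    norm = sum(expert_probs[i] for i in chosen)
    return group, [(i, expert_probs[i] / norm) for i in chosen]


# Illustrative setup: two groups holding 2 and 3 experts respectively.
random.seed(0)
d = 4
group_sizes = [2, 3]
group_w = [[random.gauss(0, 1) for _ in range(d)] for _ in group_sizes]
expert_w = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
            for n in group_sizes]
token = [random.gauss(0, 1) for _ in range(d)]
group, picks = two_level_route(token, group_w, expert_w, top_k=2)
```

Routing inside a single group bounds the number of experts scored per token, which is one way a design like this could keep per-device work predictable. Whether MoHGE groups experts by size, by device, or by some other criterion is not stated in the truncated abstract.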
