
TL;DR
Excitation is a novel optimizer framework that dynamically modulates updates based on expert utilization in sparse models, improving training stability and performance in Mixture-of-Experts architectures.
Contribution
It introduces a new dynamic update modulation method that enhances training stability and performance in sparse, expert-based models, addressing structural confusion issues.
Findings
Accelerates convergence in language and vision MoE models
Rescues models from structural confusion during training
Consistently improves final performance across tasks
Abstract
We propose Excitation, a novel optimization framework designed to accelerate learning in sparse architectures such as Mixture-of-Experts (MoEs). Unlike traditional optimizers that treat all parameters uniformly, Excitation dynamically modulates updates using batch-level expert utilization. It introduces a competitive update dynamic that amplifies updates to highly-utilized experts and can selectively suppress low-utilization ones, effectively sharpening routing specialization. Notably, we identify a phenomenon of "structural confusion" in deep MoEs, where standard optimizers fail to establish functional signal paths; Excitation acts as a specialization catalyst, "rescuing" these models and enabling stable training where baselines remain trapped. Excitation is optimizer-, domain-, and model-agnostic, requires minimal integration effort, and introduces neither additional per-parameter…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDomain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Multimodal Machine Learning Applications
