RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts
Vyom Sharma, Debajyoti Datta

TL;DR
RaMP is a routing-aware dispatch framework for Mixture-of-Experts inference. It selects the kernel configuration from the runtime expert histogram rather than from batch size alone, reaching 0.93% mean regret versus exhaustive search and up to 1.41x faster end-to-end serving.
Contribution
Introduces RaMP, which combines a performance-region analysis with a four-parameter wave cost model to predict the fastest kernel configuration for MoE inference, enabling kernel-agnostic, runtime-aware dispatch.
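To make the idea of wave-based dispatch concrete, here is a minimal sketch of how a cost estimate might be derived from the expert histogram and the CTA grid alone. It is not the paper's four-parameter model: the function names, the config fields (`tile_m`, `n_tiles`, `wave_time_us`), and the toy cost "waves times per-wave time" are all assumptions made for illustration.

```python
import math

def cta_count(expert_histogram, tile_m, n_tiles):
    """CTAs launched if each expert's tokens are split into tile_m-row tiles.

    n_tiles is a hypothetical count of tiles along the output dimension.
    Experts that received no tokens launch no CTAs.
    """
    return sum(math.ceil(tokens / tile_m) * n_tiles
               for tokens in expert_histogram if tokens > 0)

def wave_count(num_ctas, num_sms, ctas_per_sm=1):
    """Number of full or partial waves needed to run num_ctas CTAs."""
    return math.ceil(num_ctas / (num_sms * ctas_per_sm))

def pick_config(expert_histogram, configs, num_sms):
    """Return the config with the lowest toy cost: waves * per-wave time."""
    def cost(cfg):
        ctas = cta_count(expert_histogram, cfg["tile_m"], cfg["n_tiles"])
        return wave_count(ctas, num_sms) * cfg["wave_time_us"]
    return min(configs, key=cost)

# Hypothetical numbers: a skewed routing histogram over 4 experts, 16 SMs.
histogram = [300, 10, 10, 10]
configs = [
    {"tile_m": 128, "n_tiles": 8, "wave_time_us": 20.0},
    {"tile_m": 64, "n_tiles": 8, "wave_time_us": 12.0},
]
best = pick_config(histogram, configs, num_sms=16)
```

In this toy setup the two configurations differ in both the number of waves and the time per wave, so the winner depends on how tokens are spread across experts, not only on the total batch size.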
Findings
RaMP achieves 0.93% mean regret compared to exhaustive search.
Applying RaMP to Alpha-MoE yields a 1.14x speedup without source modifications.
RaMP improves end-to-end serving speed by up to 1.41x over existing methods.
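The regret figure above can be read as follows. This is a sketch that assumes the common definition of regret as the relative slowdown of the selected configuration against the best one found by exhaustive search, averaged over test cases; the page does not spell out the exact definition, and the timings below are made up.

```python
def mean_regret(selected_times, best_times):
    """Mean relative slowdown of the selected config vs. the exhaustive best.

    selected_times[i]: latency of the config the dispatcher picked for case i.
    best_times[i]:     latency of the fastest config for case i.
    """
    if len(selected_times) != len(best_times) or not best_times:
        raise ValueError("inputs must be non-empty and of equal length")
    regrets = [(s - b) / b for s, b in zip(selected_times, best_times)]
    return sum(regrets) / len(regrets)

# Hypothetical latencies in microseconds for two routing scenarios.
r = mean_regret([10.0, 20.2], [10.0, 20.0])
print(f"mean regret: {100 * r:.2f}%")
```

Under this definition, a mean regret of 0.93% would mean the dispatcher's choices are on average less than 1% slower than the per-case optimum.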
Abstract
The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10-70% of kernel throughput unrealized. We present RaMP, a routing-aware dispatch framework. A performance-region analysis derives, from hardware constants alone, when each optimization helps, correctly predicting all 8 tested architectures, including 3 unseen. A four-parameter wave cost model selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search, fitted from just 10-24 minutes of one-time profiling per model. Because the model depends only on CTA grid geometry, it is kernel-agnostic: applied to Alpha-MoE, it delivers 1.14x with no source modification. Paired with a co-designed CuTe DSL kernel exposing 134-268 polymorphic…
