RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

Vyom Sharma; Debajyoti Datta

arXiv:2604.26039 · cs.LG · April 30, 2026

TL;DR

RaMP is a routing-aware dispatch framework for Mixture-of-Experts inference. It selects the kernel configuration from the runtime expert histogram rather than from batch size alone, recovering part of the 10-70% of kernel throughput that batch-size-only dispatch leaves unrealized.

Contribution

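Introduces RaMP, which combines a performance-region analysis derived from hardware constants with a four-parameter wave cost model to predict the fastest kernel configuration for MoE inference. Because the cost model depends only on CTA grid geometry, the dispatch is kernel-agnostic and routing-aware at runtime.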

Findings

01. RaMP achieves 0.93% mean regret compared to exhaustive search.

02. Applying RaMP to Alpha-MoE yields 1.14x speedup without source modifications.

03. RaMP improves end-to-end serving speed by up to 1.41x over existing methods.
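
To make the regret figure concrete, here is a minimal sketch of how a mean regret could be computed. It assumes regret means the relative slowdown of the selected configuration versus the fastest one found by exhaustive search; the listing does not spell out the paper's exact definition. The function name and the timings are hypothetical.

```python
def mean_regret(chosen_ms, best_ms):
    """Mean relative slowdown of the chosen configs versus the exhaustive-search best.

    chosen_ms[i]: measured latency of the configuration the dispatcher picked for workload i
    best_ms[i]:   latency of the fastest configuration found by exhaustive search
    """
    if not chosen_ms or len(chosen_ms) != len(best_ms):
        raise ValueError("need two non-empty lists of equal length")
    regrets = [(c - b) / b for c, b in zip(chosen_ms, best_ms)]
    return sum(regrets) / len(regrets)


# Hypothetical latencies in milliseconds for three workloads (not data from the paper).
chosen = [1.00, 2.04, 0.50]
best = [1.00, 2.00, 0.50]
print(f"mean regret: {100 * mean_regret(chosen, best):.2f}%")
```

Under this definition, a regret of 0% on a workload means the dispatcher picked the same configuration that exhaustive search would have picked.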

Abstract

The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10-70% of kernel throughput unrealized. We present RaMP, a routing-aware dispatch framework. A performance-region analysis derives, from hardware constants alone, when each optimization helps, correctly predicting all 8 tested architectures, including 3 unseen. A four-parameter wave cost model selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search, fitted from just 10-24 minutes of one-time profiling per model. Because the model depends only on CTA grid geometry, it is kernel-agnostic: applied to Alpha-MoE, it delivers 1.14x with no source modification. Paired with a co-designed CuTe DSL kernel exposing 134-268 polymorphic…
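
The idea of choosing a configuration from the expert histogram can be sketched with a simplified wave-quantization model. This is not the paper's four-parameter model: it assumes a single hypothetical fitted constant per configuration (`wave_time_us`) and counts CTA waves from grid geometry only. All names, tile sizes, and timings below are illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelConfig:
    name: str
    tile_m: int          # tokens (rows) covered by one CTA tile
    tile_n: int          # output columns covered by one CTA tile
    wave_time_us: float  # hypothetical fitted cost of one full wave of CTAs


def cta_count(hist, cfg, n_cols):
    """CTAs launched for a per-expert token histogram; empty experts launch none."""
    col_tiles = math.ceil(n_cols / cfg.tile_n)
    return sum(math.ceil(t / cfg.tile_m) * col_tiles for t in hist if t > 0)


def predicted_cost_us(hist, cfg, n_cols, num_sms):
    """Cost = number of waves needed to run all CTAs, times the per-wave time."""
    waves = math.ceil(cta_count(hist, cfg, n_cols) / num_sms)
    return waves * cfg.wave_time_us


def select_config(hist, configs, n_cols, num_sms):
    """Pick the configuration with the lowest predicted cost for this histogram."""
    return min(configs, key=lambda c: predicted_cost_us(hist, c, n_cols, num_sms))


configs = [
    KernelConfig("small_tile", tile_m=16, tile_n=128, wave_time_us=10.0),
    KernelConfig("large_tile", tile_m=128, tile_n=128, wave_time_us=30.0),
]

# Two histograms with the same batch size (1024 tokens over 64 experts).
balanced = [16] * 64
skewed = [512, 512] + [0] * 62

for label, hist in [("balanced", balanced), ("skewed", skewed)]:
    best = select_config(hist, configs, n_cols=1024, num_sms=132)
    print(label, "->", best.name)
```

In this toy setup the two histograms have the same batch size but select different configurations, which is the situation batch-size-only dispatch cannot distinguish.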
