Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference
Wenxin Dong, Mingqing Hu, Guanghui Yu, Qiang Fu, Peng Xu, Hui Xu, Yue Xing, Xuewu Jiao, Shuanglong Li, Lin Liu

TL;DR
Ada-MK is a MegaKernel approach for LLM inference that reduces latency and improves throughput by fixing the kernel's execution path at compile time, which eliminates runtime branching.
Contribution
The paper presents a compile-time optimization framework for MegaKernel that improves both portability and efficiency on resource-constrained GPUs, enabling industrial-scale deployment.
Findings
Up to 23.6% throughput improvement over vanilla TensorRT-LLM.
50.2% throughput increase over vLLM.
First industrial deployment of MegaKernel in a commercial system.
Abstract
When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet every token generated during the decode phase triggers thousands of kernel launches, and kernel launch overhead alone can account for 14.6% of end-to-end inference time. MegaKernel eliminates launch overhead and inter-operator HBM round-trips by fusing multiple operators into a single persistent kernel. However, existing MegaKernel implementations face a fundamental tension between portability and efficiency on resource-constrained GPUs such as NVIDIA Ada: hand-tuned solutions are tightly coupled to specific architectures and lack portability, while auto-compiled approaches introduce runtime dynamic scheduling whose branch penalties are unacceptable in latency-critical settings. We observe that under a fixed…
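The contrast the abstract draws between runtime dynamic scheduling and a fused persistent kernel can be illustrated with a small sketch. This is not the authors' implementation: the operator names, the `OP_DAG` dictionary, and the helper functions are hypothetical, and the code only models the idea of resolving an operator DAG into a fixed order ahead of time, so that nothing needs to be decided while tokens are being generated. It also counts launches to show why fusing a fixed sequence into one kernel removes per-operator launch overhead.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical operator DAG (an assumption for illustration only):
# each operator maps to the set of operators it depends on.
OP_DAG = {
    "rmsnorm": set(),
    "qkv_proj": {"rmsnorm"},
    "attention": {"qkv_proj"},
    "out_proj": {"attention"},
    "residual_add": {"out_proj"},
}


def build_static_schedule(dag):
    """Resolve the execution order once, ahead of time, via topological sort."""
    return tuple(TopologicalSorter(dag).static_order())


def count_launches(schedule, fused):
    """One launch per operator when unfused; a single launch when fused."""
    return 1 if fused else len(schedule)


SCHEDULE = build_static_schedule(OP_DAG)
print(SCHEDULE)
print(count_launches(SCHEDULE, fused=False), count_launches(SCHEDULE, fused=True))
```

Because `SCHEDULE` is an immutable tuple computed before execution, a consumer can walk it without any scheduling decisions at run time. A real MegaKernel would additionally have to handle on-chip resource limits and synchronization between fused stages, which this sketch does not model.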
