Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference

Wenxin Dong; Mingqing Hu; Guanghui Yu; Qiang Fu; Peng Xu; Hui Xu; Yue Xing; Xuewu Jiao; Shuanglong Li; Lin Liu

arXiv:2605.11581 · cs.CL · May 13, 2026

TL;DR

Ada-MK introduces an optimized MegaKernel approach for LLM inference that reduces latency and improves throughput by eliminating runtime branching and optimizing kernel execution paths.

Contribution

Ada-MK presents a compile-time optimization framework for MegaKernel that improves both portability and efficiency on resource-constrained GPUs such as NVIDIA Ada, enabling industrial-scale deployment.

Findings

01. Up to 23.6% throughput improvement over vanilla TensorRT-LLM.
02. 50.2% throughput increase over vLLM.
03. First industrial deployment of MegaKernel in a commercial system.

Abstract

When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet every token generated during the decode phase triggers thousands of kernel launches, and kernel launch overhead alone can account for 14.6% of end-to-end inference time. MegaKernel eliminates launch overhead and inter-operator HBM round-trips by fusing multiple operators into a single persistent kernel. However, existing MegaKernel implementations face a fundamental tension between portability and efficiency on resource-constrained GPUs such as NVIDIA Ada: hand-tuned solutions are tightly coupled to specific architectures and lack portability, while auto-compiled approaches introduce runtime dynamic scheduling whose branch penalties are unacceptable in latency-critical settings. We observe that under a fixed…
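To put the 14.6% launch-overhead figure in perspective, the short sketch below applies Amdahl's law to it. This is an illustrative calculation, not something taken from the paper: it assumes the quoted fraction applies to the whole end-to-end time and that the overhead could be removed completely, and the helper name `max_speedup_from_removing` is hypothetical.

```python
def max_speedup_from_removing(fraction: float) -> float:
    """Upper bound on end-to-end speedup if `fraction` of the runtime vanished.

    Amdahl's law with an infinitely fast accelerated part: 1 / (1 - fraction).
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction)


# Share of end-to-end time attributed to kernel launches in the abstract.
launch_fraction = 0.146

speedup = max_speedup_from_removing(launch_fraction)
# At fixed work per request, throughput scales with the speedup.
throughput_gain_pct = (speedup - 1.0) * 100.0

print(f"speedup bound: {speedup:.3f}x")
print(f"throughput gain bound: {throughput_gain_pct:.1f}%")
```

Under these assumptions, removing launch overhead alone bounds the speedup at roughly 1.17x. Any gain beyond that would have to come from other sources, such as the avoided inter-operator HBM round-trips that the abstract also mentions.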
