MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation
Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, Tian Zhang

TL;DR
MultiKernelBench is a multi-platform benchmark for evaluating large language models on deep learning kernel generation. It covers diverse tasks and hardware platforms and introduces a category-aware prompting method to improve the quality of generated kernels.
Contribution
The paper introduces the first multi-platform benchmark for LLM-based DL kernel generation, featuring extensive task coverage, a modular design, and a category-aware prompting strategy.
Findings
Task difficulty varies significantly across LLMs.
LLMs generalize poorly to hardware platforms they have had less exposure to.
Targeted prompting improves the quality of generated kernels.
Abstract
The automatic generation of deep learning (DL) kernels using large language models (LLMs) has emerged as a promising approach to reduce the manual effort and hardware-specific expertise required for writing high-performance operator implementations. However, existing benchmarks for evaluating LLMs in this domain suffer from limited hardware support, coarse-grained kernel categorization, and imbalanced task coverage. To address these limitations, we introduce MultiKernelBench, the first comprehensive, multi-platform benchmark for LLM-based DL kernel generation. MultiKernelBench spans 285 tasks across 14 well-defined kernel categories and supports three major hardware platforms: Nvidia GPUs, Huawei NPUs, and Google TPUs. To enable future extensibility, we design a modular backend abstraction layer that decouples platform-specific logic from the core benchmarking infrastructure, allowing…
