QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm

Qirui Zhou; Shaohui Peng; Weiqiang Xiong; Haixin Chen; Yuanbo Wen; Haochen Li; Ling Li; Qi Guo; Yongwei Zhao; Ke Gao; Ruizhi Chen; Yanjun Wu; Chen Zhao; Yunji Chen

arXiv:2506.12355·cs.LG·June 17, 2025

Open Access

TL;DR

This paper introduces a novel LLM-friendly language and workflow that enables automatic generation of high-performance attention operators, significantly improving speed and hardware compatibility in large language models.

Contribution

The paper proposes LLM-TL, an LLM-friendly Thinking Language, together with a two-stage reasoning workflow for automatic, hardware-agnostic generation of attention operators, surpassing existing manual and library-based methods.

Findings

01. Achieves up to 35.16x speed-up on various GPUs.

02. Outperforms human-optimized libraries in most scenarios.

03. Reduces development time from months to minutes.

Abstract

The attention operator remains a critical performance bottleneck in large language models (LLMs), particularly in long-context scenarios. FlashAttention is the most widely used and effective GPU-aware acceleration algorithm, but it requires time-consuming, hardware-specific manual implementation, which limits its adaptability across GPU architectures. LLMs have shown considerable promise in code generation tasks, but they struggle to generate high-performance attention code. The key challenge is that they cannot comprehend the complex data flow and computation process of the attention operator, nor use low-level primitives to exploit GPU performance. To address this challenge, we propose an LLM-friendly Thinking Language (LLM-TL) to help LLMs decouple the generation of high-level optimization logic from low-level implementation on GPU, and enhance LLMs' understanding of attention…
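To make the "complex data flow" mentioned in the abstract more concrete, the sketch below contrasts a naive attention computation with a tiled one that processes keys and values block by block using a running (online) softmax, which is the general idea FlashAttention-style algorithms build on. It is an illustrative pure-Python sketch only: it is not the paper's LLM-TL, not generated operator code, and not a GPU implementation. The function names `naive_attention` and `tiled_attention` and the `block` parameter are assumptions introduced here for illustration.

```python
import math


def naive_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V, materializing each full score row."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        m = max(scores)  # subtract the row max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))
        ])
    return out


def tiled_attention(Q, K, V, block=2):
    """Same result, but K and V are visited in blocks with an online softmax.

    Only one block of scores exists at a time; earlier partial results are
    rescaled whenever a larger running maximum is found.
    """
    d = len(Q[0])
    dv = len(V[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        m = float("-inf")  # running maximum of the scores seen so far
        l = 0.0            # running softmax denominator
        acc = [0.0] * dv   # running unnormalized output row
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
            m_new = max(m, max(scores))
            correction = math.exp(m - m_new)  # rescales earlier partial sums
            p = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(p)
            acc = [
                a * correction + sum(pi * v[j] for pi, v in zip(p, v_blk))
                for j, a in enumerate(acc)
            ]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

Both functions should agree up to floating-point error for any block size. On a GPU the blocks would be staged through on-chip memory and the inner products mapped to hardware matrix primitives; that low-level mapping is the hardware-specific part the abstract describes as costly to write by hand.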

Taxonomy

Topics: Big Data and Digital Economy · Parallel Computing and Optimization Techniques · Multimodal Machine Learning Applications