FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs

Pengcuo Dege; Qiuming Luo; Rui Mao; Chang Kong

arXiv:2506.01969·cs.DC·June 5, 2025

TL;DR

FlashMLA-ETAP introduces a transpose attention pipeline that accelerates MLA inference on NVIDIA H20 GPUs by reducing redundant computation. It reaches a 2.78x speedup over FlashMLA at 64K sequence length (batch size 16) while keeping RMSE 15.2x lower than FlashAttention-3.

Contribution

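The paper presents ETAP, a reconfiguration of attention computation that uses transposition to align the KV context length with the M-dimension of WGMMA operations. This improves MLA inference efficiency for single-instance deployment on a single multi-GPU server, and the paper supports it with theoretical analysis and practical integration.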

Findings

01 Achieves a 2.78x speedup over FlashMLA at 64K sequence length (batch size 16)

02 Achieves a 15.2x lower RMSE (\(1.25 \times 10^{-5}\)) than FlashAttention-3

03 Supports seamless integration into existing frameworks

Abstract

Deploying the DeepSeek-R1 671B model on a single multi-GPU server makes efficient inference of Multi-Head Latent Attention (MLA) challenging. This paper introduces FlashMLA-ETAP, a novel framework that enhances MLA inference for the single-instance deployment scenario on NVIDIA H20 GPUs. We propose the Efficient Transpose Attention Pipeline (ETAP), which reconfigures attention computation through transposition to align the KV context length with the \(M\)-dimension in WGMMA operations, significantly reducing redundant computations. FlashMLA-ETAP achieves a 2.78x speedup over FlashMLA at 64K sequence length (batch size 16), with 5.24x and 4.94x improvements over FlashAttention-3 and FlashInfer, respectively, while maintaining numerical stability with a 15.2x lower RMSE (\(1.25 \times 10^{-5}\)) than FlashAttention-3. Furthermore, ETAP's design enables seamless integration into frameworks…
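
The transposition idea in the abstract can be illustrated with a small algebraic check. The sketch below is not the authors' kernel and does not model WGMMA tiling. It only assumes the standard identity \(O^T = V^T \,\mathrm{softmax}(K Q^T / \sqrt{d})\), with the softmax taken per query column, and verifies in pure Python that this ordering, which puts the KV length in the leading matrix dimension, gives the same output as the usual \(O = \mathrm{softmax}(Q K^T / \sqrt{d})\, V\). All function names and sizes here are hypothetical and chosen for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax(values):
    """Numerically stable softmax over a flat list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_standard(q, k, v):
    """O = softmax(Q K^T / sqrt(d)) V; the query count is the leading dimension."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))                      # (n_q, n_kv)
    probs = [softmax([s * scale for s in row]) for row in scores]
    return matmul(probs, v)                               # (n_q, d_v)


def attention_transposed(q, k, v):
    """O^T = V^T softmax(K Q^T / sqrt(d)); the KV length is the leading dimension."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores_t = matmul(k, transpose(q))                    # (n_kv, n_q)
    # Each column of scores_t belongs to one query, so normalize column-wise.
    probs_t = transpose([softmax([s * scale for s in col]) for col in zip(*scores_t)])
    out_t = matmul(transpose(v), probs_t)                 # (d_v, n_q)
    return transpose(out_t)                               # back to (n_q, d_v)


def rmse(a, b):
    """Root-mean-square error between two equally shaped matrices."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return math.sqrt(sum(diffs) / len(diffs))


# Illustrative sizes only: few query rows, many KV positions (a decode-like shape).
random.seed(0)
n_q, n_kv, d, d_v = 4, 64, 8, 8
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_q)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_kv)]
v = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(n_kv)]

out_ref = attention_standard(q, k, v)
out_etap = attention_transposed(q, k, v)
err = rmse(out_ref, out_etap)
print(f"RMSE between the two orderings: {err:.3e}")
```

Because this check runs in double precision on the CPU, its error says nothing about the RMSE figures reported in the abstract, which concern GPU kernels. The point is only that the transposed ordering is mathematically equivalent, so which dimension the long KV axis occupies in each matrix product is a free choice.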

Code & Models

Repositories

pengcuo/flashmla-etap (PyTorch, official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy

Methods: Softmax · Attention Is All You Need · ALIGN