SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining

Yifan Zhang; Zunhai Su; Shuhao Hu; Rui Yang; Wei Wu; Yulei Qian; Yuchen Xie; Xunliang Cai

arXiv:2602.10718·cs.LG·April 29, 2026

TL;DR

SnapMLA is a hardware-aware FP8 decoding framework for MLA architectures that raises long-context decoding throughput by up to 1.91x while keeping benchmark quality near the BF16 baseline.

Contribution

The paper presents FP8 quantization and pipelining techniques tailored for MLA decoding, enabling up to 1.91x higher throughput with minimal quality loss.

Findings

01 Achieved up to 1.91x throughput improvement in long-context decoding.

02 Maintained near-parity benchmark quality with the BF16 baseline.

03 Developed hardware-aware quantization and dataflow optimization techniques.

Abstract

While FP8 attention has shown substantial promise in innovations like FlashAttention-3, its integration into the decoding phase of the DeepSeek Multi-head Latent Attention (MLA) architecture presents notable challenges. These challenges include numerical heterogeneity arising from the decoupling of positional embeddings, misalignment of quantization scales in FP8 PV GEMM, and the need for optimized system-level support. In this paper, we introduce SnapMLA, an FP8 MLA decoding framework optimized to improve long-context efficiency through the following hardware-aware algorithm-kernel co-optimization techniques: (i) RoPE-Aware Per-Token KV Quantization: Motivated by our analysis of the heterogeneous quantization sensitivity inherent to the MLA KV cache, this approach preserves the RoPE part in high precision. Furthermore, per-token granularity is employed to align with the autoregressive…
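
The RoPE-aware per-token KV quantization named in the abstract can be illustrated with a minimal sketch. This is not the authors' kernel: the function names (`fake_fp8_e4m3`, `quantize_token`, `dequantize_token`), the layout (RoPE dimensions stored last in each cached vector), and the software emulation of FP8 E4M3 rounding are all assumptions made for illustration. The emulation rounds to four significant bits and clips to the E4M3 maximum of 448, ignoring subnormal behavior.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def fake_fp8_e4m3(x: float) -> float:
    """Approximate E4M3 rounding: clip to +/-448, keep 4 significant bits.

    Illustrative emulation only; subnormals are not modeled.
    """
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    mant, exp = math.frexp(x)  # x == mant * 2**exp, 0.5 <= |mant| < 1
    return math.ldexp(round(mant * 16.0) / 16.0, exp)


def quantize_token(vec, rope_dim):
    """Quantize one token's cached vector (hypothetical layout).

    The last `rope_dim` entries (assumed to be the RoPE part, rope_dim > 0)
    are kept in their original precision; the rest share one per-token scale.
    """
    latent, rope = vec[:-rope_dim], vec[-rope_dim:]
    amax = max(abs(v) for v in latent)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    q = [fake_fp8_e4m3(v / scale) for v in latent]
    return q, scale, list(rope)


def dequantize_token(q, scale, rope):
    """Reconstruct the vector: rescale the latent part, append RoPE as-is."""
    return [v * scale for v in q] + rope


# Autoregressive decoding: each new token is quantized once with its own
# scale, so appending to the cache never requantizes earlier tokens.
cache = []
for token in ([0.5, -3.0, 0.02, 1.25, 100.0, -200.0],
              [40.0, 8.0, -0.1, 2.0, 0.3, 0.7]):
    cache.append(quantize_token(token, rope_dim=2))
```

Because the scale is computed per token, a token with a large latent magnitude (the second one above) does not degrade the resolution of its neighbors, and the RoPE entries round-trip unchanged.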

Code & Models

Repositories

meituan-longcat/SGLang-FluentLLM (GitHub)
