TL;DR
SnapMLA is a hardware-aware FP8 decoding framework for MLA architectures that improves long-context decoding throughput by up to 1.91x while keeping benchmark quality near parity with the BF16 baseline.
Contribution
SnapMLA presents FP8 quantization and pipeline techniques tailored to MLA decoding, enabling up to 1.91x higher throughput with minimal quality loss.
Findings
Achieved up to 1.91x throughput improvement in long-context decoding.
Maintained benchmark quality near parity with the BF16 baseline.
Developed hardware-aware quantization and dataflow optimization techniques.
Abstract
While FP8 attention has shown substantial promise in innovations like FlashAttention-3, its integration into the decoding phase of the DeepSeek Multi-head Latent Attention (MLA) architecture presents notable challenges. These challenges include numerical heterogeneity arising from the decoupling of positional embeddings, misalignment of quantization scales in FP8 PV GEMM, and the need for optimized system-level support. In this paper, we introduce SnapMLA, an FP8 MLA decoding framework optimized to improve long-context efficiency through the following hardware-aware algorithm-kernel co-optimization techniques: (i) RoPE-Aware Per-Token KV Quantization: Motivated by our analysis of the heterogeneous quantization sensitivity inherent to the MLA KV cache, this approach preserves the RoPE part in high precision. Furthermore, per-token granularity is employed to align with the autoregressive…
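The RoPE-aware per-token idea in (i) can be illustrated with a small NumPy sketch. This is not the authors' kernel: NumPy has no FP8 dtype, so `fake_e4m3` is a hypothetical helper that only mimics E4M3 rounding (3 mantissa bits, maximum magnitude 448), and the 512 + 64 split between the latent and RoPE parts is an assumed layout chosen for illustration. The function names `quantize_kv` and `dequantize_kv` are likewise made up for this example.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def fake_e4m3(x):
    """Crude stand-in for FP8 E4M3: round to a grid with 3 mantissa bits."""
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # Per-element exponent, clamped at the E4M3 minimum normal exponent (-6)
    # so that smaller values share the fixed subnormal step 2**-9.
    exp = np.floor(np.log2(np.maximum(np.abs(x), 2.0 ** -6)))
    step = 2.0 ** (exp - 3)
    return np.round(x / step) * step


def quantize_kv(kv, latent_dim):
    """Quantize the latent part per token; leave the RoPE part untouched."""
    latent, rope = kv[..., :latent_dim], kv[..., latent_dim:]
    # One scale per token (row), chosen so the row's max maps to the FP8 max.
    amax = np.max(np.abs(latent), axis=-1, keepdims=True)
    scale = np.maximum(amax, 1e-12) / FP8_E4M3_MAX
    return fake_e4m3(latent / scale), scale, rope.copy()


def dequantize_kv(q, scale, rope):
    """Rescale the latent part and re-attach the high-precision RoPE part."""
    return np.concatenate([q * scale, rope], axis=-1)


# Assumed layout: 4 cached tokens, 512 latent dims followed by 64 RoPE dims.
rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 576))
q, scale, rope = quantize_kv(kv, latent_dim=512)
recon = dequantize_kv(q, scale, rope)
print("max abs error (latent):", np.max(np.abs(recon[:, :512] - kv[:, :512])))
print("max abs error (RoPE):  ", np.max(np.abs(recon[:, 512:] - kv[:, 512:])))
```

In this sketch each token carries its own scale, so appending a new token to the cache does not change the scales of entries already stored, and the RoPE slice is reproduced exactly because it never passes through the low-precision path.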
