INT-FlashAttention: Enabling Flash Attention for INT8 Quantization
Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Yuhan Wu, Lei Su, Tong Yang

TL;DR
INT-FlashAttention integrates INT8 quantization with FlashAttention, significantly boosting inference speed and reducing quantization error for large language models on GPUs.
Contribution
This work introduces the first INT8 quantization architecture compatible with the forward workflow of FlashAttention, yielding an attention operator with fully INT8 inputs and faster inference.
Findings
72% faster inference than FlashAttention with FP16 on Ampere GPUs
82% smaller quantization error than FlashAttention with FP8
Compatible with various data formats like INT4
Abstract
As the foundation of large language models (LLMs), the self-attention module faces the challenge of quadratic time and memory complexity with respect to sequence length. FlashAttention accelerates attention computation and reduces its memory usage by leveraging the GPU memory hierarchy. A promising research direction is to integrate FlashAttention with quantization methods. This paper introduces INT-FlashAttention, the first INT8 quantization architecture compatible with the forward workflow of FlashAttention, which significantly improves the inference speed of FlashAttention on Ampere GPUs. We implement our INT-FlashAttention prototype with fully INT8 activations and general matrix-multiplication (GEMM) kernels, making it the first attention operator with fully INT8 input. As a general token-level post-training quantization framework, INT-FlashAttention is also compatible with other data…
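To make the idea of token-level INT8 quantization with an integer GEMM concrete, here is a minimal NumPy sketch of the score computation Q·Kᵀ/√d. It is an illustration under stated assumptions, not the authors' kernel: the helper names `quantize_per_token` and `int8_attention_scores` are hypothetical, the quantizer is assumed to be symmetric with one scale per token, and the tiling, online softmax, and handling of V that FlashAttention requires are omitted.

```python
import numpy as np

def quantize_per_token(x):
    """Symmetric INT8 quantization with one scale per row (token).

    Assumption for illustration: the largest magnitude in each row maps to 127.
    """
    max_abs = np.max(np.abs(x), axis=-1, keepdims=True)
    # Guard against all-zero rows, which would give a zero scale.
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0)
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def int8_attention_scores(q_fp, k_fp):
    """Approximate Q @ K^T / sqrt(d) using an integer matrix product."""
    q_i8, q_scale = quantize_per_token(q_fp)
    k_i8, k_scale = quantize_per_token(k_fp)
    # Accumulate in int32 so sums of int8 products cannot overflow.
    s_i32 = q_i8.astype(np.int32) @ k_i8.astype(np.int32).T
    # Dequantize: entry (i, j) is rescaled by q_scale[i] * k_scale[j].
    d = q_fp.shape[-1]
    return s_i32.astype(np.float32) * (q_scale * k_scale.T) / np.sqrt(d)

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 64)).astype(np.float32)
K = rng.standard_normal((8, 64)).astype(np.float32)

approx = int8_attention_scores(Q, K)
exact = (Q @ K.T) / np.sqrt(Q.shape[-1])
rel_err = np.abs(approx - exact).mean() / np.abs(exact).mean()
print(f"mean relative error: {rel_err:.4f}")
```

Because each token carries its own scale, the rescaling after the integer product is a rank-one outer product of the two scale vectors, so it can be applied per output tile without leaving the integer GEMM path.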
