INT-FlashAttention: Enabling Flash Attention for INT8 Quantization

Shimao Chen; Zirui Liu; Zhiying Wu; Ce Zheng; Peizhuang Cong; Zihan Jiang; Yuhan Wu; Lei Su; Tong Yang

arXiv:2409.16997·cs.LG·September 27, 2024

TL;DR

INT-FlashAttention integrates INT8 quantization with FlashAttention, significantly boosting inference speed and reducing quantization error for large language models on GPUs.

Contribution

This work introduces the first INT8 quantization architecture compatible with FlashAttention, enabling fully INT8 attention operators and improved inference performance.

Findings

01. 72% faster inference than standard FlashAttention with FP16 on Ampere GPUs
02. 82% smaller quantization error than standard FlashAttention with FP8
03. Compatible with other data formats such as INT4

Abstract

As the foundation of large language models (LLMs), the self-attention module faces the challenge of quadratic time and memory complexity with respect to sequence length. FlashAttention accelerates attention computation and reduces its memory usage by leveraging the GPU memory hierarchy. A promising research direction is to integrate FlashAttention with quantization methods. This paper introduces INT-FlashAttention, the first INT8 quantization architecture compatible with the forward workflow of FlashAttention, which significantly improves the inference speed of FlashAttention on Ampere GPUs. We implement our INT-FlashAttention prototype with fully INT8 activations and general matrix-multiplication (GEMM) kernels, making it the first attention operator with fully INT8 input. As a general token-level post-training quantization framework, INT-FlashAttention is also compatible with other data…
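To make the phrase "token-level quantization with INT8 GEMM" concrete, here is a minimal pure-Python sketch. It is not the paper's kernel: the function names (`quantize_per_token`, `int8_scores`), the symmetric [-127, 127] range, and the max-abs scale rule are assumptions chosen for illustration. It only shows the general idea that each token (row) of Q and K gets its own scale, the dot products run on integers, and the result is rescaled per token pair.

```python
from typing import List, Tuple

Matrix = List[List[float]]

INT8_MAX = 127  # assumed symmetric INT8 range [-127, 127]


def quantize_per_token(x: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric per-row (per-token) quantization so that x[i] ~= scale[i] * q[i]."""
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in x:
        max_abs = max(abs(v) for v in row)
        # An all-zero token gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / INT8_MAX if max_abs > 0 else 1.0
        q_rows.append(
            [max(-INT8_MAX, min(INT8_MAX, round(v / scale))) for v in row]
        )
        scales.append(scale)
    return q_rows, scales


def int8_scores(q: Matrix, k: Matrix) -> Matrix:
    """Approximate Q @ K^T with integer dot products, then rescale each entry."""
    q_int, q_scale = quantize_per_token(q)
    k_int, k_scale = quantize_per_token(k)
    scores: Matrix = []
    for i, q_row in enumerate(q_int):
        out_row = []
        for j, k_row in enumerate(k_int):
            # Integer multiply-accumulate; a GPU kernel would typically
            # accumulate INT8 products in INT32.
            acc = sum(a * b for a, b in zip(q_row, k_row))
            # Dequantize with the scales of token i (query) and token j (key).
            out_row.append(acc * q_scale[i] * k_scale[j])
        scores.append(out_row)
    return scores
```

Calling `int8_scores(q, k)` on small float matrices returns values close to the exact `Q @ K^T`. Because the scales are chosen per token, a single token with large values does not degrade the resolution of the other tokens.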

Code & Models

Repositories

int-flashattention2024/int-flashattention
PyTorch · Official

Taxonomy

Topics: Image Processing Techniques and Applications · Image and Signal Denoising Methods · Brain Tumor Detection and Classification

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings