FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration

Daehyeon Baek; Jieun Choi; Jimyoung Son; Kyungmin Bin; Seungbeom Choi; Kihyo Moon; Minsung Jang; Hyojung Lee

arXiv:2505.20839·cs.LG·July 21, 2025

TL;DR

FireQ co-designs an INT4-FP8 post-training quantization framework with matching optimized kernels to accelerate large language model inference, which is limited by memory bandwidth, while keeping accuracy loss negligible.

Contribution

The paper presents FireQ, a post-training quantization (PTQ) framework with a specialized INT4-FP8 matrix multiplication kernel and RoPE-aware quantization techniques that speed up LLM inference.

Findings

01 Achieves 1.68x faster feed-forward network (FFN) inference on Llama2-7B.

02 Attains 1.26x faster prefill on Llama3-8B.

03 Maintains negligible accuracy loss.

Abstract

As large language models become increasingly prevalent, memory bandwidth constraints significantly limit inference throughput, motivating post-training quantization (PTQ). In this paper, we propose FireQ, a co-designed PTQ framework and an INT4-FP8 matrix multiplication kernel that accelerates LLM inference across all linear layers. Specifically, FireQ quantizes linear layer weights and key-values to INT4, and activations and queries to FP8, significantly enhancing throughput. Additionally, we introduce a three-stage pipelining for the prefill phase, which modifies the FlashAttention-3 kernel, effectively reducing time-to-first-token in the prefill phase. To minimize accuracy loss from quantization, we develop novel outlier smoothing techniques tailored separately for linear and attention layers. In linear layers, we explicitly use per-tensor scaling to prevent underflow caused by the…
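
The INT4-weight, FP8-activation split described in the abstract can be illustrated with a small pure-Python sketch. This is not the FireQ kernel or its smoothing method: the function names are hypothetical, the per-row symmetric weight scale is an assumption, and the choice of the E4M3 variant of FP8 is also an assumption (the abstract only says "FP8"). Subnormals and NaN handling are ignored.

```python
import math

INT4_MIN, INT4_MAX = -8, 7
FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format (assumed variant)


def quantize_int4(row):
    """Symmetric INT4 quantization of one weight row; returns (codes, scale)."""
    peak = max(abs(w) for w in row)
    scale = peak / INT4_MAX if peak > 0 else 1.0  # avoid dividing by zero
    codes = [max(INT4_MIN, min(INT4_MAX, round(w / scale))) for w in row]
    return codes, scale


def to_fp8_e4m3(x):
    """Round x to 4 significant bits and clamp to the E4M3 range.

    Simplified simulation: subnormals and NaN are not modeled.
    """
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(abs(x))  # mantissa lies in [0.5, 1)
    rounded = math.ldexp(round(mantissa * 16) / 16, exponent)
    return math.copysign(min(rounded, FP8_E4M3_MAX), x)


def int4_fp8_linear(weight_rows, x):
    """Linear layer with INT4 weight codes and simulated FP8 activations."""
    acts = [to_fp8_e4m3(v) for v in x]
    out = []
    for row in weight_rows:
        codes, scale = quantize_int4(row)
        # Accumulate integer code times FP8 value, then rescale once per row.
        acc = sum(c * a for c, a in zip(codes, acts))
        out.append(acc * scale)
    return out


print(int4_fp8_linear([[3.5, -1.0, 0.5]], [1.0, 2.0, 4.0]))  # prints [3.5]
```

In this toy case the weights and activations are exactly representable, so the quantized result matches the full-precision dot product. With arbitrary values, both the INT4 rounding and the FP8 rounding introduce error, which is the accuracy loss that outlier smoothing aims to limit.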

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning