When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

Mohamed Amine Bergach

arXiv:2605.05699·cs.PF·May 8, 2026

TL;DR

A fused int4 KV-cache kernel for Apple Silicon runs faster than the fp16 cache on Gemma-3 1B and, at short context, on Qwen2.5-1.5B, while cutting persistent cache memory 3x and preserving model quality.

Contribution

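The paper introduces a single fused Metal kernel for an int4 KV cache (sign-randomized FFT, per-channel λ, per-group abs-max, int4 nibble pack), exposed as a HuggingFace Cache subclass. On Apple Silicon's unified memory it outruns fp16 while compressing persistent cache memory 3x and preserving model quality.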

Findings

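01

On Gemma-3 1B the int4 kernel is 3 to 8% faster than fp16 in ms/tok across 256 to 4096-token prefixes; on Qwen2.5-1.5B it is 0.7 to 2.6% faster through 1K tokens.

02

Persistent KV-cache memory shrinks 3x with quality preserved (ΔPPL = 0.000 on Qwen short prompts; +3.6 hook ΔPPL on Gemma).

03

The fused kernel cuts Qwen's 4-bit per-token perplexity catastrophe from ΔPPL = +7975 to +638.6, a 12.5x reduction.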
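
One plausible reason the compression in finding 02 is 3x rather than the ideal 4x for a 16-bit to 4-bit conversion is that per-group scales must be stored next to the 4-bit codes. The sketch below works through that arithmetic. The group sizes and the fp16 scale width are assumptions for illustration only; the listing does not state the paper's actual values, and `compression_ratio` is a hypothetical helper.

```python
def compression_ratio(group, code_bits=4, scale_bits=16, baseline_bits=16):
    """Effective compression versus a 16-bit baseline.

    Assumes one scale of `scale_bits` is stored per `group` values
    (an assumption; the paper's exact metadata layout is not given here).
    """
    bits_per_value = code_bits + scale_bits / group
    return baseline_bits / bits_per_value


# Smaller groups track outliers better but spend more bits on scales.
for g in (8, 16, 32, 64):
    print(f"group={g:3d}  ratio={compression_ratio(g):.2f}x")
```

Under these assumptions the ratio approaches 4x only as the group grows, so a reported figure near 3x is consistent with a modest amount of per-group and per-channel metadata.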

Abstract

KV-cache quantization is framed as a quality-latency trade-off. We show it is inverted on Apple Silicon's unified memory: a single fused Metal kernel (sign-randomized FFT + per-channel $\lambda$ + per-group abs-max + int4 nibble pack), exposed as a HuggingFace Cache subclass, runs faster than fp16 across $256$–$4096$-token prefixes on Gemma-3 1B ($-3$ to $-8\%$ ms/tok) and at short context on Qwen2.5-1.5B ($-0.7$ to $-2.6\%$ through 1K), with $3\times$ persistent memory compression and quality preserved ($\Delta\mathrm{PPL} = 0.000$ Qwen short-prompt; $+3.6$ hook $\Delta\mathrm{PPL}$ Gemma). The kernel's $\sim 25$ ns/vec overhead is below the bandwidth savings from $3\times$ compression. The fused kernel also closes Qwen's 4-bit per-token catastrophe ($\Delta\mathrm{PPL} = +7975 \to +638.6$, $12.5\times$ reduction) at $182$ GFLOPS / $D = 128$. Supporting findings: SRFT and SRHT are…
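
The last two stages named in the abstract, per-group abs-max scaling and int4 nibble packing, can be illustrated with a minimal pure-Python sketch. This is not the paper's Metal kernel: the sign-randomized FFT and per-channel $\lambda$ steps are omitted, and the group size of 32, the symmetric $[-7, 7]$ code range, and the function names are assumptions made for illustration.

```python
import random

GROUP = 32  # assumed group size; the abstract does not state it


def quantize_int4(vec, group=GROUP):
    """Quantize floats to int4 with one abs-max scale per group.

    Returns (packed, scales): two 4-bit codes per byte, one float scale per group.
    """
    assert len(vec) % group == 0 and group % 2 == 0
    packed, scales = bytearray(), []
    for start in range(0, len(vec), group):
        chunk = vec[start:start + group]
        amax = max(abs(x) for x in chunk)
        scale = amax / 7.0 if amax > 0 else 1.0  # map the group onto [-7, 7]
        scales.append(scale)
        # Round, clamp, then shift by 8 so each code fits an unsigned nibble.
        codes = [max(-7, min(7, round(x / scale))) + 8 for x in chunk]
        for lo, hi in zip(codes[0::2], codes[1::2]):
            packed.append(lo | (hi << 4))  # low nibble first
    return bytes(packed), scales


def dequantize_int4(packed, scales, group=GROUP):
    """Invert quantize_int4 up to rounding error."""
    out = []
    for i, byte in enumerate(packed):
        scale = scales[(2 * i) // group]  # both nibbles share one group
        out.append(((byte & 0x0F) - 8) * scale)
        out.append(((byte >> 4) - 8) * scale)
    return out


random.seed(0)
key = [random.gauss(0.0, 1.0) for _ in range(128)]  # one D = 128 vector
packed, scales = quantize_int4(key)
restored = dequantize_int4(packed, scales)
print(len(packed), "bytes of codes,", len(scales), "scales")
```

With these choices each value is reconstructed to within half a scale step of its group, which is why a transform that flattens outliers before quantization (the role the abstract assigns to the sign-randomized FFT) can reduce error: a smaller abs-max gives a finer step.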
