When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
Mohamed Amine Bergach

TL;DR
This paper shows that a fused int4 KV-cache kernel on Apple Silicon can run faster than fp16 while preserving model quality, so that KV-cache quantization no longer has to trade latency for memory.
Contribution
The paper introduces a fused Metal kernel for an int4 KV cache, exposed as a HuggingFace Cache subclass, that runs faster than fp16 on Apple Silicon while compressing cache memory and preserving model quality.
Findings
The int4 kernel runs faster than fp16 across the tested prefix lengths on Gemma-3 1B and at short context on Qwen2.5-1.5B.
Compressing KV-cache memory by 3x reduces per-token cost without quality loss.
The fused kernel closes the 4-bit per-token catastrophe in Qwen models.
Abstract
KV-cache quantization is framed as a quality-latency trade-off. We show it is inverted on Apple Silicon's unified memory: a single fused Metal kernel (sign-randomized FFT, per-channel per-group abs-max, int4 nibble pack), exposed as a HuggingFace Cache subclass, runs faster than fp16 across ---token prefixes on Gemma-3 1B ( to ms/tok) and at short context on Qwen2.5-1.5B ( to through K), with persistent memory compression and quality preserved ( Qwen short-prompt; hook Gemma). The kernel's ns/vec overhead is below the bandwidth savings from compression. The fused kernel also closes Qwen's 4-bit per-token catastrophe (, reduction) at GFLOPS / . Supporting findings: and are…
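The three stages named in the abstract (sign-randomized FFT, abs-max scaling per group, int4 nibble packing) can be illustrated with a small NumPy reference. This is a sketch under stated assumptions, not the paper's Metal kernel: the group size of 32, the fp16 scales, the offset-by-8 nibble encoding, the grouping along the head dimension, and the use of a real Hartley-form transform computed from the FFT are all choices made here for illustration, and the function names are hypothetical.

```python
import numpy as np

GROUP = 32  # assumed group size; the abstract does not state one


def rotate(x, signs):
    """Random sign flip, then an orthonormal real transform computed via FFT.

    Assumption: the Hartley form (Re - Im of the DFT) is used so the output
    stays real; the paper's exact transform may differ.
    """
    f = np.fft.fft(x * signs, axis=-1)
    return (f.real - f.imag) / np.sqrt(x.shape[-1])


def unrotate(y, signs):
    """Invert rotate(): the orthonormal Hartley transform is its own inverse."""
    f = np.fft.fft(y, axis=-1)
    return ((f.real - f.imag) / np.sqrt(y.shape[-1])) * signs


def quantize_int4(x):
    """Abs-max int4 quantization per group, two 4-bit codes packed per byte."""
    g = x.reshape(-1, GROUP).astype(np.float32)
    scale = (np.abs(g).max(axis=1, keepdims=True) / 7.0).astype(np.float16)
    safe = np.where(scale == 0, 1.0, scale).astype(np.float32)  # avoid divide by zero
    q = np.clip(np.rint(g / safe), -7, 7).astype(np.int8)
    u = (q + 8).astype(np.uint8)               # shift codes into 1..15
    packed = u[:, 0::2] | (u[:, 1::2] << 4)    # even index -> low nibble, odd -> high
    return packed, scale


def dequantize_int4(packed, scale, shape):
    """Unpack nibbles and rescale back to float32."""
    q = np.empty((packed.shape[0], GROUP), dtype=np.int8)
    q[:, 0::2] = (packed & 0x0F).astype(np.int8) - 8
    q[:, 1::2] = (packed >> 4).astype(np.int8) - 8
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(shape)


rng = np.random.default_rng(0)
head_dim = 128                                   # assumed head dimension
signs = rng.choice([-1.0, 1.0], size=head_dim)   # fixed random sign pattern
keys = rng.standard_normal((4, head_dim)).astype(np.float32)

packed, scale = quantize_int4(rotate(keys, signs))
restored = unrotate(dequantize_int4(packed, scale, keys.shape), signs)
rel_err = np.linalg.norm(restored - keys) / np.linalg.norm(keys)
```

Under these assumed settings each group of 32 values occupies 16 packed bytes plus a 2-byte scale, compared with 64 bytes in fp16. The rotation spreads large entries across the vector before scaling, which is the usual motivation for pairing a randomized orthogonal transform with abs-max quantization.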
