The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
Ranjith Chodavarapu, Lei Xu

TL;DR
This paper demonstrates that FP16 KV caching in autoregressive transformers causes deterministic divergence from cache-free inference. The cause is the non-associativity of FP16 floating-point arithmetic, which undermines the common assumption that the two execution paths are numerically equivalent.
Contribution
The paper shows that FP16 KV-cached inference diverges systematically from cache-free inference, provides empirical evidence across three open-weight models, and identifies FP16 non-associativity as the cause.
Findings
FP16 cache inference diverges systematically from cache-free computation.
Controlled FP32 tests sharply reduce the divergence, supporting FP16 non-associativity as the cause.
Architectural patterns influence divergence propagation across layers.
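The first two findings rest on one arithmetic fact: adding the same numbers in a different order can give a different result in FP16, while FP32 has enough precision to hide the effect for the same inputs. The following is a minimal sketch of that fact, not the paper's experiment. It assumes a toy list of values and uses forward versus reversed summation as a stand-in for the two accumulation orderings. The helper names `round_to` and `accumulate` are hypothetical, and rounding is emulated with the standard library's `struct` module.

```python
import struct

def round_to(fmt: str, x: float) -> float:
    """Round a Python float to IEEE half ('e') or single ('f') precision."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def accumulate(values, fmt: str) -> float:
    """Add values left to right, rounding to the target precision after each step."""
    total = 0.0
    for v in values:
        total = round_to(fmt, total + round_to(fmt, v))
    return total

# Toy data (an assumption for illustration): one large term and eight small ones.
values = [2048.0] + [1.0] * 8

# Same terms, two orderings, two precisions.
fp16_forward = accumulate(values, "e")         # large term first
fp16_reversed = accumulate(values[::-1], "e")  # small terms first
fp32_forward = accumulate(values, "f")
fp32_reversed = accumulate(values[::-1], "f")

print("FP16:", fp16_forward, fp16_reversed)    # FP16: 2048.0 2056.0
print("FP32:", fp32_forward, fp32_reversed)    # FP32: 2056.0 2056.0
```

In FP16 the spacing between representable values near 2048 is 2, so each `2048 + 1` rounds back to 2048 and the small terms are lost when the large term comes first. Summing the small terms first preserves them. Repeating the run in FP32 makes both orderings agree, which mirrors the logic of the FP32 control described above.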
Abstract
KV caching is a ubiquitous optimization in autoregressive transformer inference, long presumed to be numerically equivalent to cache-free computation. This assumption fails under standard FP16 precision: the cache-ON and cache-OFF execution paths use different floating-point accumulation orderings which, because FP16 arithmetic is non-associative, produce a deterministic divergence in decoded token sequences. Across three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, Gemma-2-2B) evaluated on GSM8K, we observe a 100% token divergence rate across all sampling strategies, including greedy decoding, which rules out sampling randomness as a cause. Cache-ON also yields higher accuracy in 8 of 9 conditions, and this accuracy difference indicates that the divergence direction is systematic rather than random. Controlled FP32 falsification reduces divergence by eight orders of…
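The excerpt does not define how the token divergence rate is computed. One plausible reading, given here only as an assumption, is the fraction of prompts for which the cache-ON and cache-OFF token sequences differ at any position. The sketch below implements that reading. The function names `first_divergence` and `divergence_rate` are hypothetical, and the token IDs are made-up toy data rather than results from the paper.

```python
from typing import Optional, Sequence, Tuple

def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the first index where two token sequences differ, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other: they diverge where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None

def divergence_rate(pairs: Sequence[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Fraction of (cache-ON, cache-OFF) output pairs that differ anywhere."""
    if not pairs:
        raise ValueError("need at least one pair of sequences")
    diverged = sum(first_divergence(a, b) is not None for a, b in pairs)
    return diverged / len(pairs)

# Toy token IDs (illustrative only, not taken from the paper).
toy_pairs = [
    ([5, 9, 2], [5, 9, 2]),  # identical outputs
    ([5, 9, 2], [5, 7, 2]),  # differ at position 1
    ([5, 9], [5, 9, 2]),     # one output is longer
]
rate = divergence_rate(toy_pairs)
print(f"divergence rate: {rate:.2f}")
```

Under greedy decoding both paths are deterministic, so any nonzero rate measured this way cannot come from sampling noise, which is the argument the abstract makes.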
