Runtime-Certified Bounded-Error Quantized Attention
Dean Calver

TL;DR
This paper introduces a runtime-certified attention mechanism for quantized KV caches in large language models, ensuring bounded approximation errors and reliable fallback to exact attention outputs during inference.
Contribution
It presents a tiered architecture with online error bounds and adaptive fallback, enabling safe, high-compression attention with guarantees against catastrophic failures.
Findings
Matches dense FP16 KV quality within noise for language modeling
Recovers from catastrophic failures in naive INT8/INT4 baselines
Provides local, per-head, per-step error certification
Abstract
KV cache quantization reduces the memory cost of long-context LLM inference, but introduces approximation error that is typically validated only empirically. Existing systems rely on average-case robustness, with no mechanism to detect or recover from failures at runtime. We present a tiered KV cache architecture that enables runtime-certified attention: INT8 keys and INT4 values are stored in GPU memory, while FP16 originals are retained in system RAM for deterministic fallback. A two-term error decomposition yields per-head, per-step bounds on (i) attention distribution distortion from key quantization and (ii) value reconstruction error. These bounds are computed online and used to drive adaptive precision selection and a multi-stage fallback ladder, which guarantees recovery to the exact dense attention output when required. Across PG-19, NIAH, and RULER benchmarks on LLaMA 3.1-8B…
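The certify-then-fall-back idea in the abstract can be illustrated with a minimal single-head, single-query sketch. It is not the paper's method: the abstract does not state its bounds, so the code substitutes a generic worst-case bound. If every logit moves by at most delta, the softmax distributions differ by at most exp(2*delta) - 1 in L1 norm, and the output error is at most that distortion times the largest quantized value norm, plus the largest value reconstruction error. All function names, the per-row symmetric quantizer, and the single `tol` threshold are assumptions made for illustration. The multi-stage fallback ladder is collapsed into one step, and the GPU/system-RAM tiers are not modelled.

```python
import math

def quantize_row(row, bits):
    """Symmetric per-row quantization (assumed scheme); returns (dequantized row, scale)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in row) / qmax or 1.0  # avoid a zero scale for all-zero rows
    return [round(x / scale) * scale for x in row], scale

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def attend(q, keys, values):
    """Exact single-query softmax attention over the given keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    p = softmax(scores)
    return [sum(p[i] * values[i][j] for i in range(len(values)))
            for j in range(len(values[0]))]

def certified_attention(q, keys, values, tol):
    """Return (output, error_bound, used_fallback) for one head and one step."""
    d, dv = len(q), len(values[0])
    kq = [quantize_row(k, 8) for k in keys]    # INT8-style keys
    vq = [quantize_row(v, 4) for v in values]  # INT4-style values
    # Rounding moves each element by at most scale/2, so each row moves by
    # at most (scale/2) * sqrt(dim) in L2 norm.
    k_err = 0.5 * max(s for _, s in kq) * math.sqrt(d)
    v_err = 0.5 * max(s for _, s in vq) * math.sqrt(dv)
    # Term (i): logit shift from key quantization, then softmax L1 distortion.
    delta = norm(q) * k_err / math.sqrt(d)
    dist_l1 = min(2.0, math.expm1(2.0 * delta))
    # Term (ii): value reconstruction error, added to the distortion term.
    bound = dist_l1 * max(norm(v) for v, _ in vq) + v_err
    if bound <= tol:
        return attend(q, [k for k, _ in kq], [v for v, _ in vq]), bound, False
    # Bound too loose: recompute from the retained full-precision originals.
    return attend(q, keys, values), 0.0, True
```

The bound uses only the quantization scales, the query norm, and the quantized values, so it can be evaluated without touching the full-precision originals. They are read only when the bound exceeds `tol`.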
