Runtime-Certified Bounded-Error Quantized Attention

Dean Calver

arXiv:2605.20868·cs.LG·May 21, 2026

TL;DR

This paper introduces a runtime-certified attention mechanism for quantized KV caches in large language models, ensuring bounded approximation errors and reliable fallback to exact attention outputs during inference.

Contribution

It presents a tiered architecture with online error bounds and adaptive fallback, enabling safe, high-compression attention with guarantees against catastrophic failures.

Findings

01

Matches dense FP16 KV quality within noise for language modeling

02

Recovers from catastrophic failures in naive INT8/INT4 baselines

03

Provides local, per-head, per-step error certification

Abstract

KV cache quantization reduces the memory cost of long-context LLM inference, but introduces approximation error that is typically validated only empirically. Existing systems rely on average-case robustness, with no mechanism to detect or recover from failures at runtime. We present a tiered KV cache architecture that enables runtime-certified attention: INT8 keys and INT4 values are stored in GPU memory, while FP16 originals are retained in system RAM for deterministic fallback. A two-term error decomposition yields per-head, per-step bounds on (i) attention distribution distortion from key quantization and (ii) value reconstruction error. These bounds are computed online and used to drive adaptive precision selection and a multi-stage fallback ladder, which guarantees recovery to the exact dense attention output when required. Across PG-19, NIAH, and RULER benchmarks on LLaMA 3.1-8B…
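The abstract does not give the bound formulas, so the following is only a minimal sketch of the general idea for a single head and a single decoding step, not the paper's method. It assumes symmetric per-tensor quantization, a textbook softmax perturbation bound for term (i), and a half-step rounding bound for term (ii). All names (`quantize`, `error_bound`, `certified_attention`, `tolerance`) are hypothetical, the fallback has one stage rather than a multi-stage ladder, and quantization is redone on every call for brevity where a real cache would quantize once at write time.

```python
import math
import random


def quantize(rows, bits):
    """Symmetric uniform quantization with one scale for the whole tensor.

    Returns the dequantized rows and the worst-case absolute rounding
    error per element (half a quantization step).
    """
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(x) for row in rows for x in row)
    scale = peak / qmax if peak > 0 else 1.0
    deq = [[max(-qmax, min(qmax, round(x / scale))) * scale for x in row]
           for row in rows]
    return deq, scale / 2


def attention(q, keys, values):
    """Plain scaled dot-product attention for one query vector."""
    d = len(q)
    logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in logits]
    z = sum(weights)
    probs = [w / z for w in weights]
    return [sum(p * v[c] for p, v in zip(probs, values))
            for c in range(len(values[0]))]


def error_bound(q, values_deq, key_err, value_err):
    """Assumed two-term bound on the max-norm output error.

    Term (i): every logit moves by at most delta, so each softmax
    probability changes by a factor in [exp(-2*delta), exp(2*delta)],
    giving an L1 distortion of at most min(2, exp(2*delta) - 1).
    Term (ii): a convex combination of values is off by at most the
    per-element value rounding error.
    """
    delta = sum(abs(x) for x in q) * key_err / math.sqrt(len(q))
    distribution_term = min(2.0, math.expm1(2 * delta))
    v_peak = max(abs(x) for row in values_deq for x in row)
    return distribution_term * v_peak + value_err


def certified_attention(q, keys, values, tolerance):
    """Use INT8 keys / INT4 values if the bound allows, else fall back."""
    k_deq, k_err = quantize(keys, 8)
    v_deq, v_err = quantize(values, 4)
    bound = error_bound(q, v_deq, k_err, v_err)
    if bound <= tolerance:
        return attention(q, k_deq, v_deq), bound, "quantized"
    # Fallback: recompute from the retained full-precision originals.
    return attention(q, keys, values), 0.0, "exact"


random.seed(0)
d, n = 8, 16
q = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

out, bound, mode = certified_attention(q, keys, values, tolerance=1.0)
exact = attention(q, keys, values)
actual = max(abs(a - b) for a, b in zip(out, exact))
print(f"mode={mode} certified_bound={bound:.4f} actual_error={actual:.4f}")
```

In this sketch the bound depends only on the query, the quantization step sizes, and the dequantized values, so it can be evaluated without touching the full-precision copies; those are read only when the bound exceeds the tolerance.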
