TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks

Hanzhang Shen; Haoran Wu; Yiren Zhao; Robert Mullins

arXiv:2605.17170·cs.LG·May 19, 2026

TL;DR

This paper introduces TriAxialKV, a mixed-precision KV-cache quantization method that weighs token importance along three axes (temporal recency, modality, and semantic role) to enable low-precision inference for agentic LLM workloads.

Contribution

TriAxialKV tags each token along the three axes, calibrates sensitivity to KV-cache compression, and allocates mixed-precision bitwidths accordingly, improving efficiency and throughput in LLM inference systems.
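The tag, calibrate, and allocate steps can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `TriAxialTag` and `allocate_bits`, the sensitivity scores, the `recent_window` and `threshold` parameters, and the allocation rule itself are hypothetical and are not taken from the paper, whose actual calibration procedure is not described in this summary.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"


class Role(Enum):
    USER_QUERY = "user_query"
    TOOL_CALL = "tool_call"
    OBSERVATION = "observation"
    REASONING = "reasoning"


@dataclass(frozen=True)
class TriAxialTag:
    """Hypothetical per-token tag covering the three axes."""
    turns_ago: int       # temporal recency: 0 means the current turn
    modality: Modality
    role: Role


def allocate_bits(tags, sensitivity, recent_window=1, threshold=0.5):
    """Pick 2 or 4 bits per token (illustrative rule, not the paper's).

    `sensitivity` maps (modality, role) to a made-up score in [0, 1]
    that stands in for a calibrated sensitivity measurement.
    """
    bits = []
    for tag in tags:
        # Unknown tag combinations are treated as fully sensitive.
        score = sensitivity.get((tag.modality, tag.role), 1.0)
        if tag.turns_ago < recent_window or score >= threshold:
            bits.append(4)   # recent or sensitive tokens keep more precision
        else:
            bits.append(2)   # old, insensitive tokens are compressed harder
    return bits


tags = [
    TriAxialTag(0, Modality.TEXT, Role.USER_QUERY),
    TriAxialTag(3, Modality.IMAGE, Role.OBSERVATION),
    TriAxialTag(2, Modality.TEXT, Role.TOOL_CALL),
]
# Placeholder scores; a real system would measure these on calibration data.
sensitivity = {
    (Modality.IMAGE, Role.OBSERVATION): 0.1,
    (Modality.TEXT, Role.TOOL_CALL): 0.9,
}
bits = allocate_bits(tags, sensitivity)
print(bits)
```

The rule deliberately combines all three axes in one decision, so a token's bitwidth depends on its recency and on the joint (modality, role) score rather than on any single dimension.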

Findings

01. Achieves a 4.5× larger KV cache size while maintaining accuracy.

02. Supports 30% higher end-to-end throughput on GPU systems.

03. Matches BF16 KV cache accuracy using INT2/INT4 quantization.
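As a back-of-the-envelope check on how a mixed INT2/INT4 cache relates to a capacity figure like 4.5×, the arithmetic below computes the average bits per element for a given INT2 share. It assumes a 16-bit BF16 baseline, reads "4.5× larger" as 4.5× more cached elements in the same memory, and ignores quantization metadata such as scales and zero points. The function names are hypothetical, and the resulting split is only what this simplified model implies, not a figure reported by the paper.

```python
def average_bits(frac_int2):
    """Average bits per element when a fraction is INT2 and the rest INT4."""
    return 2.0 * frac_int2 + 4.0 * (1.0 - frac_int2)


def capacity_gain(frac_int2, baseline_bits=16.0):
    """Capacity multiplier over a 16-bit cache, ignoring metadata overhead."""
    return baseline_bits / average_bits(frac_int2)


def int2_fraction_for_gain(gain, baseline_bits=16.0):
    """INT2 share that this simplified model needs to reach a given gain."""
    target_bits = baseline_bits / gain
    return (4.0 - target_bits) / 2.0


print(capacity_gain(0.0))            # all INT4
print(capacity_gain(1.0))            # all INT2
print(int2_fraction_for_gain(4.5))   # share of INT2 under these assumptions
```

Under this model an all-INT4 cache gives 4× and an all-INT2 cache gives 8×, so a 4.5× gain corresponds to an average of about 3.56 bits per element. Real systems also store per-group scales and zero points, which lowers the achievable gain for a given split.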

Abstract

Agentic workloads have emerged as a major class of LLM inference workload. They differ significantly from chat-only workloads, requiring long-context processing, the ability to handle multimodal inputs, and structured multi-turn interactions with tool-calling capabilities. As a result, their context exhibits structure that can carry different importance along three key axes: temporal recency to the current turn, modality such as text or image tokens, and semantic role such as user queries, tool calls, observations, or reasoning. These axes capture distinct token behaviors and lead to different sensitivities to KV-cache compression. However, existing KV-cache quantization methods are typically homogeneous or exploit heterogeneity along only a single dimension, such as temporal proximity or modality, overlooking the interactions among them. To this end, we introduce TriAxialKV, a novel…
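To make the idea of sensitivity to KV-cache compression concrete, the sketch below applies generic asymmetric round-to-nearest quantization to one group of values at 2 and 4 bits and compares the reconstruction error. This is a textbook uniform quantizer used for illustration, not the quantizer from the paper; the group size of 128, the Gaussian test data, and the function names are all assumptions.

```python
import random


def fake_quantize(values, bits):
    """Quantize to `bits` with a per-group scale and offset, then dequantize."""
    levels = (1 << bits) - 1                      # 3 steps for INT2, 15 for INT4
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


def mean_abs_error(values, bits):
    """Mean absolute reconstruction error after a quantize/dequantize round trip."""
    restored = fake_quantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


rng = random.Random(0)                            # fixed seed for repeatability
group = [rng.gauss(0.0, 1.0) for _ in range(128)] # stand-in for one KV group
err2 = mean_abs_error(group, 2)
err4 = mean_abs_error(group, 4)
print(f"INT2 error: {err2:.4f}, INT4 error: {err4:.4f}")
```

The 2-bit round trip loses noticeably more information than the 4-bit one, which is why a mixed-precision scheme would reserve the lower bitwidth for tokens whose errors matter least.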
