TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks
Hanzhang Shen, Haoran Wu, Yiren Zhao, Robert Mullins

TL;DR
This paper introduces TriAxialKV, a mixed-precision KV-cache quantization method that weighs token importance along three axes (temporal recency, modality, and semantic role), enabling efficient low-precision inference for agentic workloads in large language models.
Contribution
TriAxialKV assigns each token a triaxial tag, calibrates the tokens' sensitivity to KV-cache quantization, and allocates mixed-precision bitwidths accordingly, improving efficiency and throughput in LLM inference systems.
Findings
Supports a 4.5× larger KV cache while maintaining accuracy.
Supports 30% higher end-to-end throughput on GPU systems.
Matches BF16 KV cache accuracy using INT2/INT4 quantization.
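A back-of-envelope reading of these figures, as a minimal sketch: assuming the 4.5× capacity gain is measured against a 16-bit BF16 cache and ignoring the storage cost of scales and zero-points (both assumptions, not statements from the paper), the implied average bitwidth and the corresponding INT4/INT2 split can be computed as follows. The function names are illustrative only.

```python
def average_bits(baseline_bits: float, capacity_gain: float) -> float:
    """Average bits per cached element implied by a capacity gain.

    Assumes the gain is relative to `baseline_bits` and that quantization
    metadata (scales, zero-points) is free, which it is not in practice.
    """
    return baseline_bits / capacity_gain


def high_precision_fraction(avg_bits: float, low: int = 2, high: int = 4) -> float:
    """Fraction f of elements stored at `high` bits such that
    f * high + (1 - f) * low == avg_bits."""
    return (avg_bits - low) / (high - low)


avg = average_bits(16, 4.5)
frac = high_precision_fraction(avg)
print(f"average bits per element: {avg:.2f}")
print(f"fraction at INT4 under these assumptions: {frac:.2f}")
```

Because real quantized caches also store per-group metadata, the true share of INT4 tokens would be lower than this idealized estimate.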
Abstract
Agentic tasks have emerged as a major workload for LLM inference. They differ significantly from chat-only workloads, requiring long-context processing, the ability to handle multimodal inputs, and structured multi-turn interactions with tool-calling capabilities. As a result, their context exhibits structure that can carry different importance along three key axes: temporal recency to the current turn, modality such as text or image tokens, and semantic role such as user queries, tool calls, observations, or reasoning. These axes capture distinct token behaviors and lead to different sensitivities to KV-cache compression. However, existing KV-cache quantization methods are typically homogeneous or exploit heterogeneity along only a single dimension, such as temporal proximity or modality, overlooking the interactions among them. To this end, we introduce TriAxialKV, a novel…
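The three axes named in the abstract can be pictured as a per-token tag that drives a bitwidth choice. The sketch below is illustrative only: the `TriAxialTag` type, the `choose_bits` policy and its thresholds, and the simple min/max quantizer are all assumptions made for this example, not the calibrated allocation or kernels the paper describes.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"


class Role(Enum):
    USER_QUERY = "user_query"
    TOOL_CALL = "tool_call"
    OBSERVATION = "observation"
    REASONING = "reasoning"


@dataclass(frozen=True)
class TriAxialTag:
    turns_ago: int      # temporal recency: 0 means the current turn
    modality: Modality  # text or image token
    role: Role          # semantic role of the token in the dialogue


def choose_bits(tag: TriAxialTag, recent_turns: int = 1) -> int:
    """Hypothetical hand-written policy (NOT the paper's calibration):
    keep recent tokens and textual user queries at INT4, the rest at INT2."""
    if tag.turns_ago < recent_turns:
        return 4
    if tag.modality is Modality.TEXT and tag.role is Role.USER_QUERY:
        return 4
    return 2


def quantize(values: list[float], bits: int) -> tuple[list[int], float, float]:
    """Asymmetric uniform quantization of one group to `bits` bits."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes: list[int], scale: float, lo: float) -> list[float]:
    """Map integer codes back to approximate floating-point values."""
    return [c * scale + lo for c in codes]


old_image = TriAxialTag(turns_ago=5, modality=Modality.IMAGE, role=Role.OBSERVATION)
codes, scale, lo = quantize([0.0, 1.0, 2.0, 3.0], choose_bits(old_image))
print(choose_bits(old_image), codes, dequantize(codes, scale, lo))
```

In this toy policy the axes interact: an old image observation drops to INT2, while an equally old textual user query stays at INT4. A calibrated method would derive such assignments from measured sensitivity rather than fixed rules.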
