Titanus: Enabling KV Cache Pruning and Quantization On-the-Fly for LLM Acceleration

Peilin Chen; Xiaoxuan Yang

arXiv:2505.17787·cs.AR·May 26, 2025

TL;DR

Titanus is a software-hardware co-designed system that enables on-the-fly pruning and quantization of KV caches in LLMs, significantly improving energy efficiency and throughput during inference.

Contribution

The paper introduces Titanus, a software-hardware co-design that combines cascade pruning-quantization (CPQ) with a hierarchical quantization extension strategy to compress the KV cache of LLMs on-the-fly.
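As a rough illustration of the cascade idea (prune first, then quantize only what survives), here is a minimal Python sketch. It is not the paper's algorithm: the magnitude-threshold pruning rule, the 4-bit symmetric quantizer, and every function name below are assumptions chosen for illustration. Keeping only (index, code) pairs for non-zero entries mirrors the idea of moving just the non-zero part of the KV cache off-chip.

```python
def prune(vec, threshold):
    """Zero out elements whose magnitude is below the threshold (assumed rule)."""
    return [x if abs(x) >= threshold else 0.0 for x in vec]


def quantize(vec, bits=4):
    """Symmetric uniform quantization; returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1
    peak = max((abs(x) for x in vec), default=0.0)
    if peak == 0.0:
        # Nothing survived pruning; any scale works for an all-zero vector.
        return [0] * len(vec), 1.0
    scale = peak / qmax
    return [round(x / scale) for x in vec], scale


def cascade_prune_quantize(vec, threshold, bits=4):
    """Prune, then quantize, then keep only the non-zero codes with their positions."""
    pruned = prune(vec, threshold)
    codes, scale = quantize(pruned, bits)
    nonzero = [(i, c) for i, c in enumerate(codes) if c != 0]
    return nonzero, scale, len(vec)


def restore(nonzero, scale, length):
    """Rebuild a dense, dequantized vector from the sparse representation."""
    out = [0.0] * length
    for i, c in nonzero:
        out[i] = c * scale
    return out


# Hypothetical slice of a Key vector for one token.
key_slice = [0.02, -0.7, 0.0, 0.3, -0.01, 0.1]
nonzero, scale, length = cascade_prune_quantize(key_slice, threshold=0.05)
approx = restore(nonzero, scale, length)
```

In this toy setting, pruning before quantizing means the small values never consume a code or a transfer slot, so both the bit width and the number of transferred entries shrink together.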

Findings

01  Achieves 159.9x higher energy efficiency than the Nvidia A100 GPU

02  Attains 49.6x higher energy efficiency and 34.8x higher throughput than FlightLLM

03  Reduces first-token generation time through new pipeline and parallelism strategies

Abstract

Large language models (LLMs) have achieved great success in various domains. Existing systems cache the Key and Value tensors within the attention block to avoid redundant computation. However, the size of the key-value cache (KV cache) is unpredictable and can be tens of times larger than the weights in long-context scenarios. In this work, we propose Titanus, a software-hardware co-design that efficiently compresses the KV cache on-the-fly. We first propose the cascade pruning-quantization (CPQ) method to reduce KV cache movement. A hierarchical quantization extension strategy is introduced to tackle the non-independent per-channel quantization issue. To further reduce KV cache movement, we transfer only the non-zero KV cache entries between the accelerator and off-chip memory. Moreover, we customize a two-stage design space exploration framework for the CPQ method. A novel pipeline and…
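To make the size claim concrete, the following back-of-the-envelope Python sketch compares KV cache and weight footprints. The model configuration (32 layers, hidden size 4096, about 7 billion parameters, 16-bit storage) and the batch and sequence sizes are hypothetical values picked for illustration, and the formula assumes standard multi-head attention where K and V each store one hidden-size vector per token per layer.

```python
def kv_cache_bytes(layers, hidden_size, seq_len, batch, bytes_per_elem=2):
    """Bytes needed for K and V across all layers (assumes plain multi-head attention)."""
    # Factor of 2: one tensor for Key and one for Value in every layer.
    return 2 * layers * hidden_size * seq_len * batch * bytes_per_elem


def weight_bytes(num_params, bytes_per_elem=2):
    """Bytes needed to store the model weights."""
    return num_params * bytes_per_elem


# Hypothetical 7B-class model, 16-bit weights and cache.
per_token = kv_cache_bytes(layers=32, hidden_size=4096, seq_len=1, batch=1)
cache = kv_cache_bytes(layers=32, hidden_size=4096, seq_len=8192, batch=64)
weights = weight_bytes(7_000_000_000)
ratio = cache / weights

print(f"KV cache per token: {per_token / 2**20:.2f} MiB")
print(f"KV cache total:     {cache / 2**30:.0f} GiB")
print(f"Cache / weights:    {ratio:.1f}x")
```

Under these assumed numbers the cache grows linearly with both batch size and context length while the weights stay fixed, which is why the cache can end up an order of magnitude larger than the weights and why its size depends on the workload rather than the model alone.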

Code & Models

Repositories

peilin-chen/titanus-for-llm-acceleration
Official