TL;DR
Titanus is a software-hardware co-designed system that prunes and quantizes KV caches in LLMs on the fly, cutting KV cache movement to off-chip memory and improving energy efficiency and throughput during inference.
Contribution
The paper introduces Titanus, a software-hardware co-design that combines a cascade pruning-quantization (CPQ) method with a hierarchical quantization extension strategy to compress KV caches in LLMs efficiently.
Findings
Achieves 159.9x the energy efficiency of an Nvidia A100 GPU
Attains 49.6x the energy efficiency and 34.8x the throughput of FlightLLM
Reduces first-token generation time with new pipeline and parallelism strategies
Abstract
Large language models (LLMs) have achieved great success in various domains. Existing systems cache the Key and Value tensors within the attention block to avoid redundant computation. However, the size of the key-value cache (KV cache) is unpredictable and can even be tens of times larger than the weights in long-context scenarios. In this work, we propose Titanus, a software-hardware co-design that efficiently compresses the KV cache on the fly. We first propose the cascade pruning-quantization (CPQ) method to reduce KV cache movement. A hierarchical quantization extension strategy is introduced to tackle the non-independent per-channel quantization issue. To further reduce KV cache movement, we transfer only the non-zero KV cache between the accelerator and off-chip memory. Moreover, we customize a two-stage design space exploration framework for the CPQ method. A novel pipeline and…
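To make the abstract's data path concrete, here is a minimal sketch of a prune-then-quantize step followed by a non-zero-only transfer format. It is not the paper's CPQ method: the abstract does not state the pruning criterion, the bit widths, or how the hierarchical quantization extension works, so this sketch assumes magnitude-threshold pruning and symmetric per-vector quantization as stand-ins. All function names and parameters (`prune_by_magnitude`, `quantize_symmetric`, `pack_nonzero`, `unpack`, `compress_kv_vector`, `threshold`, `bits`) are hypothetical.

```python
from typing import List, Tuple


def prune_by_magnitude(values: List[float], threshold: float) -> List[float]:
    """Zero out entries whose magnitude is below the threshold (assumed criterion)."""
    return [v if abs(v) >= threshold else 0.0 for v in values]


def quantize_symmetric(values: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric uniform quantization of one vector to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def pack_nonzero(codes: List[int]) -> List[Tuple[int, int]]:
    """Keep only (index, code) pairs for non-zero codes, i.e. what would be moved off-chip."""
    return [(i, c) for i, c in enumerate(codes) if c != 0]


def unpack(pairs: List[Tuple[int, int]], length: int, scale: float) -> List[float]:
    """Rebuild a dense, dequantized vector from the packed non-zero entries."""
    dense = [0.0] * length
    for i, c in pairs:
        dense[i] = c * scale
    return dense


def compress_kv_vector(values: List[float], threshold: float, bits: int = 4):
    """Cascade: prune first, then quantize what survives, then pack the non-zeros."""
    pruned = prune_by_magnitude(values, threshold)
    codes, scale = quantize_symmetric(pruned, bits)
    return pack_nonzero(codes), scale


if __name__ == "__main__":
    kv = [0.02, -1.4, 0.6, 0.0, 0.05, 1.4]  # toy values, not real KV data
    pairs, scale = compress_kv_vector(kv, threshold=0.1, bits=4)
    print(pairs, scale)
    print(unpack(pairs, len(kv), scale))
```

Pruning before quantization means small entries are dropped entirely instead of being stored as low-precision codes, so the packed list shrinks with sparsity. A real design would also need to account for the cost of storing the indices, which this sketch ignores.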
