LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling
Zhongchun Zhou, Chengtao Lai, Wei Zhang

TL;DR
LLaMCAT combines cache arbitration with thread throttling to speed up large language model inference on last-level-cache-based GPUs and AI accelerators, reducing cache stalls under the heavy bandwidth demands of KV Cache access.
Contribution
It presents the first targeted solution for MSHR contention in LLM decoding, combining cache arbitration, thread throttling, and a hybrid simulation framework for efficient inference optimization.
Findings
Achieves an average speedup of 1.26x when miss-handling throughput is the main bottleneck.
Attains a 1.58x speedup over the unoptimized system when cache size is limited.
Outperforms baseline methods such as dyncta in cache optimization for LLM inference.
Abstract
Large Language Models (LLMs) have achieved unprecedented success across various applications, but their substantial memory requirements pose significant challenges to current memory system designs, especially during inference. Our work targets last-level cache (LLC) based architectures, including GPUs (e.g., NVIDIA GPUs) and AI accelerators. We introduce LLaMCAT, a novel approach to optimize the LLC for LLM inference. LLaMCAT combines Miss Status Holding Register (MSHR)- and load balance-aware cache arbitration with thread throttling to address stringent bandwidth demands and minimize cache stalls in KV Cache access. We also propose a hybrid simulation framework integrating analytical models with cycle-level simulators via memory traces, balancing architecture detail and efficiency. Experiments demonstrate that LLaMCAT achieves an average speedup of 1.26x when the system is mainly…
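To make the MSHR terminology in the abstract concrete, the following is a minimal sketch of how a Miss Status Holding Register file behaves and why it can become a bottleneck: each outstanding cache-line miss occupies one entry, later misses to the same line merge into that entry, and a miss to a new line stalls once every entry is in use. The names `ToyMSHR` and `pick_request`, and the "prefer mergeable misses" rule, are hypothetical illustrations and not the arbitration policy proposed in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMSHR:
    """Toy MSHR file: one entry per outstanding cache-line miss (illustrative only)."""
    capacity: int
    entries: dict = field(default_factory=dict)  # line address -> waiting requester ids

    def can_accept(self, line: int) -> bool:
        # A miss to an already-tracked line merges; a new line needs a free entry.
        return line in self.entries or len(self.entries) < self.capacity

    def record_miss(self, line: int, requester: int) -> None:
        if not self.can_accept(line):
            raise RuntimeError("MSHR full: the cache would stall on this miss")
        self.entries.setdefault(line, []).append(requester)

    def fill(self, line: int) -> list:
        # Memory returned the line: free the entry and wake all merged requesters.
        return self.entries.pop(line, [])


def pick_request(pending, mshr):
    """Hypothetical MSHR-aware rule: prefer misses that merge, then misses that fit."""
    for req in pending:          # req is a (requester, line) tuple
        if req[1] in mshr.entries:
            return req
    for req in pending:
        if mshr.can_accept(req[1]):
            return req
    return None                  # every pending miss would stall


# Two entries, both occupied; only the request to line 0x10 can proceed.
mshr = ToyMSHR(capacity=2)
mshr.record_miss(line=0x10, requester=0)
mshr.record_miss(line=0x20, requester=1)
pending = [(2, 0x30), (3, 0x10)]
choice = pick_request(pending, mshr)
```

In this toy setup an arbiter that ignores MSHR state would pick the request to line `0x30` first and stall, while the MSHR-aware rule selects the request that merges into an existing entry.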
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Neural Network Applications
