LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling

Zhongchun Zhou; Chengtao Lai; Wei Zhang

arXiv:2512.00083·cs.AR·December 2, 2025

TL;DR

LLaMCAT combines cache arbitration with thread throttling to optimize large language model inference on GPU and accelerator architectures, improving inference speed by reducing cache stalls and meeting high bandwidth demands.

Contribution

The paper presents the first targeted solution for Miss Status Holding Register (MSHR) contention in LLM decoding, combining cache arbitration and thread throttling with a hybrid simulation framework that balances architectural detail and efficiency.

Findings

01. Achieves an average 1.26x speedup when miss throughput is the main bottleneck.

02. Attains a 1.58x speedup over the unoptimized system when cache size is limited.

03. Outperforms baseline methods such as dyncta in cache optimization for LLM inference.

Abstract

Large Language Models (LLMs) have achieved unprecedented success across various applications, but their substantial memory requirements pose significant challenges to current memory system designs, especially during inference. Our work targets last-level cache (LLC) based architectures, including GPUs (e.g., NVIDIA GPUs) and AI accelerators. We introduce LLaMCAT, a novel approach to optimize the LLC for LLM inference. LLaMCAT combines Miss Status Holding Register (MSHR)- and load balance-aware cache arbitration with thread throttling to address stringent bandwidth demands and minimize cache stalls in KV Cache access. We also propose a hybrid simulation framework integrating analytical models with cycle-level simulators via memory traces, balancing architecture detail and efficiency. Experiments demonstrate that LLaMCAT achieves an average speedup of 1.26x when the system is mainly…
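To make the idea of MSHR-aware arbitration concrete, here is a minimal toy sketch in Python. It is not LLaMCAT's policy, which the abstract does not specify. The class `MSHRFile`, the function `arbitrate`, the 64-byte line size, and the tie-breaking rule (prefer requests that merge into an existing MSHR entry, then the longest queue) are all assumptions made for illustration. Every request is treated as a cache miss, and thread throttling is not modeled.

```python
from collections import deque

LINE_BYTES = 64  # assumed cache-line size, for illustration only


class MSHRFile:
    """Toy MSHR file: maps an in-flight line address to its waiting requests."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = {}  # line address -> list of waiting request ids

    def can_accept(self, line):
        # A miss to an already-tracked line merges into its entry;
        # a miss to a new line needs a free entry.
        return line in self.entries or len(self.entries) < self.num_entries

    def allocate(self, line, req_id):
        if not self.can_accept(line):
            raise RuntimeError("MSHR full: request would stall")
        self.entries.setdefault(line, []).append(req_id)

    def fill(self, line):
        # Memory returned the line: free the entry and return its waiters.
        return self.entries.pop(line, [])


def arbitrate(queues, mshr):
    """Choose which requester's head request to issue this cycle.

    Hypothetical policy: skip heads that would stall on a full MSHR file,
    prefer heads that merge into an existing entry, then prefer the
    longest queue. Returns a queue index, or None if every head would stall.
    """
    best, best_key = None, None
    for i, q in enumerate(queues):
        if not q:
            continue
        line = q[0] // LINE_BYTES
        if not mshr.can_accept(line):
            continue
        key = (line in mshr.entries, len(q))
        if best_key is None or key > best_key:
            best, best_key = i, key
    return best


# Usage: both MSHR entries are busy, so only the mergeable request can issue.
mshr = MSHRFile(num_entries=2)
mshr.allocate(0, "r0")
mshr.allocate(1, "r1")
queues = [deque([128, 192]), deque([64])]  # heads map to lines 2 and 1
print(arbitrate(queues, mshr))  # prints 1
```

In this toy model, an MSHR-unaware arbiter that picked queue 0 would stall on the full MSHR file, while the request in queue 1 can proceed because it merges with a miss already in flight.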

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Neural Network Applications