Adaptive Cache Pollution Control for Large Language Model Inference Workloads Using Temporal CNN-Based Prediction and Priority-Aware Replacement
Songze Liu, Hongkun Du, Shaowen Wang

TL;DR
This paper introduces an adaptive cache management system using Temporal CNNs and priority-aware replacement to reduce cache pollution and improve performance in large language model inference workloads.
Contribution
It presents a novel Adaptive Cache Pollution Control (ACPC) mechanism that combines Temporal Convolutional Network (TCN)-based access prediction with a dynamic, priority-aware replacement strategy tailored to LLM inference workloads.
Findings
Reduces cache pollution by 41.7%
Improves cache hit rate by 8.9%
Decreases L2 miss penalty by 60%
Abstract
Large Language Models (LLMs), such as GPT and LLaMA, introduce unique memory access characteristics during inference due to frequent token sequence lookups and embedding vector retrievals. These workloads generate highly irregular and bursty access patterns, causing traditional prefetching and replacement policies to mispredict and trigger severe cache pollution, thereby degrading system performance. To address this challenge, this paper proposes an Adaptive Cache Pollution Control (ACPC) mechanism tailored for LLM inference workloads, integrating Temporal Convolutional Network (TCN)-based access prediction with a priority-aware replacement strategy. The TCN module learns temporal dependencies in token access sequences to identify potential high-reuse cache lines, while the replacement policy dynamically adjusts eviction priorities based on predicted reuse likelihood and cache…
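The abstract is cut off before it gives implementation details, so the following is only a minimal sketch of the general idea of priority-aware replacement, not the authors' method. It assumes a small fully associative cache and a caller-supplied `predict_reuse` function that stands in for the TCN predictor; the class name `PriorityAwareCache`, the scoring function, and the LRU tie-break are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, Hashable, Set, Tuple


class PriorityAwareCache:
    """Toy fully associative cache that evicts the line with the lowest
    predicted reuse score (ties broken by least-recent access).

    Illustrative only: `predict_reuse` is a stand-in for a learned
    predictor such as a TCN; it is not the paper's model.
    """

    def __init__(self, capacity: int,
                 predict_reuse: Callable[[Hashable], float]) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.predict_reuse = predict_reuse
        # key -> (predicted reuse score, tick of last access)
        self._lines: Dict[Hashable, Tuple[float, int]] = {}
        self._tick = 0
        self.hits = 0
        self.misses = 0

    def access(self, key: Hashable) -> bool:
        """Access `key`; return True on a hit, False on a miss."""
        self._tick += 1
        score = self.predict_reuse(key)
        if key in self._lines:
            self.hits += 1
            self._lines[key] = (score, self._tick)  # refresh score and recency
            return True
        self.misses += 1
        if len(self._lines) >= self.capacity:
            # Victim = lowest (score, last_access) pair: the least likely
            # to be reused, and the oldest among equally scored lines.
            victim = min(self._lines, key=lambda k: self._lines[k])
            del self._lines[victim]
        self._lines[key] = (score, self._tick)
        return False

    def contents(self) -> Set[Hashable]:
        """Return the set of keys currently resident in the cache."""
        return set(self._lines)


# Hypothetical predictor: treat key "a" as high-reuse, everything else as low.
cache = PriorityAwareCache(2, lambda k: 0.9 if k == "a" else 0.1)
trace = ["a", "b", "c", "a", "d", "a"]
results = [cache.access(k) for k in trace]
```

In this trace the low-scored keys `b` and `c` are evicted while `a` stays resident, which is the behaviour a pollution-control policy aims for: lines predicted to have little reuse are the first to go, so they do not displace lines that will be touched again.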
Taxonomy
Topics: Big Data and Digital Economy · Natural Language Processing Techniques · Advanced Neural Network Applications
