LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference

Harsh Vardhan Bansal

arXiv:2512.16843·cs.CL·March 3, 2026

TL;DR

LLMCache introduces a layer-wise caching framework that reuses intermediate transformer activations across semantically similar inputs, speeding up inference by up to 3.1x on models like BERT and GPT-2 with less than 0.5% accuracy degradation.

Contribution

It proposes a model-agnostic, layer-wise caching method with semantic matching and adaptive eviction, extending caching beyond token-level approaches for transformer inference.

Findings

01 Up to 3.1x inference speedup on BERT and GPT-2.

02 Less than 0.5% accuracy degradation.

03 Effective across encoder and decoder architectures.

Abstract

Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-time and large-scale deployment. While existing caching mechanisms, such as token-level key-value caches, offer speedups in autoregressive decoding, they are limited in scope and applicability. In this paper, we present LLMCache, a novel layer-wise caching framework that accelerates transformer inference by reusing intermediate activations based on semantic similarity of input sequences. Unlike prior work, LLMCache is model-agnostic, operates across both encoder and decoder architectures, and supports caching at arbitrary transformer layers. We introduce a lightweight fingerprinting mechanism for matching semantically similar inputs and propose adaptive eviction strategies to manage cache staleness. Experiments on…
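
The abstract's three ingredients (a fingerprint per input, a similarity-based lookup per layer, and an eviction policy) can be illustrated with a small sketch. This is not the paper's implementation: the names `fingerprint` and `LayerCache`, the mean-pooled fingerprint, the cosine threshold of 0.95, and the use of plain least-recently-used eviction in place of the paper's adaptive strategies are all assumptions made for illustration.

```python
import math
from collections import OrderedDict


def fingerprint(embeddings):
    """Hypothetical fingerprint: mean-pool token vectors, then L2-normalise."""
    dim = len(embeddings[0])
    pooled = [sum(vec[i] for vec in embeddings) / len(embeddings) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return tuple(x / norm for x in pooled)


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


class LayerCache:
    """Toy per-layer activation cache keyed by (layer index, fingerprint)."""

    def __init__(self, capacity=2, threshold=0.95):
        self.capacity = capacity      # assumed maximum number of cached entries
        self.threshold = threshold    # assumed minimum similarity for a hit
        self.entries = OrderedDict()  # (layer, fingerprint) -> cached activation

    def lookup(self, layer, fp):
        """Return the most similar cached activation for this layer, or None."""
        best_key, best_sim = None, self.threshold
        for key_layer, key_fp in self.entries:
            if key_layer != layer:
                continue
            sim = cosine(fp, key_fp)
            if sim >= best_sim:
                best_key, best_sim = (key_layer, key_fp), sim
        if best_key is None:
            return None
        self.entries.move_to_end(best_key)  # mark the entry as recently used
        return self.entries[best_key]

    def store(self, layer, fp, activation):
        """Insert an activation; evict least-recently-used entries when full."""
        self.entries[(layer, fp)] = activation
        self.entries.move_to_end((layer, fp))
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # LRU stand-in for adaptive eviction


cache = LayerCache(capacity=2, threshold=0.95)
fp_a = fingerprint([[1.0, 0.0], [0.9, 0.1]])
cache.store(layer=3, fp=fp_a, activation="hidden-state-A")

fp_near = fingerprint([[1.0, 0.05], [0.9, 0.1]])  # close to fp_a
fp_far = fingerprint([[0.0, 1.0], [0.1, 0.9]])    # nearly orthogonal to fp_a

hit = cache.lookup(3, fp_near)   # similar input at the same layer: reuse
miss = cache.lookup(3, fp_far)   # dissimilar input: the layer must be recomputed
```

On a hit, a real system would skip computing that layer and feed the cached activation forward; on a miss it would run the layer and call `store`. The linear scan over entries is only for readability; a practical cache would need an approximate nearest-neighbour index or a hashed fingerprint.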
