Architecting Long-Context LLM Acceleration with Packing-Prefetch Scheduler and Ultra-Large Capacity On-Chip Memories

Ming-Yen Lee; Faaiq Waqar; Hanchen Yang; Muhammed Ahosan Ul Karim; Harsono Simka; Shimeng Yu

arXiv:2508.08457·cs.AR·August 13, 2025

TL;DR

This paper combines packing-prefetch scheduling with ultra-large-capacity on-chip memories to accelerate long-context LLM inference, reporting an 8.06x decode speedup and a 1.83x overall latency reduction on Llama3.1-8B while easing the HBM bandwidth bottleneck caused by KV-cache transfers.

Contribution

It proposes a packing-prefetch scheduling architecture backed by monolithic 3D (M3D), back-end-of-line (BEOL) compatible embedded memories with ultra-large on-chip capacity to speed up long-context LLM inference.

Findings

01 8.06x decode speedup on Llama3.1-8B

02 1.83x overall latency reduction

03 1.7x-2.4x throughput improvement

Abstract

Long-context Large Language Model (LLM) inference faces increasing compute bottlenecks as attention calculations scale with context length, primarily due to the growing KV-cache transfer overhead that saturates High Bandwidth Memory (HBM). While prefetching techniques mitigate cache misses by fetching KV data in advance, their spatial and temporal benefits present new opportunities to exploit. This work proposes a packing-prefetch scheduling architecture with monolithic 3D (M3D) back-end-of-line (BEOL) compatible embedded memories with ultra-large on-chip capacity to accelerate long-context LLM inference. Our optimizations demonstrate 8.06x decode speedup and 1.83x overall latency reduction on Llama3.1-8B using TPUv6e-like hardware with additional 512MB BEOL memories over the serial execution. Evaluations of multi-request workloads on TPU-like architectures show 1.7x-2.4x throughput…
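
To give a sense of scale for the KV-cache growth the abstract describes, the following is a rough back-of-the-envelope sketch, not the paper's model. The function name `kv_cache_bytes` is hypothetical, and the default hyperparameters (32 layers, 8 KV heads, head dimension 128) are taken from the publicly released Llama 3.1 8B configuration rather than from this page; 16-bit KV storage is also an assumption.

```python
def kv_cache_bytes(context_len, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Estimate the KV-cache size in bytes for a single request.

    Defaults are assumptions: the public Llama 3.1 8B configuration
    with 16-bit keys and values. They are not stated in the source.
    """
    # Each token stores one key and one value vector per KV head per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len


MIB = 1024 ** 2

if __name__ == "__main__":
    for ctx in (4096, 32768, 131072):
        print(f"{ctx:>7} tokens: {kv_cache_bytes(ctx) / MIB:.0f} MiB")
```

Under these assumptions the cache grows linearly at 128 KiB per token, so it reaches 512 MiB at about 4K tokens and 16 GiB at 128K tokens. Since standard decoding reads the cached keys and values for every generated token, this is one way to see why the transfer volume rises with context length.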

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy