Prefill vs. Decode Bottlenecks: SRAM-Frequency Tradeoffs and the Memory-Bandwidth Ceiling
Hannah Atmer, Yuan Yao, Thiemo Voigt, Stefanos Kaxiras

TL;DR
This paper analyzes how SRAM size and operating frequency affect energy efficiency and performance in LLM inference, revealing that memory bandwidth limits gains from higher frequencies and larger buffers.
Contribution
It provides a detailed simulation-based analysis of the tradeoffs between SRAM size, frequency, and energy efficiency in LLM accelerators, highlighting the memory bandwidth bottleneck.
Findings
Larger SRAM buffers increase static (leakage) energy without a corresponding latency benefit.
High operating frequencies reduce prefill latency, but their benefit is capped by memory bandwidth.
The optimal configuration pairs a high operating frequency with small buffers for the best energy-delay performance.
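The frequency finding above can be illustrated with a simple roofline model. The sketch below is an assumption for illustration only, not the paper's LLMCompass or ScaleSIM methodology: latency is taken as the larger of compute time (which shrinks as frequency rises) and memory time (which does not). The function names, the accelerator parameters, and the two workloads are all hypothetical.

```python
def roofline_latency(flops, bytes_moved, freq_hz, flops_per_cycle, mem_bw):
    """Latency in seconds under a simple roofline model.

    This is an illustrative assumption, not the paper's simulation flow.
    """
    compute_time = flops / (flops_per_cycle * freq_hz)  # scales with 1/frequency
    memory_time = bytes_moved / mem_bw                  # independent of frequency
    return max(compute_time, memory_time)


# Hypothetical accelerator parameters, chosen only for illustration.
FLOPS_PER_CYCLE = 1000
MEM_BW = 100e9  # bytes per second

# Hypothetical workloads: same bytes moved, very different FLOP counts.
PREFILL = dict(flops=1e12, bytes_moved=1e9)  # high operational intensity
DECODE = dict(flops=1e9, bytes_moved=1e9)    # low operational intensity


def latency_at(workload, freq_hz):
    """Evaluate one workload on the hypothetical accelerator at freq_hz."""
    return roofline_latency(
        freq_hz=freq_hz,
        flops_per_cycle=FLOPS_PER_CYCLE,
        mem_bw=MEM_BW,
        **workload,
    )


for name, workload in (("prefill", PREFILL), ("decode", DECODE)):
    t1 = latency_at(workload, 1e9)
    t2 = latency_at(workload, 2e9)
    print(f"{name}: {t1:.4f} s at 1 GHz, {t2:.4f} s at 2 GHz")
```

With these made-up numbers, doubling the frequency halves the compute-bound prefill latency, while the memory-bound decode latency stays pinned at the bandwidth ceiling.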
Abstract
Energy consumption dictates the cost and environmental impact of deploying Large Language Models. This paper investigates the impact of on-chip SRAM size and operating frequency on the energy efficiency and performance of LLM inference, focusing on the distinct behaviors of the compute-bound prefill and memory-bound decode phases. Our simulation methodology combines OpenRAM for energy modeling, LLMCompass for latency simulation, and ScaleSIM for systolic array operational intensity. Our findings show that total energy use is predominantly determined by SRAM size in both phases, with larger buffers significantly increasing static energy due to leakage, which is not offset by corresponding latency benefits. We quantitatively explore the memory-bandwidth bottleneck, demonstrating that while high operating frequencies reduce prefill latency, their positive impact on memory-bound decode…
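The abstract's point about leakage can be made concrete with a back-of-the-envelope energy model. This is a minimal sketch under stated assumptions, not the paper's OpenRAM-based model: static energy is taken as leakage power times latency, leakage power is assumed to grow linearly with SRAM capacity, and dynamic energy is held constant. Every name and number below is hypothetical.

```python
def total_energy(latency_s, sram_bytes, leak_w_per_byte, dynamic_j):
    """Total energy in joules: dynamic energy plus SRAM static energy.

    Assumes leakage power scales linearly with SRAM capacity, which is
    a simplification made for this sketch.
    """
    static_j = leak_w_per_byte * sram_bytes * latency_s
    return dynamic_j + static_j


def energy_delay_product(latency_s, energy_j):
    """Energy-delay product in joule-seconds; lower is better."""
    return latency_s * energy_j


# Hypothetical values, chosen only for illustration.
LEAK_W_PER_BYTE = 1e-6
DYNAMIC_J = 0.5
LATENCY_S = 0.01
SMALL_SRAM = 1 * 2**20   # 1 MiB
LARGE_SRAM = 16 * 2**20  # 16 MiB

# Same latency for both buffer sizes: the larger buffer buys no speedup here.
e_small = total_energy(LATENCY_S, SMALL_SRAM, LEAK_W_PER_BYTE, DYNAMIC_J)
e_large = total_energy(LATENCY_S, LARGE_SRAM, LEAK_W_PER_BYTE, DYNAMIC_J)

print(f"small SRAM: {e_small:.4f} J, EDP {energy_delay_product(LATENCY_S, e_small):.6f} J*s")
print(f"large SRAM: {e_large:.4f} J, EDP {energy_delay_product(LATENCY_S, e_large):.6f} J*s")
```

In this toy setting the larger buffer only adds leakage, so both energy and energy-delay product get worse. A shorter latency, for example from a higher frequency in a compute-bound phase, would reduce the static term for a given buffer size.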
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Big Data and Digital Economy
