Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning
Aojie Yuan, Tianqi Shen, Dajun Zhang

TL;DR
This paper proposes a semantics-aware memory hierarchy for LLM reasoning that offloads low-importance tokens to CPU memory, maintaining accuracy while significantly reducing GPU memory usage.
Contribution
The paper introduces a token offloading method driven by cumulative attention scores and formalizes it as zero-approximation-error offloading, with empirical validation across multiple model scales and benchmarks.
Findings
Accuracy depends on the eviction ratio (the share of tokens permanently discarded), not on how many tokens remain in HBM.
With 3% eviction, the method retains 91% of full-cache accuracy on GSM8K.
At 14B scale, it halves HBM occupancy while matching baseline accuracy.
Abstract
Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a different question: must every token live in HBM, or can some live elsewhere? We introduce a semantics-aware memory hierarchy that sorts tokens into four tiers -- HBM, DDR, compressed, and evicted -- using cumulative attention scoring. Low-importance tokens are moved to CPU memory rather than destroyed; before each attention step they are prefetched back at full precision, contributing exactly the same terms as if they had never left the GPU. We formalize this as zero-approximation-error offloading and derive our central finding: accuracy depends solely on how many tokens are permanently discarded (the eviction…
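The difference between offloading and eviction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a toy single-head attention in plain Python, and the names `hbm`, `ddr`, `gather`, the 50% split, and the use of random queries to build cumulative attention scores are all assumptions made for illustration. The compressed tier is omitted because the abstract does not say how compression is done. The sketch shows that tokens moved to a second store and fetched back at full precision yield the same attention output as the full cache, while dropping the same tokens changes the output.

```python
import math
import random


def softmax_weights(q, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Attention output: weighted sum of value vectors."""
    w = softmax_weights(q, keys)
    return [sum(wi * v[d] for wi, v in zip(w, values)) for d in range(len(values[0]))]


def rand_vec(rng, dim):
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(0)
n_tokens, dim = 16, 8
keys = [rand_vec(rng, dim) for _ in range(n_tokens)]
values = [rand_vec(rng, dim) for _ in range(n_tokens)]

# Cumulative attention score: total weight each token received over past
# queries (simulated here with random queries, an assumption of this sketch).
cum_score = [0.0] * n_tokens
for _ in range(32):
    for i, w in enumerate(softmax_weights(rand_vec(rng, dim), keys)):
        cum_score[i] += w

# Hypothetical policy: the top half by score stays in "HBM", the rest is "low".
ranked = sorted(range(n_tokens), key=lambda i: cum_score[i], reverse=True)
hbm = {i: (keys[i], values[i]) for i in ranked[: n_tokens // 2]}
ddr = {i: (keys[i], values[i]) for i in ranked[n_tokens // 2 :]}


def gather(stores):
    """Reassemble K/V from the given stores in original token order."""
    merged = {}
    for store in stores:
        merged.update(store)
    ids = sorted(merged)
    return [merged[i][0] for i in ids], [merged[i][1] for i in ids]


q = rand_vec(rng, dim)
full_out = attend(q, keys, values)
offload_out = attend(q, *gather([hbm, ddr]))  # low tokens prefetched back
evict_out = attend(q, *gather([hbm]))         # low tokens permanently dropped

max_err_offload = max(abs(a - b) for a, b in zip(full_out, offload_out))
max_err_evict = max(abs(a - b) for a, b in zip(full_out, evict_out))
print("offload error:", max_err_offload)
print("evict error:  ", max_err_evict)
```

In this toy setting the offloaded path reproduces the full-cache output exactly because every key and value is restored unchanged and in its original position, so the softmax sees identical terms. The evicted path renormalizes over fewer tokens, which is where the approximation error comes from.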
