Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning

Aojie Yuan; Tianqi Shen; Dajun Zhang

arXiv:2605.09490·cs.CL·May 12, 2026

TL;DR

This paper proposes a semantics-aware memory hierarchy for LLM reasoning that offloads low-importance tokens to CPU memory, maintaining accuracy while significantly reducing GPU memory usage.

Contribution

It introduces a novel token offloading method based on cumulative attention scores, formalized as zero-approximation-error offloading, with empirical validation across multiple scales and benchmarks.

Findings

01. Accuracy depends on the eviction ratio, not on how many tokens remain in HBM.

02. With 3% eviction, the method retains 91% of full-cache accuracy on GSM8K.

03. At the 14B scale, it halves HBM occupancy while matching baseline accuracy.

Abstract

Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a different question: must every token live in HBM, or can some live elsewhere? We introduce a semantics-aware memory hierarchy that sorts tokens into four tiers -- HBM, DDR, compressed, and evicted -- using cumulative attention scoring. Low-importance tokens are moved to CPU memory rather than destroyed; before each attention step they are prefetched back at full precision, contributing exactly the same terms as if they had never left the GPU. We formalize this as zero-approximation-error offloading and derive our central finding: accuracy depends solely on how many tokens are permanently discarded (the eviction…
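
The offloading idea in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the function names (`assign_tiers`, `tiered_attention`), the fraction parameters, and the rank-based thresholds are all hypothetical, and the compressed tier is left out because the visible text does not describe how it works. The sketch ranks tokens by cumulative attention score, places them in HBM, DDR, or evicted tiers, and computes attention over everything that was not evicted. Since DDR tokens are fetched back at full precision, only the evicted set changes the result.

```python
import math


def attention_weights(query, keys):
    """Softmax over dot-product logits for one query (single head, no scaling)."""
    logits = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Standard attention output: weighted sum of value vectors."""
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def cumulative_scores(past_queries, keys):
    """Sum the attention weight each token received over earlier queries."""
    scores = [0.0] * len(keys)
    for query in past_queries:
        for i, w in enumerate(attention_weights(query, keys)):
            scores[i] += w
    return scores


def assign_tiers(scores, hbm_frac, evict_frac):
    """Hypothetical rank-based policy: top tokens stay in HBM, the bottom
    tokens are evicted, and everything in between is offloaded to DDR."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    n_hbm = int(n * hbm_frac)
    n_evict = int(n * evict_frac)
    tiers = {}
    for rank, idx in enumerate(order):
        if rank < n_hbm:
            tiers[idx] = "HBM"
        elif rank >= n - n_evict:
            tiers[idx] = "EVICTED"
        else:
            tiers[idx] = "DDR"
    return tiers


def tiered_attention(query, keys, values, tiers):
    """Attention over HBM tokens plus DDR tokens prefetched at full precision.
    Evicted tokens are gone for good, so they are the only source of error."""
    kept = [i for i in range(len(keys)) if tiers[i] != "EVICTED"]
    return attention(query, [keys[i] for i in kept], [values[i] for i in kept])
```

Under these assumptions, setting `evict_frac=0` reproduces the full-cache output for any `hbm_frac`, and two configurations with the same `evict_frac` but different `hbm_frac` give the same output, which mirrors the claim that accuracy depends on the eviction ratio rather than on how many tokens stay in HBM.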
