The Missing Memory Hierarchy: Demand Paging for LLM Context Windows
Tony Mason

TL;DR
This paper introduces Pichay, a demand paging system for LLM context windows that evicts stale content, detects page faults when the model re-requests evicted material, and pins working-set pages, reducing context consumption by up to 93% in live deployment.
Contribution
It presents a novel demand paging architecture for LLMs, implementing a multi-level memory hierarchy and demonstrating substantial context reduction in production.
Findings
Reduces context consumption by up to 93% in live deployment
Fault rate in offline replay is 0.0254%
System remains operational under extreme pressure despite thrashing
Abstract
The context window of a large language model is not memory. It is L1 cache: a small, fast, expensive resource that the field treats as the entire memory system. There is no L2, no virtual memory, no paging. Every tool definition, every system prompt, and every stale tool result occupies context for the lifetime of the session. The result is measurable: across 857 production sessions and 4.45 million effective input tokens, 21.8% is structural waste. We present Pichay, a demand paging system for LLM context windows. Implemented as a transparent proxy between client and inference API, Pichay interposes on the message stream to evict stale content, detect page faults when the model re-requests evicted material, and pin working-set pages identified by fault history. In offline replay across 1.4 million simulated evictions, the fault rate is 0.0254%. In live production deployment over…
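The three mechanisms named in the abstract (evicting stale content, detecting page faults on re-request, and pinning pages based on fault history) can be illustrated with a minimal sketch. This is not Pichay's implementation: the class `ContextPager`, its methods, and the thresholds `stale_after` and `pin_after_faults` are all hypothetical names and policies chosen for illustration, and staleness is approximated here simply by the number of turns since a page was last referenced.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """One unit of context, e.g. a tool result (hypothetical structure)."""
    key: str
    content: str
    last_used_turn: int
    faults: int = 0
    pinned: bool = False


class ContextPager:
    """Illustrative pager; thresholds are assumptions, not values from the paper."""

    def __init__(self, stale_after: int = 3, pin_after_faults: int = 2):
        self.stale_after = stale_after
        self.pin_after_faults = pin_after_faults
        self.resident: dict[str, Page] = {}  # pages currently in the context window
        self.evicted: dict[str, Page] = {}   # pages held in a backing store

    def add(self, key: str, content: str, turn: int) -> None:
        """Place new content into the context window."""
        self.resident[key] = Page(key, content, turn)

    def evict_stale(self, turn: int) -> list[str]:
        """Move unpinned pages not referenced for `stale_after` turns to the backing store."""
        stale = [
            key for key, page in self.resident.items()
            if not page.pinned and turn - page.last_used_turn >= self.stale_after
        ]
        for key in stale:
            self.evicted[key] = self.resident.pop(key)
        return stale

    def touch(self, key: str, turn: int) -> bool:
        """Record a reference to `key`; return True if it was a page fault."""
        if key in self.resident:
            self.resident[key].last_used_turn = turn
            return False
        page = self.evicted.pop(key)  # raises KeyError for content never seen
        page.faults += 1
        page.last_used_turn = turn
        # Pin pages that keep faulting: they belong to the working set.
        if page.faults >= self.pin_after_faults:
            page.pinned = True
        self.resident[key] = page
        return True
```

In this sketch a page that faults repeatedly is pinned and is never evicted again, which is one simple way to read "pin working-set pages identified by fault history"; the actual policy in the paper may differ.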
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance · Software System Performance and Reliability
