The Missing Memory Hierarchy: Demand Paging for LLM Context Windows

Tony Mason

arXiv:2603.09023 · cs.OS · March 11, 2026

Open Access

TL;DR

This paper introduces Pichay, a demand paging system for LLM context windows. Running as a transparent proxy, it evicts stale content from the context and cuts context consumption by up to 93% in live deployment.

Contribution

It presents a novel demand paging architecture for LLMs, implementing a multi-level memory hierarchy and demonstrating substantial context reduction in production.

Findings

01. Reduces context consumption by up to 93% in live deployment

02. Fault rate in offline replay is 0.0254%

03. System remains operational under extreme pressure despite thrashing

Abstract

The context window of a large language model is not memory. It is L1 cache: a small, fast, expensive resource that the field treats as the entire memory system. There is no L2, no virtual memory, no paging. Every tool definition, every system prompt, and every stale tool result occupies context for the lifetime of the session. The result is measurable: across 857 production sessions and 4.45 million effective input tokens, 21.8% is structural waste. We present Pichay, a demand paging system for LLM context windows. Implemented as a transparent proxy between client and inference API, Pichay interposes on the message stream to evict stale content, detect page faults when the model re-requests evicted material, and pin working-set pages identified by fault history. In offline replay across 1.4 million simulated evictions, the fault rate is 0.0254%. In live production deployment over…
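The evict, fault, and pin cycle described in the abstract can be illustrated with a minimal sketch. This is not Pichay's implementation: the class names, the idle-turn eviction rule, the stub format, the pin threshold, and the definition of fault rate as faults per eviction are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

STUB_PREFIX = "[evicted page "  # hypothetical placeholder format, not from the paper


@dataclass
class Page:
    page_id: str
    content: str
    last_used_turn: int
    pinned: bool = False
    resident: bool = True


@dataclass
class ContextPager:
    """Toy pager: evicts idle pages, counts faults, pins repeat offenders."""
    max_idle_turns: int = 3      # assumed staleness threshold
    pin_after_faults: int = 2    # assumed pin threshold
    pages: dict = field(default_factory=dict)
    fault_counts: dict = field(default_factory=dict)
    evictions: int = 0
    faults: int = 0

    def add(self, page_id, content, turn):
        self.pages[page_id] = Page(page_id, content, turn)

    def evict_stale(self, turn):
        """Evict unpinned pages idle for more than max_idle_turns."""
        evicted = []
        for page in self.pages.values():
            idle = turn - page.last_used_turn
            if page.resident and not page.pinned and idle > self.max_idle_turns:
                page.resident = False
                self.evictions += 1
                evicted.append(page.page_id)
        return evicted

    def access(self, page_id, turn):
        """Return page content; a request for an evicted page is a fault."""
        page = self.pages[page_id]
        if not page.resident:
            self.faults += 1
            count = self.fault_counts.get(page_id, 0) + 1
            self.fault_counts[page_id] = count
            page.resident = True                 # page the content back in
            if count >= self.pin_after_faults:
                page.pinned = True               # working-set page: stop evicting it
        page.last_used_turn = turn
        return page.content

    def render(self):
        """Context as sent upstream: full content or a short stub per page."""
        return [
            p.content if p.resident else f"{STUB_PREFIX}{p.page_id}]"
            for p in self.pages.values()
        ]

    def fault_rate(self):
        return self.faults / self.evictions if self.evictions else 0.0


pager = ContextPager(max_idle_turns=2, pin_after_faults=2)
pager.add("tool:readme", "README contents", turn=0)
pager.evict_stale(turn=3)             # idle for 3 turns, so evicted
pager.access("tool:readme", turn=4)   # first fault, paged back in
pager.evict_stale(turn=7)             # evicted again
pager.access("tool:readme", turn=8)   # second fault, page is now pinned
```

In this toy run every eviction is followed by a fault, so the fault rate is 1.0. That is the thrashing pattern pinning is meant to stop: once the page is pinned, later calls to `evict_stale` leave it resident.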

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance · Software System Performance and Reliability