Learning Semantics, Not Addresses: Runtime Neural Prefetching for Far Memory
Yutong Huang, Zhiyuan Guo, Yiying Zhang

TL;DR
FarSight is a Linux-based far-memory system that uses deep learning to predict access patterns from application semantics rather than raw addresses, delivering up to 3.6x higher performance than the state of the art across four data-intensive workloads.
Contribution
FarSight decouples application semantics from runtime memory layout, so that models trained offline can drive prefetching in a far-memory system.
Findings
Up to 3.6x higher performance than the state of the art
First to leverage deep learning in Linux-based far-memory prefetching
Effective across four data-intensive workloads
Abstract
Memory prefetching has long boosted CPU caches and is increasingly vital for far-memory systems, where large portions of memory are offloaded to cheaper, remote tiers. While effective prefetching requires accurate prediction of future accesses, prior ML approaches have been limited to simulation or small-scale hardware. We introduce FarSight, the first Linux-based far-memory system to leverage deep learning by decoupling application semantics from runtime memory layout. This separation enables offline-trained models to predict access patterns over a compact ordinal vocabulary, which are resolved at runtime through lightweight mappings. Across four data-intensive workloads, FarSight delivers up to 3.6x higher performance than the state-of-the-art.
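The decoupling described in the abstract can be pictured with a toy sketch. It is not FarSight's implementation: the `FutureMap` class, its round-robin slot reuse, and the `encode` helper are hypothetical names and choices made for illustration, and only the vocabulary size K=64 comes from the reviews below. The sketch assumes each page keeps a small table of pages seen after it, so a model only has to predict a slot number (an ordinal) while the table turns that ordinal back into a concrete page at runtime.

```python
from collections import defaultdict

K = 64  # vocabulary size cited in the reviews; all other details are assumptions


class FutureMap:
    """Hypothetical per-page table: ordinal slot -> page observed after this page."""

    def __init__(self, k=K):
        self.slots = [None] * k
        self.next_free = 0

    def record(self, next_page):
        """Return the ordinal for next_page, assigning a slot on first sight."""
        if next_page in self.slots:
            return self.slots.index(next_page)
        ordinal = self.next_free
        self.slots[ordinal] = next_page
        # Round-robin reuse once the table is full (an illustrative policy only).
        self.next_free = (self.next_free + 1) % len(self.slots)
        return ordinal

    def resolve(self, ordinal):
        """Map a predicted ordinal back to a concrete page (None if unused)."""
        return self.slots[ordinal]


def encode(trace, maps):
    """Turn a page-miss trace into ordinals, filling the per-page maps."""
    return [maps[cur].record(nxt) for cur, nxt in zip(trace, trace[1:])]


# The same traversal under two different memory layouts.
trace_a = [1, 2, 1, 3, 1, 2, 1, 3]
remap = {1: 100, 2: 7, 3: 42}          # a second run places pages elsewhere
trace_b = [remap[p] for p in trace_a]

maps_a = defaultdict(FutureMap)
maps_b = defaultdict(FutureMap)
ords_a = encode(trace_a, maps_a)
ords_b = encode(trace_b, maps_b)

# Both layouts yield the same ordinal sequence, so a model trained on one
# run sees identical inputs on the other.
print(ords_a == ords_b)
# An ordinal predicted for page 100 resolves through that page's own map.
print(maps_b[100].resolve(1))
```

In this toy setting the ordinal sequence is identical across the two layouts, which is the property that would let an offline-trained model be reused when addresses change between runs; the concrete page is recovered only at the final lookup.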
Peer Reviews
Decision: Submitted to ICLR 2026
The semantics-addresses decoupling via an ordinal vocabulary and per‑page future maps is a neat formulation that squarely addresses the problem of the 64‑bit address space for sequence models. This differs from prior ML prefetchers that predict concrete addresses or small fixed offsets. Applying a very small RetNet with constant‑time inference to OS‑level far‑memory prefetching, plus the rotary‑reuse encoding to avoid recomputation across sliding windows, is a pragmatic systems‑ML combination. T
The paper compares against FastSwap and Hermit (Linux‑prefetch based) and against a home‑built simulator of Twilight only on MCF, replayed into the far‑memory runtime (Sec. 4, "Baselines"). This raises fairness and validity questions: the mapping from Twilight’s cache‑line predictions to page‑granular far‑memory prefetch is nontrivial and may handicap it; details of the simulator (e.g., clustering choices, history length, throttling, look‑ahead) are sparse. Please provide a more thorough descrip
• The ordinal vocabulary representation reduces the intractable address space problem (2^48 addresses) to a learnable vocabulary of K=64. This enables accurate prediction with a model that fits in L1/L2 cache
• This appears to be the first Linux-based far-memory system to successfully deploy deep learning in production rather than simulation-only results, with 5.5K lines of kernel code demonstrating real-world feasibility
• Multiple clever optimizations work synergistically: using page miss
A) Empirical evaluation weaknesses (Please also see questions for how to improve): i) Lack of comprehensive evaluation, specifically paucity of **diverse** workloads; they evaluate only four workloads, with three being graph-based (MCF, PageRank, SSSP). Paper claims applicability to "graph processing, tree and index structures, pointer chasing, and recursive data structures" (lines 37-40) but provides no evidence beyond graph traversals, which is only one of many challenging graph analytic metri
* Improving the speed of computation is good for almost everyone: faster compute = cheaper runtime = lower carbon footprint.
* Prefetching indeed speeds up programs, especially in for-loops (I witnessed this first-hand)!
* Authors have implemented it in Linux already!
* The main weakness -- which is not a real weakness -- is the relevance of this paper to the ICLR community. The paper is very systems-like: descriptive text but not a single equation or model. It would have been more relevant if it were presented in a way that appeals to the audience. For example, the actual model (mod, feeding into ML) should be written as equations; they take less effort to process than lines of text, especially as we are used to seeing them regularly. I am not sure the method a
Taxonomy
Topics: Advanced Data Storage Technologies · Stochastic Gradient Optimization Techniques · Advanced Image and Video Retrieval Techniques
