HillInfer: Efficient Long-Context LLM Inference on the Edge with Hierarchical KV Eviction using SmartSSD

He Sun; Shinan Liu; Li Li; Mingjun Xiao

arXiv:2602.18750 · cs.AR · March 26, 2026

Open Access

TL;DR

HillInfer combines hierarchical KV cache management with token-importance evaluation offloaded to a SmartSSD, enabling efficient, low-latency long-context LLM inference on memory-constrained edge devices.

Contribution

The paper proposes a KV eviction framework assisted by a Computational Storage Drive (CSD), with hierarchical cache management and adaptive prefetching, for efficient long-context inference on AI Personal Computers (AIPCs).

Findings

01 Achieves up to 8.56× speedup over state-of-the-art methods.

02 Reduces the PCIe I/O bottlenecks and FPGA resource exhaustion that affect existing offloading strategies.

03 Maintains model accuracy while improving inference efficiency.

Abstract

Deploying Large Language Models (LLMs) on memory-constrained AI Personal Computers (AIPCs) enables low-latency, privacy-preserving inference, but long-context generation is fundamentally bottlenecked by the linearly growing Key-Value (KV) cache. While dynamic KV eviction mitigates this memory wall, existing offloading strategies either trigger crippling PCIe I/O bottlenecks on standard SSDs or suffer from FPGA resource exhaustion by forcing compute-intensive exact attention on a single, weak Computational Storage Drive (CSD). In this paper, we propose HillInfer, a CSD-assisted KV eviction framework that introduces a paradigm shift: offloading strictly lightweight token importance evaluation to a single CSD (e.g., SmartSSD) on AIPCs. To fully capitalize on this lightweight offloading strategy, HillInfer orchestrates a Hierarchical KV Cache Manager (HKM) that leverages temporal locality…
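
The two ideas the abstract leans on, a KV cache that grows linearly with context length and eviction driven by a per-token importance score, can be illustrated with a small sketch. This is not HillInfer's implementation: the function names, the recency window (one simple way to exploit temporal locality), and the example scores are all assumptions made for illustration, and the abstract does not specify how importance is computed on the CSD.

```python
from typing import List


def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence.

    The leading factor 2 accounts for keys and values; the total is
    linear in seq_len, which is the growth the abstract refers to.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def select_tokens_to_keep(importance: List[float], budget: int,
                          recent_window: int) -> List[int]:
    """Pick which token positions stay in the cache (hypothetical policy).

    Always keeps the `recent_window` newest tokens, then fills the rest
    of `budget` with the highest-scoring older tokens. Everything else
    would be evicted or offloaded.
    """
    n = len(importance)
    if budget >= n:
        return list(range(n))  # nothing needs to be evicted

    recent_window = min(recent_window, budget)
    recent = list(range(n - recent_window, n))
    older = range(n - recent_window)

    remaining = budget - recent_window
    top_older = sorted(older, key=lambda i: importance[i], reverse=True)[:remaining]

    # Return positions in their original order.
    return sorted(top_older) + recent


if __name__ == "__main__":
    # Illustrative model shape (assumed, not taken from the paper).
    size = kv_cache_bytes(seq_len=1024, n_layers=32, n_kv_heads=8, head_dim=128)
    print(f"KV cache at 1024 tokens: {size / 2**20:.0f} MiB")

    scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.3, 0.1, 0.4]  # made-up importance scores
    print(select_tokens_to_keep(scores, budget=4, recent_window=2))
```

Only the scoring step is cheap here (a sort over scalar scores), which is the kind of lightweight work the abstract describes offloading, as opposed to running exact attention on the storage device.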

Taxonomy

Topics: Big Data and Digital Economy · Advanced Neural Network Applications · Data Quality and Management