HillInfer: Efficient Long-Context LLM Inference on the Edge with Hierarchical KV Eviction using SmartSSD
He Sun, Shinan Liu, Li Li, Mingjun Xiao

TL;DR
HillInfer combines hierarchical KV cache management with SmartSSD-based token importance evaluation to enable efficient, low-latency long-context LLM inference on edge devices with limited memory.
Contribution
It proposes a Computational Storage Drive (CSD)-assisted KV eviction framework with hierarchical cache management and adaptive prefetching for efficient long-context inference on AI Personal Computers (AIPCs).
Findings
Achieves up to 8.56× speedup over state-of-the-art methods.
Reduces I/O bottlenecks and resource exhaustion issues.
Maintains model accuracy while improving inference efficiency.
Abstract
Deploying Large Language Models (LLMs) on memory-constrained AI Personal Computers (AIPCs) enables low-latency, privacy-preserving inference, but long-context generation is fundamentally bottlenecked by the linearly growing Key-Value (KV) cache. While dynamic KV eviction mitigates this memory wall, existing offloading strategies either trigger crippling PCIe I/O bottlenecks on standard SSDs or suffer from FPGA resource exhaustion by forcing compute-intensive exact attention on a single, weak Computational Storage Drive (CSD). In this paper, we propose HillInfer, a CSD-assisted KV eviction framework that introduces a paradigm shift: offloading strictly lightweight token importance evaluation to a single CSD (e.g., SmartSSD) on AIPCs. To fully capitalize on this lightweight offloading strategy, HillInfer orchestrates a Hierarchical KV Cache Manager (HKM) that leverages temporal locality…
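The two ideas the abstract leans on, a KV cache that grows linearly with context length and eviction driven by per-token importance, can be illustrated with a minimal sketch. This is not HillInfer's implementation: the function names, the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements), and the plain top-k keep rule are all assumptions chosen for illustration, and the importance scores are taken as given rather than computed on a CSD.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache for one sequence, in bytes.

    The leading factor 2 accounts for storing both a key and a value
    tensor in every layer. The result is linear in seq_len.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def evict_to_budget(importance, budget):
    """Return the positions of the tokens to keep, in original order.

    `importance` is a list of per-token scores (hypothetical input; how the
    scores are produced is outside this sketch). The `budget` highest-scoring
    tokens are kept and the rest are evicted.
    """
    if budget >= len(importance):
        return list(range(len(importance)))
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # Hypothetical model shape, assumed only for this example.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)
    for tokens in (4096, 32768):
        gib = kv_cache_bytes(tokens, **shape) / 2**30
        print(f"{tokens} tokens: {gib:.2f} GiB")

    scores = [0.1, 0.9, 0.3, 0.7]
    print("kept positions:", evict_to_budget(scores, budget=2))
```

Under these assumed dimensions each token costs 128 KiB of cache, so a 32,768-token context alone needs 4 GiB, which is the kind of pressure that motivates eviction on a memory-constrained device.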
Taxonomy
Topics: Big Data and Digital Economy · Advanced Neural Network Applications · Data Quality and Management
