TL;DR
ContextPilot is a system that accelerates long-context LLM inference by reusing overlapping context blocks across interactions, cutting prefill latency by up to 3x while maintaining or improving reasoning quality.
Contribution
ContextPilot introduces context reuse as a mechanism for faster prefill. It combines a context index, context ordering, de-duplication, and annotations so that the speedup does not come at the cost of reasoning quality.
Findings
Reduces LLM prefill latency by up to 3x
Preserves reasoning quality during context reuse
Can improve reasoning quality at longer context lengths
Abstract
AI applications increasingly depend on long-context inference, where LLMs consume substantial context to support stronger reasoning. Common examples include retrieval-augmented generation, agent memory layers, and multi-agent orchestration. As input contexts get longer, prefill latency becomes the main bottleneck. Yet today's prefill acceleration techniques face a trade-off: they either preserve reasoning quality but deliver little KV-cache reuse, or improve reuse at the cost of degraded reasoning quality. We present ContextPilot, a system that accelerates prefill by introducing context reuse as a new mechanism for faster long-context inference. ContextPilot introduces a context index to identify overlapping context blocks across LLM interactions (e.g., across users and turns). It further proposes context ordering and de-duplication techniques to maximize KV-cache reuse. To preserve…
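The abstract describes three cooperating pieces: an index that recognizes overlapping context blocks, de-duplication, and an ordering that maximizes KV-cache reuse. The Python sketch below is only an illustration of that general idea, not the paper's algorithm. It assumes a prefix-style KV cache, where reuse is possible only for a leading run of identical blocks, and it assumes blocks can be reordered freely. The names `ContextIndex`, `plan`, `block_id`, and `shared_prefix_len` are hypothetical, and the sketch leaves out the annotations mentioned under Contribution.

```python
import hashlib


def block_id(text):
    """Content hash used as a hypothetical index key for a context block."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


class ContextIndex:
    """Toy index: remembers the order in which blocks were first seen.

    Assumption: sorting every request by this global first-seen rank gives
    requests that share blocks a common leading run, which a prefix-based
    KV cache could then reuse.
    """

    def __init__(self):
        self.rank = {}  # block id -> first-seen sequence number

    def plan(self, blocks):
        # De-duplicate within the request, keeping first occurrences.
        unique = list(dict.fromkeys(blocks))
        # Register unseen blocks so they receive the next free rank.
        for b in unique:
            self.rank.setdefault(block_id(b), len(self.rank))
        # Canonical order: previously seen blocks first, new blocks last.
        return sorted(unique, key=lambda b: self.rank[block_id(b)])


def shared_prefix_len(a, b):
    """Number of leading blocks two planned requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


index = ContextIndex()
first = index.plan(["docA", "docB", "docA"])   # duplicate docA is dropped
second = index.plan(["docC", "docB", "docA"])  # overlaps with first on docA, docB
print(first, second, shared_prefix_len(first, second))
```

In this toy run the second request is reordered so that its two overlapping blocks lead, which is the part a prefix cache could serve without recomputation. Reordering retrieved context can change model behavior, so a real system would need some way to protect reasoning quality; the sketch does not attempt that.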
