Using Span Queries to Optimize for Cache and Attention Locality
Paul Castro, Nick Mitchell, Nathan Ordonez, Thomas Parnell, Mudhakar Srivatsa, Antoni Viros i Martin

TL;DR
This paper introduces span queries as a flexible interface to inference servers. Expressing chat, RAG, and reasoning workloads as span queries lets the server optimize each of them for cache and attention locality, which improves performance.
Contribution
The paper proposes span queries as a unified, expressive interface for inference workloads, with automatic optimization for cache and attention locality, and demonstrates practical implementation and performance gains.
Findings
Achieved 10-20x reductions in time to first token (TTFT) for non-chat workloads.
Enabled high-performance execution with a small code change in vLLM.
Improved attention locality, allowing smaller models to outperform larger ones on accuracy.
Abstract
Clients are evolving beyond chat completion, and now include a variety of innovative inference-time scaling and deep reasoning techniques. At the same time, inference servers remain heavily optimized for chat completion. Prior work has shown that large improvements to KV cache hit rate are possible if inference servers evolve towards these non-chat use cases. However, that work offers solutions that are also optimized for a single use case, RAG. In this paper, we introduce the span query to generalize the interface to the inference server. We demonstrate that chat, RAG, inference-time scaling, and agentic workloads can all be expressed as span queries. We show how the critical distinction that had been assumed by prior work lies in whether the order of the inputs matters -- do they commute? In chat, they do not. In RAG, they often do. This paper introduces span queries, which are expression…
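The commutativity distinction in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the node types Text, Seq, and Plus and the function canonical are hypothetical names chosen here for illustration. The sketch assumes a query is a tree whose inner nodes are either order-sensitive (as in chat) or order-insensitive (as is often the case for retrieved RAG fragments), and it sorts the children of order-insensitive nodes so that two queries differing only in fragment order map to the same form, which is one plausible way a server could raise its cache hit rate.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Text:
    """Leaf node: a piece of input text (hypothetical name)."""
    content: str


@dataclass(frozen=True)
class Seq:
    """Inner node whose children do NOT commute, e.g. chat turns (hypothetical name)."""
    children: Tuple["Node", ...]


@dataclass(frozen=True)
class Plus:
    """Inner node whose children DO commute, e.g. RAG fragments (hypothetical name)."""
    children: Tuple["Node", ...]


Node = Union[Text, Seq, Plus]


def canonical(node: Node) -> Node:
    """Return an equivalent tree with commuting children in a fixed order."""
    if isinstance(node, Text):
        return node
    kids = tuple(canonical(child) for child in node.children)
    if isinstance(node, Plus):
        # Order is irrelevant here, so any deterministic order works.
        # Sorting by repr is an arbitrary choice made for this sketch.
        return Plus(tuple(sorted(kids, key=repr)))
    # Seq: order carries meaning, so it is preserved as given.
    return Seq(kids)


# Two RAG-style queries that differ only in the order of retrieved fragments.
rag_a = Seq((Text("system"), Plus((Text("doc A"), Text("doc B"))), Text("question")))
rag_b = Seq((Text("system"), Plus((Text("doc B"), Text("doc A"))), Text("question")))

# Two chat-style queries that differ in turn order.
chat_a = Seq((Text("turn 1"), Text("turn 2")))
chat_b = Seq((Text("turn 2"), Text("turn 1")))

same_rag = canonical(rag_a) == canonical(rag_b)
same_chat = canonical(chat_a) == canonical(chat_b)
```

Under these assumptions, same_rag is True because the two fragment orders collapse to one canonical form, while same_chat is False because reordering chat turns changes the query.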
Taxonomy
Topics: Caching and Content Delivery · Parallel Computing and Optimization Techniques · Distributed Systems and Fault Tolerance
