Using Span Queries to Optimize for Cache and Attention Locality

Paul Castro; Nick Mitchell; Nathan Ordonez; Thomas Parnell; Mudhakar Srivatsa; Antoni Viros i Martin

arXiv:2511.02749 · cs.AI · November 5, 2025

Open Access

TL;DR

This paper introduces span queries as a flexible interface to inference servers. The interface lets a server optimize for workloads such as chat, RAG, and reasoning, which improves cache and attention locality and yields significant performance gains.

Contribution

The paper proposes span queries as a unified, expressive interface for inference workloads, with automatic optimization for cache and attention locality, and demonstrates practical implementation and performance gains.

Findings

01. Achieved 10-20x reductions in time to first token (TTFT) for non-chat workloads.

02. Enabled high-performance execution with a small code change in vLLM.

03. Improved attention locality to outperform larger models on accuracy.

Abstract

Clients are evolving beyond chat completion, and now include a variety of innovative inference-time scaling and deep reasoning techniques. At the same time, inference servers remain heavily optimized for chat completion. Prior work has shown that large improvements to KV cache hit rate are possible if inference servers evolve towards these non-chat use cases. However, these efforts offer solutions that are also optimized for a single use case, RAG. In this paper, we introduce the span query to generalize the interface to the inference server. We demonstrate that chat, RAG, inference-time scaling, and agentic workloads can all be expressed as span queries. We show that the critical distinction assumed by prior work lies in whether the order of the inputs matters -- do they commute? In chat, they do not. In RAG, they often do. This paper introduces span queries, which are expression…
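
The commutativity distinction in the abstract can be illustrated with a small sketch. This is not the paper's syntax or its optimizer: the class names `Text`, `Ordered`, and `Commuting` and the function `flatten` are hypothetical, and the technique shown (sorting commuting inputs into a canonical order so that equivalent requests share an identical prefix) is a simplified stand-in chosen only to show why knowing that inputs commute can help cache reuse.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Text:
    """A leaf span: one piece of input text (hypothetical type)."""
    value: str


@dataclass(frozen=True)
class Ordered:
    """Children whose order matters, e.g. turns in a chat (hypothetical type)."""
    children: Tuple["Node", ...]


@dataclass(frozen=True)
class Commuting:
    """Children assumed to commute, e.g. retrieved RAG documents (hypothetical type)."""
    children: Tuple["Node", ...]


Node = Union[Text, Ordered, Commuting]


def flatten(node: Node) -> Tuple[str, ...]:
    """Flatten a query tree into a sequence of spans.

    Children of a Commuting node are sorted into a canonical order, so two
    queries that differ only in the order of commuting inputs produce the
    same sequence. Children of an Ordered node keep their given order.
    """
    if isinstance(node, Text):
        return (node.value,)
    parts = [flatten(child) for child in node.children]
    if isinstance(node, Commuting):
        parts.sort()  # any deterministic key would do; lexical order is an assumption
    return tuple(span for part in parts for span in part)


# Two RAG-style requests with the same documents retrieved in a different order.
rag_a = Ordered((Text("system"), Commuting((Text("doc B"), Text("doc A"))), Text("question")))
rag_b = Ordered((Text("system"), Commuting((Text("doc A"), Text("doc B"))), Text("question")))

# Two chat-style requests with the turns swapped.
chat_a = Ordered((Text("turn 1"), Text("turn 2")))
chat_b = Ordered((Text("turn 2"), Text("turn 1")))

same_rag_prefix = flatten(rag_a) == flatten(rag_b)
same_chat_prefix = flatten(chat_a) == flatten(chat_b)
```

Under these assumptions the two RAG-style requests flatten to the same sequence, so a prefix-based cache could serve both from one entry, while the two chat-style requests stay distinct because their order is significant.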

Taxonomy

Topics: Caching and Content Delivery · Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance