Accuracy Is Speed: Towards Long-Context-Aware Routing for Distributed LLM Serving
Takeshi Yoshimura, Valentijn Dymphnus van de Beek, Tatsuhiro Chiba

TL;DR
This paper introduces Time-to-Correct-Answer (TTCA), a metric for the wall-clock time needed to obtain the first correct response in long-context distributed LLM serving, and argues that accuracy becomes speed once incorrect responses trigger retries.
Contribution
It proposes Lightweight Accuracy-Aware Routing (LAAR), a capability-based routing design that reduces TTCA by treating accuracy as a system objective alongside latency and throughput.
Findings
Prompt characteristics such as length and language amplify accuracy variance, which in turn inflates TTCA.
LAAR reduces TTCA in long-context distributed LLM serving.
Accuracy-aware routing improves overall response reliability.
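
To illustrate why accuracy can matter as much as raw latency when choosing a backend, here is a minimal sketch. It is not the paper's LAAR algorithm, which this summary does not describe in detail. It assumes a simplified retry model in which each attempt on a backend takes a fixed latency and succeeds independently with a fixed probability, so the expected time to the first correct answer is latency divided by success probability. The names `Backend`, `expected_ttca`, and `pick_backend` and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Backend:
    """Hypothetical serving backend with assumed per-attempt statistics."""
    name: str
    latency_s: float   # assumed latency of one attempt, in seconds
    p_correct: float   # assumed probability that one attempt is correct


def expected_ttca(backend: Backend) -> float:
    """Expected time to first correct answer under independent retries.

    With success probability p per attempt, the expected number of attempts
    is 1/p (geometric distribution), so the expected time is latency / p.
    """
    if not 0.0 < backend.p_correct <= 1.0:
        raise ValueError("p_correct must be in (0, 1]")
    return backend.latency_s / backend.p_correct


def pick_backend(backends: Sequence[Backend]) -> Backend:
    """Choose the backend with the lowest expected time to a correct answer."""
    return min(backends, key=expected_ttca)


# Illustrative numbers only: the faster backend is less accurate on this prompt.
backends = [
    Backend("fast-small", latency_s=2.0, p_correct=0.25),
    Backend("slow-large", latency_s=4.0, p_correct=0.8),
]
best = pick_backend(backends)
print(best.name)
```

With these made-up numbers, the backend with twice the single-shot latency still wins, because the faster one is expected to need four attempts. A latency-only router would pick the other backend.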
Abstract
Distributed LLM serving systems optimize per-request latency and throughput. However, under long-context workloads, inference accuracy becomes more variable. When incorrect responses trigger retries, accuracy directly translates into cumulative user-visible delay that is not captured by single-shot latency metrics. In this work, we argue that under long-context serving, accuracy becomes speed through retry dynamics. We introduce Time-to-Correct-Answer (TTCA), a metric that measures the wall-clock time required to obtain the first correct response. Our measurement study shows that prompt characteristics such as length and language amplify accuracy variance, which inflates TTCA. We demonstrate Lightweight Accuracy-Aware Routing (LAAR), a capability-based routing design that reduces TTCA. Our results suggest that in long-context distributed serving, accuracy…
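
As a concrete reading of the TTCA definition, the sketch below computes it from a log of attempts for a single request. It assumes retries run back to back, so the wall-clock time is the sum of attempt latencies up to and including the first correct one. The `Attempt` record and the `ttca` helper are hypothetical names, not part of the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Attempt:
    """One attempt at answering a request (hypothetical log record)."""
    latency_s: float  # wall-clock duration of this attempt, in seconds
    correct: bool     # whether the response was judged correct


def ttca(attempts: Sequence[Attempt]) -> Optional[float]:
    """Time until the first correct response, or None if none was correct.

    Assumes attempts are listed in order and run sequentially with no idle
    time between them.
    """
    elapsed = 0.0
    for attempt in attempts:
        elapsed += attempt.latency_s
        if attempt.correct:
            return elapsed
    return None


# Two failed attempts followed by a correct one.
log = [Attempt(2.0, False), Attempt(2.5, False), Attempt(1.5, True)]
print(ttca(log))  # 6.0
```

Here the last attempt alone took 1.5 seconds, but the user waited 6.0 seconds for a correct answer. A single-shot latency metric misses that gap.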
