Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token (TTFT)

Rajveer Bachkaniwala; Chengqi Luo; Richard So; Divya Mahajan; Kexin Rong

arXiv:2604.16395 · cs.DB · May 19, 2026

TL;DR

Stream2LLM is a streaming-aware system that reduces time-to-first-token in LLM inference by overlapping retrieval and inference, employing adaptive scheduling and redundancy minimization techniques.

Contribution

The paper introduces a streaming-aware LLM serving system that combines adaptive scheduling, preemption, and cache optimization to serve concurrent requests in prefill-decode disaggregated deployments.

Findings

01. Up to 11x TTFT improvements with streaming architecture.

02. Cost-aware scheduling benefits under memory pressure.

03. Maintains throughput parity with non-streaming baselines.

Abstract

Context retrieval systems for LLM inference face a critical challenge: high retrieval latency creates a fundamental tension between waiting for complete context (poor time-to-first-token) and proceeding without it (reduced quality). Streaming context incrementally, overlapping retrieval with inference, can mitigate this latency, but doing so with concurrent requests introduces new challenges: requests contend for GPU compute and memory, and scheduling must adapt to dynamic context arrivals. We present Stream2LLM, a streaming-aware LLM serving system for concurrent prefill-decode disaggregated deployments. Stream2LLM introduces adaptive scheduling and preemption for two distinct retrieval patterns: append-mode (progressive context accumulation) and update-mode (iterative refinement with cache invalidation). It decouples scheduling decisions from resource acquisition, enabling flexible…
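The two retrieval patterns named in the abstract can be illustrated with a toy model. The sketch below is not taken from the paper or its repository: the class `StreamingRequest`, its methods, and the rule that update-mode reuses only the longest common token prefix are all assumptions made for illustration. It counts how many tokens would need (re)prefill when a new context chunk arrives in each mode.

```python
from dataclasses import dataclass, field
from typing import List


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Length of the longest shared prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class StreamingRequest:
    """Hypothetical bookkeeping for one streaming request (illustrative only)."""

    # Tokens whose prefill results are assumed to already be cached.
    cached: List[int] = field(default_factory=list)

    def append(self, chunk: List[int]) -> int:
        """Append-mode: context only grows, so only the new chunk needs prefill."""
        self.cached.extend(chunk)
        return len(chunk)

    def update(self, new_context: List[int]) -> int:
        """Update-mode: the context is replaced by a refined version.

        Assumption: cached work is reusable only for the shared prefix;
        everything after the first differing token is invalidated.
        """
        keep = common_prefix_len(self.cached, new_context)
        self.cached = list(new_context)
        return len(new_context) - keep


req = StreamingRequest()
first = req.append([1, 2, 3])       # three new tokens to prefill
second = req.append([4, 5])         # only the two appended tokens
redo = req.update([1, 2, 9, 9])     # prefix [1, 2] kept, rest recomputed
```

In this toy model append-mode never discards work, while update-mode pays for every token after the first point of divergence, which is one way to see why the two patterns might call for different scheduling and preemption decisions.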

Code & Models

Repositories

GitHub: rajveerb/stream2llm/tree/mlsys_artifact
