TL;DR
Stream2LLM is a streaming-aware LLM serving system that reduces time-to-first-token (TTFT) by overlapping context retrieval with inference, using adaptive scheduling and redundancy minimization techniques.
Contribution
The paper introduces Stream2LLM, a streaming-aware LLM serving system with adaptive scheduling, preemption, and cache optimization for concurrent prefill-decode disaggregated deployments.
Findings
The streaming architecture improves time-to-first-token (TTFT) by up to 11x.
Cost-aware scheduling provides benefits under memory pressure.
Throughput remains at parity with non-streaming baselines.
Abstract
Context retrieval systems for LLM inference face a critical challenge: high retrieval latency creates a fundamental tension between waiting for complete context (poor time-to-first-token) and proceeding without it (reduced quality). Streaming context incrementally--overlapping retrieval with inference--can mitigate this latency, but doing so with concurrent requests introduces new challenges: requests contend for GPU compute and memory, and scheduling must adapt to dynamic context arrivals. We present Stream2LLM, a streaming-aware LLM serving system for concurrent prefill-decode disaggregated deployments. Stream2LLM introduces adaptive scheduling and preemption for two distinct retrieval patterns: append-mode (progressive context accumulation) and update-mode (iterative refinement with cache invalidation). It decouples scheduling decisions from resource acquisition, enabling flexible…
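The difference between the two retrieval patterns named in the abstract can be illustrated with a small sketch. This is not Stream2LLM's implementation: the function names (`common_prefix_len`, `plan_prefill`) and the token-list representation are hypothetical, and the sketch assumes a simple prefix-cache model in which cached entries are reusable only up to the first token where the old and new contexts differ.

```python
from typing import List, Tuple


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Length of the longest shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def plan_prefill(
    cached: List[int], incoming: List[int], mode: str
) -> Tuple[List[int], int, int]:
    """Return (new_context, reused_tokens, tokens_to_prefill).

    Hypothetical model, not the paper's algorithm:
    - "append": incoming tokens extend the cached context, so the
      whole cache is reused and only the new chunk is prefilled.
    - "update": incoming tokens replace the context, so only the
      shared prefix survives and the rest of the cache is invalidated.
    """
    if mode == "append":
        new_context = cached + incoming
        reused = len(cached)
    elif mode == "update":
        new_context = list(incoming)
        reused = common_prefix_len(cached, incoming)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return new_context, reused, len(new_context) - reused


if __name__ == "__main__":
    # Append-mode: a new retrieval chunk arrives after three cached tokens.
    print(plan_prefill([1, 2, 3], [4, 5], "append"))
    # Update-mode: the refined context diverges after two tokens.
    print(plan_prefill([1, 2, 3, 4], [1, 2, 9, 9, 9], "update"))
```

Under this toy model, append-mode never discards cached work, while the cost of update-mode depends on how early the refined context diverges from the previous one. That difference is one reason a scheduler might treat the two patterns differently.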
