TL;DR
Cascadia is a novel cascade serving framework that optimizes request routing and system deployment for large language models, significantly improving latency, throughput, and quality in LLM serving systems.
Contribution
It introduces a bi-level optimization approach combining mixed-integer linear programming and Chebyshev-guided routing to efficiently manage heterogeneous LLM workloads.
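As a rough illustration of what "Chebyshev-guided" selection can look like, the sketch below applies the generic weighted Chebyshev scalarization: among candidate routing thresholds, it picks the one whose (latency, quality loss) pair lies closest to an ideal point in the weighted max-norm. This is the textbook technique, not necessarily the formulation Cascadia uses. The names `chebyshev_score`, `pick_threshold`, and `toy_evaluate`, as well as the toy cost model and the weights, are all assumptions made for the example.

```python
def chebyshev_score(objectives, ideal, weights):
    """Weighted Chebyshev distance to the ideal point (lower is better)."""
    return max(w * abs(f - z) for f, z, w in zip(objectives, ideal, weights))


def pick_threshold(candidates, evaluate, ideal, weights):
    """Return the candidate whose objective vector is closest to the ideal point."""
    return min(candidates, key=lambda c: chebyshev_score(evaluate(c), ideal, weights))


def toy_evaluate(threshold):
    """Hypothetical cost model: a higher escalation threshold sends more
    requests to the large model, raising latency but lowering quality loss."""
    latency = 1.0 + 4.0 * threshold               # seconds, made-up numbers
    quality_loss = 0.5 * (1.0 - threshold) ** 2   # made-up numbers
    return (latency, quality_loss)


# Candidate thresholds 0.0, 0.1, ..., 1.0.
candidates = [i / 10 for i in range(11)]
best = pick_threshold(
    candidates,
    toy_evaluate,
    ideal=(1.0, 0.0),      # best achievable latency and quality loss in the toy model
    weights=(0.25, 1.0),   # assumed trade-off weights
)
```

In a bi-level setup, a step like this would sit in the routing level, with the deployment level (for example a MILP over GPU allocations) supplying the latency estimates that `toy_evaluate` fakes here.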
Findings
Meets up to 4× tighter latency SLOs and delivers up to 5× higher throughput.
Outperforms existing cascade serving baselines on diverse workloads.
Maintains high answer quality while optimizing system performance.
Abstract
Recent advances in large language models (LLMs) have intensified the need to deliver both rapid responses and high-quality outputs. More powerful models yield better results but incur higher inference latency, whereas smaller models are faster yet less capable. Recent work proposes balancing this latency-quality trade-off using model cascades, which route simpler queries to smaller models and more complex ones to larger models. However, enabling efficient cascade serving remains challenging. Current frameworks lack effective mechanisms for handling (i) the huge and varying resource demands of different LLMs, (ii) the inherent heterogeneity of LLM workloads, and (iii) the co-optimization of system deployment and routing strategy. Motivated by these observations, we introduce Cascadia, a novel cascade serving framework designed explicitly to schedule request routing and deploy model…
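The cascade idea described in the abstract, where smaller models are tried first and harder queries escalate to larger ones, can be sketched as follows. This is a minimal sketch under stated assumptions: the `cascade` function, the per-stage thresholds, and the `judge` callable are hypothetical stand-ins, and the toy models and judge are placeholders rather than anything taken from the paper.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]          # prompt -> answer
Judge = Callable[[str, str], float]   # (prompt, answer) -> score in [0, 1]


def cascade(prompt: str, stages: List[Tuple[Model, float]], judge: Judge) -> Tuple[str, int]:
    """Try models from smallest to largest; return (answer, index of the stage used).

    A stage's answer is accepted when the judge score reaches that stage's
    threshold. The last stage is always accepted, since there is nowhere
    left to escalate.
    """
    for i, (model, threshold) in enumerate(stages):
        answer = model(prompt)
        is_last = i == len(stages) - 1
        if is_last or judge(prompt, answer) >= threshold:
            return answer, i
    raise ValueError("stages must not be empty")


# Placeholder models and judge, for illustration only.
small_model: Model = lambda p: "small:" + p
large_model: Model = lambda p: "large:" + p
toy_judge: Judge = lambda p, a: 1.0 if len(p) < 20 else 0.0  # pretend short prompts are easy

stages = [(small_model, 0.5), (large_model, 0.0)]
easy = cascade("2+2?", stages, toy_judge)
hard = cascade("Prove that the algorithm terminates.", stages, toy_judge)
```

Here `easy` is served by stage 0 and `hard` escalates to stage 1. The thresholds are the routing knobs: raising them sends more traffic to the large model, which is why routing and deployment interact.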
Peer Reviews
Decision·ICLR 2026 Poster
1. The joint formulation of resource allocation and adaptive inference via model cascades is a novel approach to cost-efficient LLM inference and is a very important problem that needs to be solved before cascades can be deployed in real world systems. The paper also considers heterogeneity in model and workload characteristics which are important considerations in real settings. 2. The paper proposes a viable solution to the problem via bi-level optimization that helps to find an appropriate d
1. Some of the details of the approach are not clearly explained. I have added several questions below around points that were not clear to me. 2. While the bi-level optimization itself doesn't seem to be taking too long to solve in online settings (Section 4.4), the latency of re-allocating the models/changing the parallelization may be high. 3. The approach does not consider prefix caching even though the traces used in the experiments do contain multi-turn conversations and prefix caching i
Strengths: 1. Efficiently serving multiple LLMs to balance accuracy and latency is an important topic. 2. The proposed cascading method can intuitively help a multi-model serving system. 3. Extensive experiments demonstrate the performance.
1. The main concern is real-time efficiency and cost. LLM serving is an online process. If GPT-4 is used to judge the small model's response, each GPT-4 call takes a few seconds and is expensive. 2. Time to first token is also very long: even a simple prompt must wait until GPT-4 finishes judging. 3. The baselines are insufficient. BERT-based routers [1, 2, 3] that directly route prompts to multiple LLMs could be compared. [1] https://github.com/vllm-project/semantic-router
1. This paper studies efficient and effective LLM serving, which is a critical problem for a wide range of real-world applications. 2. This paper introduces Cascadia with rigorous and solid technical developments. 3. Cascadia achieves up to 4x lower latency deadlines and 5x higher system throughput, which is impressive.
1. In Algorithm 1, Cascadia relies on iteratively optimizing both deployment and routing strategies. It remains unclear if the bi-level optimization is theoretically optimal or not. Such guarantees could be critical in practical scenarios. 2. LLM routing is another well-studied technique aiming for efficient & effective LLM serving, which is under-discussed in this paper. Authors may want to discuss and compare to this line of work to better position the contribution of this paper. Several examp
