Chimera: Latency- and Performance-Aware Multi-agent Serving for Heterogeneous LLMs
Kangqi Ni, Wenyue Hua, Xiaoxiang Shi, Jiang Guo, Shiyu Chang, Tianlong Chen

TL;DR
Chimera is a predictive scheduling system for multi-agent workflows on heterogeneous LLM clusters. It improves both end-to-end latency and task performance by combining semantic routing, workload prediction, and congestion estimation.
Contribution
Introduces Chimera, a system that schedules multi-agent LLM workflows across models of different sizes and capabilities, jointly improving end-to-end latency and task performance.
Findings
Reduces end-to-end latency by 1.2-2.4x compared to baselines.
Improves task performance by 8.0-9.5 percentage points.
Demonstrates effectiveness on code generation and math reasoning workflows.
Abstract
Multi-agent applications often execute complex tasks as multi-stage workflows, where each stage is an LLM call whose output becomes part of context for subsequent steps. Existing LLM serving systems largely assume homogeneous clusters with identical model replicas. This design overlooks the potential of heterogeneous deployments, where models of different sizes and capabilities enable finer trade-offs between latency and performance. However, heterogeneity introduces new challenges in scheduling across models with diverse throughput and performance. We present Chimera, a predictive scheduling system for multi-agent workflow serving on heterogeneous LLM clusters that jointly improves end-to-end latency and task performance. Chimera applies semantic routing to estimate per-model confidence scores for each request, predicts the total remaining output length of the workflow, and estimates…
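To make the scheduling idea concrete, the following is a minimal sketch of how per-model confidence, a predicted remaining output length, and a congestion estimate could be combined into one routing decision. It is not the authors' algorithm: the abstract is truncated here, and the names `ModelState`, `estimated_latency`, `pick_model`, the `min_confidence` threshold, and the queue-drain latency formula are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelState:
    """Hypothetical snapshot of one model replica in a heterogeneous cluster."""
    name: str
    tokens_per_second: float  # assumed decode throughput
    queued_tokens: int        # assumed congestion proxy: output tokens still pending


def estimated_latency(model: ModelState, predicted_remaining_tokens: int) -> float:
    """Assumed cost model: drain the queue, then generate the predicted tokens."""
    return (model.queued_tokens + predicted_remaining_tokens) / model.tokens_per_second


def pick_model(models, confidence, predicted_remaining_tokens, min_confidence=0.7):
    """Pick the fastest model whose confidence score clears the threshold.

    `confidence` maps model name -> score in [0, 1], standing in for the
    output of a semantic router. If no model clears the threshold, fall
    back to the most confident model regardless of latency.
    """
    eligible = [m for m in models if confidence[m.name] >= min_confidence]
    if not eligible:
        return max(models, key=lambda m: confidence[m.name])
    return min(eligible, key=lambda m: estimated_latency(m, predicted_remaining_tokens))


models = [
    ModelState("small", tokens_per_second=100.0, queued_tokens=200),
    ModelState("large", tokens_per_second=40.0, queued_tokens=0),
]

# Both models are confident enough, so the lower estimated latency wins:
# small = (200 + 400) / 100 = 6.0 s, large = (0 + 400) / 40 = 10.0 s.
easy = pick_model(models, {"small": 0.8, "large": 0.95}, predicted_remaining_tokens=400)
print(easy.name)  # small

# The small model is not confident on this request, so only "large" is eligible.
hard = pick_model(models, {"small": 0.5, "large": 0.95}, predicted_remaining_tokens=400)
print(hard.name)  # large
```

The threshold-then-minimize structure is one simple way to trade latency against task performance. A weighted score over confidence and latency would be an equally plausible design under the same assumptions.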
Taxonomy
Topics: Distributed and Parallel Computing Systems · Scientific Computing and Data Management · Cloud Computing and Resource Management
