Cost-Aware Contrastive Routing for LLMs
Reza Shirkavand, Shangqian Gao, Peiran Yu, Heng Huang

TL;DR
This paper introduces CSCR, a cost-sensitive routing framework for large language models that efficiently selects the most appropriate model based on prompt context, improving accuracy-cost tradeoffs with minimal latency.
Contribution
The paper presents a novel contrastive routing method that uses shared embeddings and fast lookup to enable cost-aware model selection without retraining.
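The lookup idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses brute-force NumPy cosine similarity as a stand-in for a FAISS index, and the names `route`, `expert_vecs`, `expert_costs`, and `cost_weight`, as well as the cost-penalized re-ranking step, are assumptions made for the example.

```python
import numpy as np

def l2_normalize(x):
    # Scale vectors to unit length so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def route(prompt_vec, expert_vecs, expert_costs, k=3, cost_weight=0.1):
    """Pick an expert index for one prompt (hypothetical helper).

    Step 1: retrieve the k experts nearest to the prompt in the shared space.
    Step 2: re-rank those k by similarity minus a cost penalty (an assumption
            of this sketch, not necessarily the paper's selection rule).
    """
    q = l2_normalize(prompt_vec)
    e = l2_normalize(expert_vecs)
    sims = e @ q                               # cosine similarity per expert
    k = min(k, len(sims))
    top_k = np.argsort(-sims)[:k]              # indices of the k nearest experts
    scores = sims[top_k] - cost_weight * expert_costs[top_k]
    return int(top_k[np.argmax(scores)])

# Toy pool: expert 0 is the closest match but expensive, expert 1 is nearly
# as close and much cheaper, expert 2 is unrelated to the prompt.
expert_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
expert_costs = np.array([5.0, 1.0, 0.5])
prompt_vec = np.array([1.0, 0.0])

chosen = route(prompt_vec, expert_vecs, expert_costs, k=2)
chosen_ignoring_cost = route(prompt_vec, expert_vecs, expert_costs, k=2, cost_weight=0.0)

# Adding an expert only appends a row and a cost; nothing is retrained.
expert_vecs_new = np.vstack([expert_vecs, [[0.95, 0.05]]])
expert_costs_new = np.append(expert_costs, 0.2)
chosen_after_add = route(prompt_vec, expert_vecs_new, expert_costs_new, k=3)
```

In this toy pool the cost penalty moves the choice from the closest expert to a cheaper near neighbor, and a newly appended expert becomes selectable immediately because routing depends only on the stored vectors and costs.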
Findings
CSCR improves the accuracy-cost tradeoff by up to 25%.
It generalizes well to unseen models and prompts.
Routing latency is reduced to microseconds.
Abstract
We study cost-aware routing for large language models across diverse and dynamic pools of models. Existing approaches often overlook prompt-specific context, rely on expensive model profiling, assume a fixed set of experts, or use inefficient trial-and-error strategies. We introduce Cost-Spectrum Contrastive Routing (CSCR), a lightweight framework that maps both prompts and models into a shared embedding space to enable fast, cost-sensitive selection. CSCR uses compact, fast-to-compute logit footprints for open-source models and perplexity fingerprints for black-box APIs. A contrastive encoder is trained to favor the cheapest accurate expert within adaptive cost bands. At inference time, routing reduces to a single k-NN lookup via a FAISS index, requiring no retraining when the expert pool changes and enabling microsecond latency. Across multiple benchmarks, CSCR consistently…
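The training signal described in the abstract, favoring the cheapest accurate expert, can be illustrated with a generic InfoNCE-style loss. The sketch below is an assumption-laden toy, not the paper's objective: it omits the adaptive cost bands entirely, and `cheapest_correct`, `info_nce`, and `temperature` are hypothetical names chosen for the example.

```python
import numpy as np

def cheapest_correct(correct, costs):
    """Index of the cheapest expert that answered correctly, or None."""
    idx = np.flatnonzero(correct)
    if idx.size == 0:
        return None
    return int(idx[np.argmin(costs[idx])])

def info_nce(prompt_vec, expert_vecs, positive, temperature=0.1):
    """Generic InfoNCE loss for one prompt against a pool of expert embeddings.

    The positive is one expert index; every other expert acts as a negative.
    """
    q = prompt_vec / np.linalg.norm(prompt_vec)
    e = expert_vecs / np.linalg.norm(expert_vecs, axis=1, keepdims=True)
    logits = (e @ q) / temperature
    logits = logits - logits.max()             # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[positive])

# Toy data: experts 0 and 1 answer correctly, expert 1 is the cheaper of the two.
expert_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
costs = np.array([5.0, 1.0, 0.5])
correct = np.array([True, True, False])

positive = cheapest_correct(correct, costs)
prompt_vec = np.array([0.9, 0.1])              # already aligned with expert 1

loss_aligned = info_nce(prompt_vec, expert_vecs, positive)
loss_misaligned = info_nce(prompt_vec, expert_vecs, 2)
```

Minimizing a loss of this shape pulls each prompt embedding toward its cheapest correct expert and away from the others, which is what lets a nearest-neighbor lookup act as the router at inference time.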
Taxonomy
Topics: Peer-to-Peer Network Technologies · Caching and Content Delivery · Advanced Optical Network Technologies
