RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization
Dongxin Guo, Jikun Wu, Siu Ming Yiu

TL;DR
RouteNLP is a closed-loop framework that routes NLP queries across a tiered model portfolio, cutting inference costs by 58% in a real deployment and by 40-85% on benchmark tasks while maintaining high quality and low latency.
Contribution
The paper introduces a novel routing framework combining confidence calibration, conformal prediction, and distillation co-optimization to significantly reduce NLP inference costs.
Findings
Reduced inference costs by 58% in real deployment.
Achieved 40-85% cost reduction on benchmark tasks.
Maintained high quality with 96-100% accuracy and 74.5% human-rated output quality.
Abstract
Serving diverse NLP workloads with large language models is costly: at one enterprise partner, inference costs exceeded $200K/month despite over 70% of queries being routine tasks well within the capability of smaller models. We present RouteNLP, a closed-loop framework that routes queries across a tiered model portfolio to minimize cost while satisfying per-task quality constraints. The framework integrates three components: a difficulty-aware router with shared task-conditioned representations trained on preference data and quality signals; confidence-calibrated cascading that uses conformal prediction for distribution-free threshold initialization; and a distillation-routing co-optimization loop that clusters escalation failures, applies targeted knowledge distillation to cheaper models, and automatically retrains the router, yielding over twice the cost improvement of untargeted…
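The abstract says only that conformal prediction supplies a distribution-free initialization for the cascade thresholds; it does not give the procedure. As a minimal sketch of one way such an initialization could work, the code below computes a split-conformal quantile over the cheap model's confidence on calibration queries where its answer was judged unacceptable. The function names, the escalation rule (escalate when confidence is at or below the threshold), and the sample data are all assumptions for illustration, not the authors' implementation.

```python
import math


def conformal_threshold(failure_confidences, alpha):
    """Split-conformal quantile of cheap-model confidence on calibration failures.

    Hypothetical helper: `failure_confidences` holds the cheap model's confidence
    on held-out queries where its answer was unacceptable. If a new failing query
    is exchangeable with these, its confidence is <= the returned threshold with
    probability at least 1 - alpha, so escalating at or below the threshold
    catches at least that share of failures.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    n = len(failure_confidences)
    if n == 0:
        raise ValueError("need at least one calibration failure")
    # 1-indexed rank of the conformal quantile; the small tolerance guards
    # against floating-point round-off pushing an exact integer upward.
    k = math.ceil((n + 1) * (1 - alpha) - 1e-9)
    if k > n:
        # Too few calibration failures for this alpha: escalate everything.
        return float("inf")
    return sorted(failure_confidences)[k - 1]


def should_escalate(confidence, threshold):
    """Assumed cascade rule: send the query to the next tier if confidence is low."""
    return confidence <= threshold


# Illustrative calibration data (made up for this sketch).
calibration_failures = [0.2, 0.35, 0.4, 0.5, 0.55, 0.7, 0.9]
tau = conformal_threshold(calibration_failures, alpha=0.25)
```

With seven calibration failures and alpha = 0.25, the rank is ceil(8 × 0.75) = 6, so the threshold is the sixth-smallest confidence. The guarantee here covers only the fraction of failures that get escalated; it does not bound cost, and a closed-loop system would presumably adjust the threshold after this initialization.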
