Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing

Yiqun Zhang; Hao Li; Jianhao Chen; Hangfan Zhang; Peng Ye; Lei Bai; Shuyue Hu

arXiv:2508.12631·cs.CL·October 23, 2025

TL;DR

Avengers-Pro is a test-time routing framework that dynamically ensembles LLMs of different capacities to optimize the balance between performance and efficiency, achieving state-of-the-art results across multiple benchmarks.

Contribution

The paper introduces Avengers-Pro, a test-time routing method that ensembles diverse LLMs, allowing flexible performance-efficiency tradeoffs and a better accuracy-to-cost balance than any single model.

Findings

01

Surpasses the strongest single model (GPT-5-medium) by 7% in average accuracy

02

Achieves 27% lower cost while matching top accuracy

03

Reaches ~90% of top model performance at 63% lower cost

Abstract

Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies, providing a unified solution for all performance-efficiency tradeoffs. Avengers-Pro embeds and clusters incoming queries, then routes each query to the most suitable model based on a performance-efficiency score. Across 6 challenging benchmarks and 8 leading models (including GPT-5-medium, Gemini-2.5-pro, and Claude-opus-4.1), Avengers-Pro achieves state-of-the-art results: by varying a performance-efficiency trade-off parameter, it can surpass the strongest single model (GPT-5-medium) by 7% in average accuracy.…
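The embed, cluster, and route steps described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the function names (`nearest_cluster`, `normalize`, `route`), the toy centroids, the per-cluster accuracy and cost numbers, and the model names `small` and `large` are all hypothetical. In particular, the abstract does not give the form of the performance-efficiency score; the sketch assumes one plausible form, a convex combination of min-max-normalized accuracy and cost weighted by a trade-off parameter `alpha`. Query embeddings are taken as precomputed vectors, and the clustering step that would produce the centroids (for example k-means) is omitted.

```python
from math import dist

def nearest_cluster(embedding, centroids):
    """Return the index of the centroid closest to the query embedding."""
    return min(range(len(centroids)), key=lambda k: dist(embedding, centroids[k]))

def normalize(values):
    """Min-max normalize a {model: value} mapping to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {m: (v - lo) / span if span else 0.0 for m, v in values.items()}

def route(embedding, centroids, accuracy, cost, alpha):
    """Pick the model with the best score in the query's cluster.

    ASSUMED score (not taken from the source):
        alpha * normalized_accuracy + (1 - alpha) * (1 - normalized_cost)
    alpha = 1 favors accuracy only; alpha = 0 favors low cost only.
    """
    k = nearest_cluster(embedding, centroids)
    acc = normalize(accuracy[k])
    cst = normalize(cost[k])
    scores = {m: alpha * acc[m] + (1 - alpha) * (1 - cst[m]) for m in acc}
    return max(scores, key=scores.get)

# Hypothetical toy data: two clusters, two models.
centroids = [(0.0, 0.0), (1.0, 1.0)]
accuracy = [
    {"small": 0.70, "large": 0.75},  # cluster 0: models perform similarly
    {"small": 0.40, "large": 0.90},  # cluster 1: the large model is far better
]
cost = [
    {"small": 1.0, "large": 10.0},
    {"small": 1.0, "large": 10.0},
]

query = (0.9, 1.1)  # falls in cluster 1
print(route(query, centroids, accuracy, cost, alpha=1.0))  # large
print(route(query, centroids, accuracy, cost, alpha=0.0))  # small
```

Sweeping `alpha` between 0 and 1 in a sketch like this traces out a range of accuracy-cost operating points, which is the kind of trade-off the abstract attributes to the performance-efficiency parameter.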

Taxonomy

Topics: Scheduling and Optimization Algorithms