Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts
Saloni Garg, Amit Sagtani

TL;DR
This study reveals that evaluation artifacts significantly inflate the perceived unsolvability ceiling in multi-LLM routing, affecting model assessment and routing strategies.
Contribution
It identifies key evaluation artifacts that distort unsolvability measurements and proposes validation methods to improve assessment accuracy in multi-LLM routing.
Findings
Evaluation artifacts inflate unsolvability estimates.
Dual-judge validation reduces measured unsolvability.
Standard routing collapses to majority-class prediction (~79%).
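To make the majority-class finding concrete, here is a minimal sketch of the baseline a router has to beat. The tier labels ("small", "large") and the 79/21 split are hypothetical, chosen only to mirror the reported figure; they are not the paper's data.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_baseline_accuracy(labels: Sequence[Hashable]) -> float:
    """Accuracy of a router that always predicts the most frequent label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    # most_common(1) returns [(label, count)] for the most frequent label.
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


# Hypothetical routing targets: which tier is the cheapest capable model.
toy_labels = ["small"] * 79 + ["large"] * 21
baseline = majority_baseline_accuracy(toy_labels)
```

A learned router whose accuracy sits at this baseline has picked up little beyond the class prior, which is one reading of "collapses to majority-class prediction".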
Abstract
Efficient routing across multiple LLMs enables cost-quality tradeoffs by directing queries to the cheapest capable model. Prior work attributes routing headroom to an "unsolvability ceiling": queries that no model in the pool can solve. We present a large-scale study of multi-tier LLM routing with 206,000 query-model pairs across six benchmarks (MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT) using the Gemma 4 and Llama 3.1 families. Evaluating with both LLM-as-a-judge and exact-match metrics, we show that a substantial portion of reported unsolvability stems from evaluation artifacts: (i) systematic judge biases favoring verbosity over correctness, (ii) truncation under fixed generation budgets, and (iii) output format mismatches. Through dual-judge validation and exact-match grounding, we reduce measured unsolvability across tasks. We introduce a decomposition framework attributing…
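The unsolvability measurement described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the two judges and exact match are combined, so the rule here (exact match is authoritative when a reference exists; otherwise a response fails only if both judges reject it) is hypothetical, as are the names `Verdict`, `solved_single`, `solved_validated`, and the toy data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Verdict:
    judge_a: bool                       # first LLM judge says "correct"
    judge_b: bool                       # second LLM judge says "correct"
    exact_match: Optional[bool] = None  # None when no reference answer exists


def solved_single(v: Verdict) -> bool:
    """Single-judge scoring: trust judge A alone."""
    return v.judge_a


def solved_validated(v: Verdict) -> bool:
    """Assumed combination rule (not taken from the paper)."""
    if v.exact_match is not None:
        return v.exact_match           # ground in exact match when possible
    return v.judge_a or v.judge_b      # fail only if both judges reject


def unsolvability_rate(
    results: Dict[str, Dict[str, Verdict]],
    solved: Callable[[Verdict], bool],
) -> float:
    """Fraction of queries that no model in the pool solves."""
    if not results:
        raise ValueError("results must be non-empty")
    unsolved = sum(
        1 for per_model in results.values()
        if not any(solved(v) for v in per_model.values())
    )
    return unsolved / len(results)


# Toy data: query id -> model name -> verdicts (all values are made up).
toy = {
    "q1": {"small": Verdict(False, False, True), "large": Verdict(False, True, True)},
    "q2": {"small": Verdict(False, False), "large": Verdict(False, False)},
    "q3": {"small": Verdict(True, True), "large": Verdict(True, True)},
    "q4": {"small": Verdict(False, True), "large": Verdict(False, False)},
}

single_rate = unsolvability_rate(toy, solved_single)        # 0.75
validated_rate = unsolvability_rate(toy, solved_validated)  # 0.25
```

In this toy pool, q1 and q4 look unsolvable under a single judge but are recovered once exact match or a second judge is consulted, which is the kind of artifact-driven inflation the abstract describes.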
