Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts

Saloni Garg; Amit Sagtani

arXiv:2605.07395·cs.LG·May 11, 2026

TL;DR

Evaluation artifacts (judge bias toward verbosity, truncation under fixed generation budgets, and output format mismatches) account for a substantial portion of the unsolvability reported in multi-LLM routing, which distorts both model assessment and routing strategies.

Contribution

The paper identifies the evaluation artifacts that distort unsolvability measurements and applies dual-judge validation and exact-match grounding to make assessment in multi-LLM routing more accurate.

Findings

01 Evaluation artifacts inflate unsolvability estimates.

02 Dual-judge validation reduces measured unsolvability.

03 Standard routing collapses to majority-class prediction (~79%).
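
To make the majority-class finding concrete, here is a minimal sketch of the baseline a router has to beat. It assumes the ~79% figure is the share of the most common routing label. The label names and the 79/21 split are illustrative placeholders, not data from the paper.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy of a router that always predicts the most common label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Hypothetical routing targets: the cheapest capable model for each query.
# The split is chosen only to mirror the ~79% figure quoted above.
labels = ["small"] * 79 + ["large"] * 21

baseline = majority_class_accuracy(labels)
print(f"majority-class baseline: {baseline:.2f}")
```

A learned router whose accuracy matches this baseline has not learned anything beyond the label prior, which is what "collapse" means here.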

Abstract

Efficient routing across multiple LLMs enables cost-quality tradeoffs by directing queries to the cheapest capable model. Prior work attributes limits on routing headroom to an "unsolvability ceiling": queries that no model in the pool can solve. We present a large-scale study of multi-tier LLM routing with 206,000 query-model pairs across six benchmarks (MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT) using the Gemma 4 and Llama 3.1 families. Evaluating with both LLM-as-a-judge and exact-match metrics, we show that a substantial portion of reported unsolvability stems from evaluation artifacts: (i) systematic judge biases favoring verbosity over correctness, (ii) truncation under fixed generation budgets, and (iii) output format mismatches. Through dual-judge validation and exact-match grounding, we reduce measured unsolvability across tasks. We introduce a decomposition framework attributing…
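
The unsolvability ceiling described in the abstract can be sketched as the fraction of queries for which no model in the pool is marked correct. The example below is a minimal illustration under stated assumptions: the record fields (`judge_a`, `judge_b`, `exact`), the toy data, and the aggregation rule (a response counts as failed only if both judges reject it and exact match, where available, also fails) are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def unsolvable_rate(pairs, is_correct):
    """Fraction of queries for which no model in the pool is marked correct."""
    solved = defaultdict(bool)
    for p in pairs:
        solved[p["query"]] |= bool(is_correct(p))
    return sum(not s for s in solved.values()) / len(solved)


def single_judge(p):
    # Trust one LLM judge alone.
    return p["judge_a"]


def validated(p):
    # Assumed rule: accept a response if either judge or exact match accepts it.
    # `exact` is None for open-ended tasks that have no reference answer.
    return p["judge_a"] or p["judge_b"] or bool(p["exact"])


# Hypothetical query-model pairs for a two-model pool.
pairs = [
    {"query": "q1", "model": "small", "judge_a": False, "judge_b": False, "exact": False},
    {"query": "q1", "model": "large", "judge_a": False, "judge_b": True, "exact": True},
    {"query": "q2", "model": "small", "judge_a": False, "judge_b": False, "exact": False},
    {"query": "q2", "model": "large", "judge_a": False, "judge_b": False, "exact": False},
    {"query": "q3", "model": "small", "judge_a": True, "judge_b": True, "exact": True},
    {"query": "q3", "model": "large", "judge_a": True, "judge_b": True, "exact": True},
    {"query": "q4", "model": "small", "judge_a": False, "judge_b": False, "exact": None},
    {"query": "q4", "model": "large", "judge_a": False, "judge_b": False, "exact": None},
]

rate_single = unsolvable_rate(pairs, single_judge)
rate_validated = unsolvable_rate(pairs, validated)
print(rate_single, rate_validated)
```

In this toy data, `q1` looks unsolvable under the single judge but is recovered once the second judge and exact match are consulted, so the measured rate drops. That is the kind of artifact-driven inflation the abstract describes.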
