Fail Fast, or Ask: Mitigating the Deficiencies of Reasoning LLMs with Human-in-the-Loop Systems Engineering
Michael J. Zellinger, Matt Thomson

TL;DR
This paper proposes a human-in-the-loop system in which a reasoning LLM defers queries it cannot confidently answer to a human expert, lowering the error rate. Fronting the reasoning model with a non-reasoning model additionally reduces latency and improves cost-effectiveness in high-volume, risk-sensitive applications.
Contribution
It introduces a hybrid approach combining non-reasoning and reasoning models with human oversight, demonstrating significant error reduction and latency improvements without internal model access.
Findings
Deferring uncertain queries to a human cuts the error rate of Qwen3 235B-A22B on difficult MATH problems from 3% to below 1% when 7.5% of queries are deferred.
Fronting the reasoning model with a non-reasoning model reduces latency by around 40%.
Latency drag affects the expected latency savings, especially on easier queries.
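To make the latency arithmetic behind these findings concrete, here is a minimal sketch of an idealized two-stage cascade. It is not the paper's latency model: the function names, the latencies, and the fraction of queries answered by the front model are all hypothetical, and the sketch assumes every query pays the front model's latency while forwarded queries additionally pay a fixed reasoning-model latency.

```python
def expected_latency(front_latency, reasoning_latency, handled_by_front):
    """Expected per-query latency of an idealized two-stage cascade.

    Assumptions (illustrative, not taken from the paper):
    - every query first pays `front_latency`;
    - a fraction `handled_by_front` is answered by the front model;
    - the remaining queries also pay a fixed `reasoning_latency`.
    """
    if not 0.0 <= handled_by_front <= 1.0:
        raise ValueError("handled_by_front must lie in [0, 1]")
    return front_latency + (1.0 - handled_by_front) * reasoning_latency


def relative_saving(front_latency, reasoning_latency, handled_by_front):
    """Fraction of latency saved versus always calling the reasoning model."""
    cascade = expected_latency(front_latency, reasoning_latency, handled_by_front)
    return 1.0 - cascade / reasoning_latency


if __name__ == "__main__":
    # Hypothetical numbers chosen only to illustrate the formula.
    print(relative_saving(front_latency=1.0, reasoning_latency=10.0,
                          handled_by_front=0.5))
```

The fixed `reasoning_latency` is the simplification to be wary of: if forwarded queries take the reasoning model longer than an average query, realized savings would fall short of this estimate.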
Abstract
State-of-the-art reasoning LLMs are powerful problem solvers, but they still occasionally make mistakes. However, adopting AI models in risk-sensitive domains often requires error rates near 0%. To address this gap, we propose collaboration between a reasoning model and a human expert who resolves queries the model cannot confidently answer. We find that quantifying the uncertainty of a reasoning model through the length of its reasoning trace yields an effective basis for deferral to a human, e.g., cutting the error rate of Qwen3 235B-A22B on difficult MATH problems from 3% to less than 1% when deferring 7.5% of queries. However, the high latency of reasoning models still makes them challenging to deploy on use cases with high query volume. To address this challenge, we explore fronting a reasoning model with a large non-reasoning model. We call this modified human-in-the-loop system…
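The deferral rule described in the abstract can be sketched as a threshold on reasoning-trace length. This is a minimal illustration, not the authors' implementation: it assumes that longer traces signal higher uncertainty, and the names `deferral_threshold` and `route`, along with the calibration data, are hypothetical.

```python
import math


def deferral_threshold(trace_lengths, defer_fraction):
    """Pick a trace-length cutoff that defers roughly `defer_fraction` of queries.

    Assumption (illustrative): longer reasoning traces indicate higher
    uncertainty, so the longest traces are the ones sent to a human.
    """
    if not 0.0 <= defer_fraction <= 1.0:
        raise ValueError("defer_fraction must lie in [0, 1]")
    ranked = sorted(trace_lengths, reverse=True)
    k = math.floor(defer_fraction * len(ranked))
    if k == 0:
        return float("inf")  # defer nothing
    # The k-th longest trace becomes the cutoff; ties at the cutoff
    # may push the realized deferral rate slightly above the target.
    return ranked[k - 1]


def route(trace_length, threshold):
    """Return 'human' for deferred queries and 'model' otherwise."""
    return "human" if trace_length >= threshold else "model"


if __name__ == "__main__":
    # Hypothetical trace lengths (in tokens) from a calibration set.
    calibration = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
    cutoff = deferral_threshold(calibration, defer_fraction=0.2)
    print([route(n, cutoff) for n in (150, 850, 950)])
```

Only the model's output is used here, which is consistent with the summary's point that the approach needs no access to model internals.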
