Fail Fast, or Ask: Mitigating the Deficiencies of Reasoning LLMs with Human-in-the-Loop Systems Engineering

Michael J. Zellinger; Matt Thomson

arXiv:2507.14406·cs.AI·July 22, 2025

TL;DR

This paper proposes a human-in-the-loop system in which a reasoning LLM defers queries it cannot confidently answer to a human expert, which reduces errors. Fronting the reasoning model with a large non-reasoning model additionally reduces latency, improving cost-effectiveness in high-volume, risk-sensitive applications.

Contribution

The paper introduces a hybrid system that combines a non-reasoning model, a reasoning model, and human oversight, and demonstrates significant error reduction and latency improvements without requiring access to model internals.

Findings

01

Deferring 7.5% of queries to a human expert, selected by reasoning-trace length, cuts the error rate of Qwen3 235B-A22B on difficult MATH problems from 3% to below 1%.

02

Fronting the reasoning model with a large non-reasoning model reduces latency by around 40%.

03

Latency drag affects the expected latency savings, especially on easier queries.

Abstract

State-of-the-art reasoning LLMs are powerful problem solvers, but they still occasionally make mistakes. However, adopting AI models in risk-sensitive domains often requires error rates near 0%. To address this gap, we propose collaboration between a reasoning model and a human expert who resolves queries the model cannot confidently answer. We find that quantifying the uncertainty of a reasoning model through the length of its reasoning trace yields an effective basis for deferral to a human, e.g., cutting the error rate of Qwen3 235B-A22B on difficult MATH problems from 3% to less than 1% when deferring 7.5% of queries. However, the high latency of reasoning models still makes them challenging to deploy on use cases with high query volume. To address this challenge, we explore fronting a reasoning model with a large non-reasoning model. We call this modified human-in-the-loop system…
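
The deferral rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that a longer reasoning trace signals higher uncertainty, that trace length is measured in tokens, and that the threshold is calibrated on held-out queries to hit a target deferral rate. The names `ModelAnswer`, `calibrate_threshold`, and `route` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelAnswer:
    """Hypothetical container for one reasoning-model response."""
    answer: str
    trace_tokens: int  # length of the reasoning trace, assumed to be in tokens


def calibrate_threshold(trace_lengths, deferral_rate):
    """Pick a trace-length threshold so that roughly `deferral_rate` of the
    calibration queries have traces longer than the threshold.

    Assumption (not stated in the source): longer traces mean more uncertainty.
    """
    if not trace_lengths:
        raise ValueError("need at least one calibration trace length")
    if not 0.0 <= deferral_rate <= 1.0:
        raise ValueError("deferral_rate must lie in [0, 1]")
    ordered = sorted(trace_lengths)
    n_keep = round(len(ordered) * (1.0 - deferral_rate))
    if n_keep <= 0:
        return ordered[0] - 1  # every calibration query exceeds this value
    return ordered[n_keep - 1]


def route(result, threshold):
    """Send the query to the human expert if the trace is unusually long."""
    return "human" if result.trace_tokens > threshold else "model"


# Synthetic calibration data: 200 queries with trace lengths 1..200.
calibration = list(range(1, 201))
threshold = calibrate_threshold(calibration, deferral_rate=0.075)
deferred = sum(
    route(ModelAnswer(answer="", trace_tokens=n), threshold) == "human"
    for n in calibration
)
```

On this synthetic data the rule defers the 15 longest traces out of 200, matching the 7.5% deferral rate quoted in the abstract. Ties in trace length can shift the realized deferral rate slightly below the target.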
