ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning
Fan Huang

TL;DR
ReFlect is a model-agnostic, inference-time harness that improves the long-horizon reasoning accuracy of LLMs by wrapping the model in standalone error detection and recovery logic, significantly reducing silent errors in multi-stage tasks.
Contribution
The paper introduces ReFlect, a deterministic wrapper around an LLM that improves reasoning reliability through standalone error detection and recovery, and reports that it outperforms existing reasoning paradigms.
Findings
ReFlect achieves task success rates from 41% (gpt-4o-mini) to 56% (Claude Sonnet 4.5) across six models.
It raises SWE-bench patch-structural quality from 0% to over 80%.
Harness gains are inversely proportional to baseline success rates.
Abstract
Current reasoning paradigms for LLMs include chain-of-thought, ReAct, and post-hoc self-critique. These paradigms rely on two assumptions that fail on long-horizon, multi-stage tasks. As a result, errors accumulate silently across reasoning steps, leaving an open question: can a reasoning system effectively detect and recover from its own failures? We present ReFlect, a harness system for LLM reasoning that creates standalone error detection and recovery logic as a deterministic wrapper around the model. Controlled experiments across 6 reasoning domains show that prompt-level self-critique produces formulaic templates that flag no issues in 90 of 100 audited reflection blocks, and the investigated LLMs wrongly accept a wrong answer in at least 76% of cases. Our ReFlect harness achieves task success rates ranging from 41% on gpt-4o-mini to 56% on Claude Sonnet 4.5 across six…
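The abstract describes the harness only at a high level: error detection and recovery logic that lives outside the model, in a deterministic wrapper. The sketch below is one way such a wrapper could look, and it is not the authors' implementation. Every name in it (`Stage`, `HarnessResult`, `run_with_harness`, `max_retries`) is hypothetical, and the retry-with-feedback recovery strategy is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical types for illustration; none of these names come from the paper.
Model = Callable[[str], str]            # prompt -> model output
Check = Callable[[str], Optional[str]]  # output -> error message, or None if it passes


@dataclass
class Stage:
    name: str
    prompt: str
    check: Check  # deterministic validator that runs outside the model


@dataclass
class HarnessResult:
    outputs: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)
    success: bool = True


def run_with_harness(model: Model, stages: List[Stage], max_retries: int = 2) -> HarnessResult:
    """Run stages in order; validate each output and retry with feedback on failure."""
    result = HarnessResult()
    for stage in stages:
        prompt = stage.prompt
        for attempt in range(max_retries + 1):
            output = model(prompt)
            error = stage.check(output)
            if error is None:
                result.outputs.append(output)
                break
            # Detection: record the failure instead of letting it pass silently.
            result.log.append(f"{stage.name} attempt {attempt + 1}: {error}")
            # Recovery (assumed strategy): re-prompt with the detected error.
            prompt = f"{stage.prompt}\n\nA previous attempt failed a check: {error}"
        else:
            # Retries exhausted: stop rather than pass a bad output downstream.
            result.success = False
            break
    return result


def make_flaky_model() -> Model:
    """Stand-in for an LLM: answers badly once, then correctly."""
    calls = {"n": 0}

    def model(prompt: str) -> str:
        calls["n"] += 1
        return "forty-two" if calls["n"] == 1 else "42"

    return model


def must_be_integer(output: str) -> Optional[str]:
    """Example check: the output must consist of digits only."""
    return None if output.strip().isdigit() else f"expected an integer, got {output!r}"


demo = run_with_harness(
    make_flaky_model(),
    [Stage("answer", "What is 6 * 7?", must_be_integer)],
)
```

In this toy run the first model output fails the check, the failure is logged, and the second attempt passes. The point of the design is that the check is ordinary code rather than a prompt, so it cannot be talked into accepting a wrong answer the way the abstract reports prompt-level self-critique can.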
