ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning

Fan Huang

arXiv:2605.05737·cs.AI·May 8, 2026

TL;DR

ReFlect is a model-agnostic, inference-time harness system that improves the long-horizon reasoning accuracy of LLMs by wrapping the model in standalone error detection and recovery logic, reducing silent errors in multi-stage tasks.

Contribution

The paper introduces ReFlect, a deterministic wrapper around an LLM that improves reasoning reliability through standalone error detection and recovery and outperforms existing reasoning paradigms.

Findings

01

ReFlect achieves task success rates ranging from 41% on gpt-4o-mini to 56% on Claude Sonnet 4.5 across six models.

02

It raises SWE-bench patch-structural quality from 0% to over 80%.

03

Harness gains are inversely proportional to baseline success rates.

Abstract

Current reasoning paradigms for LLMs include chain-of-thought, ReAct, and post-hoc self-critique. These paradigms rely on two assumptions that fail on long-horizon, multi-stage tasks. As a result, errors accumulate silently across reasoning steps, leaving an open question: can a reasoning system effectively detect and recover from its own failures? We present ReFlect, a harness system for LLM reasoning that creates standalone error detection and recovery logic as a deterministic wrapper around the model. Controlled experiments across 6 reasoning domains show that prompt-level self-critique produces formulaic templates that flag no issues in 90 of 100 audited reflection blocks, and the investigated LLMs wrongly accept a wrong answer in at least 76% of cases. Our ReFlect harness achieves task success rates ranging from 41% on gpt-4o-mini to 56% on Claude Sonnet 4.5 across six…
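
The general idea of a deterministic wrapper with standalone error detection and recovery can be illustrated with a minimal Python sketch. This is not the ReFlect implementation, which the abstract does not describe in detail. The names `run_stage`, `StageResult`, `is_integer`, and `make_flaky_model` are hypothetical, the model is assumed to be any callable from prompt to text, and the retry-with-feedback recovery policy is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical types: a model maps a prompt to text; a check returns
# None on success or a short error message on failure.
Model = Callable[[str], str]
Check = Callable[[str], Optional[str]]


@dataclass
class StageResult:
    output: str    # last output produced by the model
    attempts: int  # number of model calls made
    ok: bool       # True if the deterministic check passed


def run_stage(model: Model, prompt: str, check: Check,
              max_attempts: int = 3) -> StageResult:
    """Run one reasoning stage under a deterministic check.

    Detection happens in ordinary code (the check), not in the model.
    Recovery here is an assumed policy: retry with the check's message
    appended to the prompt.
    """
    feedback: Optional[str] = None
    output = ""
    for attempt in range(1, max_attempts + 1):
        full_prompt = prompt
        if feedback is not None:
            full_prompt = f"{prompt}\n\nPrevious attempt failed a check: {feedback}"
        output = model(full_prompt)
        feedback = check(output)
        if feedback is None:
            return StageResult(output, attempt, True)
    # Surface the failure explicitly instead of passing it downstream.
    return StageResult(output, max_attempts, False)


def is_integer(output: str) -> Optional[str]:
    """Example check: the stage output must be a plain integer."""
    text = output.strip().lstrip("-")
    return None if text.isdigit() else "output is not an integer"


def make_flaky_model() -> Model:
    """Stand-in for an LLM: wrong format first, correct format second."""
    replies = iter(["forty-two", "42"])
    return lambda prompt: next(replies)


result = run_stage(make_flaky_model(), "Compute 6 * 7.", is_integer)
print(result)
```

In this sketch the first reply fails the format check, so the harness retries with the failure message and the second reply passes. The point of the design is that the accept or reject decision is made by code outside the model, so a stage that never passes is reported as failed rather than silently propagated.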
