TL;DR
This paper introduces RePro, a post-training method that treats chain-of-thought as an optimization process and uses a process-level reward during reinforcement learning to improve reasoning quality and reduce overthinking.
Contribution
The paper proposes an optimization-based perspective on LLM reasoning and introduces RePro, a process-level reward mechanism integrated into reinforcement learning with verifiable rewards (RLVR) to improve reasoning performance.
Findings
RePro improves reasoning accuracy across multiple benchmarks.
RePro reduces overthinking and excessively long reasoning chains.
RePro consistently outperforms baseline methods in diverse tasks.
Abstract
Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its…
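To make the optimization analogy in the abstract concrete, here is a minimal sketch of plain gradient descent on a toy quadratic, where each update stands in for one reasoning step. It is an illustration of the framing only, not RePro's surrogate objective or scoring mechanism, which the truncated abstract does not spell out. The names `objective`, `gradient`, `descend`, and `useful_steps`, the target value 3.0, the learning rate, and the tolerance are all assumptions chosen for the example.

```python
def objective(x):
    # Toy stand-in for "distance from a solved problem" (hypothetical).
    return (x - 3.0) ** 2


def gradient(x):
    # Analytic gradient of the toy objective.
    return 2.0 * (x - 3.0)


def descend(x0, lr=0.25, steps=20):
    # Each loop iteration plays the role of one reasoning step:
    # an update that should move the state toward the solution.
    x = x0
    values = [objective(x)]
    for _ in range(steps):
        x -= lr * gradient(x)
        values.append(objective(x))
    return values


def useful_steps(values, tol=1e-6):
    # Count the steps taken before the per-step improvement drops below tol.
    # Steps after that point change almost nothing, which loosely mirrors
    # an overlong reasoning chain.
    for i in range(1, len(values)):
        if values[i - 1] - values[i] < tol:
            return i - 1
    return len(values) - 1


trajectory = descend(x0=0.0)
productive = useful_steps(trajectory)
print(f"{productive} of {len(trajectory) - 1} steps made meaningful progress")
```

In this toy setting the later updates barely change the objective, so a trajectory can be judged both by how much progress it makes and by how many steps it spends making none. That is the intuition behind reading overthinking as wasted optimization steps.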
