REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning
Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Jun Rao, Min Zhang

TL;DR
This paper introduces REA-RL, a reflection-aware online reinforcement learning method that makes reasoning models more efficient by shortening responses without suppressing reflection, reducing inference costs by 36% while maintaining performance.
Contribution
The paper proposes a small reflection model and a reflection reward that strengthen online RL for reasoning, improving efficiency without performance loss.
Findings
Reduces inference costs by 36%
Maintains or improves reasoning performance
Balances reflection frequency across problem difficulties
Abstract
Large Reasoning Models (LRMs) demonstrate strong performance in complex tasks but often face the challenge of overthinking, leading to substantially high inference costs. Existing approaches synthesize shorter reasoning responses for LRMs to learn, but are inefficient for online usage due to the time-consuming data generation and filtering processes. Meanwhile, online reinforcement learning mainly adopts a length reward to encourage short reasoning responses, but it tends to lose reflection ability and harm performance. To address these issues, we propose REA-RL, which introduces a small reflection model for efficient scaling in online training, offering both parallel sampling and sequential revision. Besides, a reflection reward is designed to further prevent LRMs from favoring short yet non-reflective responses. Experiments show that both methods maintain or enhance performance while…
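The interplay between a length reward and a reflection reward described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the authors' implementation: the marker list `REFLECTION_MARKERS`, the linear length reward, the thresholded form of the reflection penalty, and the parameters `max_tokens` and `threshold` are all hypothetical, and whitespace-separated words stand in for tokens.

```python
import re

# Hypothetical reflection markers; the list actually used in the paper is not given here.
REFLECTION_MARKERS = {"wait", "alternatively", "verify", "recheck", "hmm"}


def reflection_density(response: str) -> float:
    """Fraction of alphabetic words in the response that are reflection markers."""
    words = re.findall(r"[a-z']+", response.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in REFLECTION_MARKERS)
    return hits / len(words)


def length_reward(num_tokens: int, max_tokens: int) -> float:
    """Assumed linear length reward: 1.0 for an empty response, 0.0 at max_tokens or more."""
    return max(0.0, 1.0 - num_tokens / max_tokens)


def reflection_reward(density: float, threshold: float) -> float:
    """Assumed penalty: 0.0 at or above the threshold, falling to -1.0 at zero density."""
    if density >= threshold:
        return 0.0
    return density / threshold - 1.0


def total_reward(response: str, is_correct: bool,
                 max_tokens: int = 100, threshold: float = 0.05) -> float:
    """Sum of accuracy, length, and reflection terms (equal weights are an assumption)."""
    accuracy = 1.0 if is_correct else 0.0
    num_tokens = len(response.split())  # whitespace words as a stand-in for tokens
    return (accuracy
            + length_reward(num_tokens, max_tokens)
            + reflection_reward(reflection_density(response), threshold))


short_answer = "The answer is 4."
reflective_answer = "2+2 is 4. Wait, verify: 2+2=4. The answer is 4."
print(total_reward(short_answer, True), total_reward(reflective_answer, True))
```

With these illustrative settings, a correct answer with no reflection markers takes the full reflection penalty, so it scores lower than a slightly longer correct answer that includes a verification step. A length reward alone would rank the two the other way around.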
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper targets an important and practical issue—over-reflection in large reasoning models. 2. The proposed REA-RL framework is conceptually clear. 3. The writing is easy to follow. 4. The experiments demonstrate the effectiveness of the proposed methods.
1. The paper evaluates only on the MATH domain, and does not include experiments in other reasoning domains (e.g., code, general QA, agentic tasks), which limits the demonstrated generality of the approach. 2. All experiments are conducted solely on R1-Qwen-7B; including additional model families and scales would strengthen the empirical evidence and show broader applicability. 3. The method introduces extra computational overhead, requiring double rollouts and an additional reflection-model
The work is original in how it operationalizes reflection within online RL: rather than only rewarding short outputs, it explicitly detects and trims overthinking in-situ and then optimizes on both original and revised trajectories. This integration of parallel sampling with sequential revision is conceptually clean, computationally practical, and linked to an interpretable partial-advantage view that clarifies why overthinking tokens receive targeted penalties while preserving valid reasoning.
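The trim-then-optimize idea in this review can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's method: a plain substring check stands in for the trained reflection model, the response is assumed to be pre-split into steps, and the group-normalized advantage is an assumed GRPO-style choice. The names `trim_after_answer` and `group_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def trim_after_answer(steps: list[str], answer: str) -> list[str]:
    """Keep steps up to and including the first one that mentions the answer.

    A substring match is a crude stand-in for a learned overthinking detector.
    If the answer never appears, the steps are returned unchanged.
    """
    for i, step in enumerate(steps):
        if answer in step:
            return steps[: i + 1]
    return steps


def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one group: (r - mean) / std, or zeros if std is 0."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


original = ["Compute 2+2.", "So the answer is 4.", "Wait, let me recheck.", "Yes, 4."]
revised = trim_after_answer(original, "4")

# Both trajectories go into the same group; the reward values are made up for illustration.
rewards = [1.0, 2.0]  # [original, revised]
print(revised, group_advantages(rewards))
```

Because the original and revised trajectories share a prefix and differ only in the trimmed tail, scoring them in one group concentrates the negative advantage on the removed tokens. The substring check also shows the risk noted in the following review: an answer mentioned speculatively early on would cut off later verification steps.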
The reliance on answer-presence detection as the stopping criterion risks truncating useful verification steps when the answer is mentioned early in a speculative way, and while the authors mitigate this with a trained reflection model, there is limited quantitative reporting on its detection precision/recall beyond downstream accuracy and token ratios. The study focuses on math word problems with a single distilled 7B base; it remains unclear whether the approach scales to other reasoning dom
1. The paper presents an original and well-motivated idea of integrating reflection awareness into reinforcement learning to mitigate overthinking in reasoning models. 2. The approach is clearly explained, with intuitive motivation, and supporting experiments and ablations. 3. Experimental results are consistent and persuasive, demonstrating significant reductions in reasoning length while maintaining accuracy.
1. The revision model design seems questionable. The paper reports that using the revision model alone achieves even better results than revision model + gold answer, which is counter-intuitive. Since matching the correct answer is not a difficult task, this suggests that the revision mechanism may not be well aligned with correctness or may overfit to surface reflection patterns. Clarifying why this happens would strengthen the paper. 2. The reflection reward based on reflection-token density m
Taxonomy
Topics: Data Stream Mining Techniques · Reinforcement Learning in Robotics
