ADORA: Training Reasoning Models with Dynamic Advantage Estimation on Reinforcement Learning
Qingnan Ren, Shiting Huang, Zhen Fang, Zehui Chen, Lin Chen, Lijun Li, Feng Zhao

TL;DR
ADORA introduces a dynamic advantage estimation framework that adaptively prioritizes training samples based on their utility, leading to more efficient policy updates and improved reasoning performance in reinforcement learning models.
Contribution
This paper presents ADORA, a novel method for dynamically adjusting advantage estimation during training, enhancing policy optimization without architectural changes.
Findings
Significantly improves reasoning accuracy in geometric and mathematical tasks.
Achieves faster convergence and more stable learning.
Effective across diverse model architectures and data scales.
Abstract
Reinforcement learning has become a cornerstone technique for developing reasoning models in complex tasks, ranging from mathematical problem-solving to imaginary reasoning. The optimization of these models typically relies on policy gradient methods, whose efficacy hinges on the accurate estimation of an advantage function. However, prevailing methods typically employ static advantage estimation, a practice that leads to inefficient credit assignment by neglecting the dynamic utility of training samples over time. This limitation results in suboptimal policy updates, which in turn manifest as slower convergence rates and increased learning instability, as models fail to adapt to evolving sample utilities effectively. To address this problem, we introduce ADORA (Advantage Dynamics via Online Rollout Adaptation), a novel framework for…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Originality: The dynamic reweighting of advantages based on online rollout statistics is a practical contribution. Quality: Strong experimental validation across both VLMs (MathVista: 73.5%) and LLMs (3.5% average improvement over GRPO). Clarity: Well-structured presentation with clear motivation; the figures effectively illustrate training dynamics and sample evolution. Significance: Addresses a real problem in RL-based reasoning training.
1. Limited experiment scope: * Only tested on Qwen family models. Generalization to other model families (Gemma, Llama, Phi, etc.) is unknown. * VLM experiments use only 2K samples; it is unclear whether the benefits persist at larger scales. 2. Incomplete ablations: * No ablation on weight values (see Questions #1 and #2). * Figure 2 shows ablations on advantage criteria but only for LLMs; VLM ablations are missing. * The threshold τ=0.5 for difficulty appears arbitrary, with no ablation on this critical hyperparameter. *
Strengths: 1. ADORA categorizes the training data into temporarily advantageous and disadvantageous samples and adaptively assigns sample-wise weights based on predefined criteria to estimate the final advantage. 2. Validation and ablations are conducted across different domains and datasets, notably on both LLMs and VLMs. Further, ADORA achieves a consistent performance gain to some extent.
Weaknesses: 1. The paper only considers length and success rate to measure utility. It does not consider more complex evaluations such as step consistency. I believe there is an opportunity to define the criteria more comprehensively. 2. The authors ask, "How to assign a corresponding weight $w_s$ that reflects its training utility?" However, I don't see an adequate answer to this question. The rationale behind choosing the specific values is not discussed. I would consider them as hyp
The paper investigates the important topic of sample efficiency for LLMs on hard reasoning tasks. The proposed heuristics show promising results in improving the sample efficiency of GRPO without any significant additional computational cost. The benchmark results are supported by further qualitative analysis.
The method introduces several key hyperparameters ($\tau=0.5$, $w_s=0.1$, $w_s=2.0$) that are presented without any justification or sensitivity analysis. These values are likely to heavily influence performance and would almost certainly require re-tuning for new models or tasks, thus undermining the paper's central claim of being a general and lightweight approach. This weakness is compounded by the fact that all experiments are limited to a single model family (Qwen). The heuristics are calcu
- Feasible approach to dynamic weighting: The paper introduces a conceptually clean yet empirically effective way to dynamically calibrate the advantage function during reinforcement learning. It targets a key problem with static estimation. - Strong empirical validation: The experiments are relatively extensive, covering both LLMs and VLMs. They show clear and consistent performance improvements even with limited data. This gives confidence in the robustness of ADORA’s approach. However,
- Generalization not deeply explored: Although ADORA performs well on the tested benchmarks, the discussion of transferability and performance on out-of-distribution or unseen domains feels somewhat limited. Also, the weighting is based on heuristic rules and lacks theoretical insight. - Dependence on rollout quality: The authors themselves note that ADORA’s success depends on the quality of generated rollouts (Appendix D). However, the paper does not clearly propose methods to mitigate l
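To make the mechanism discussed in the reviews concrete, the following is a minimal Python sketch of sample-wise advantage reweighting on top of a GRPO-style group-normalized advantage. It is not the authors' implementation. The reviews only state that a difficulty threshold $\tau=0.5$ and weights $w_s=0.1$ and $w_s=2.0$ exist and that utility is judged from length and success rate; the specific rule below (upweight a prompt whose rollout success rate is at or below $\tau$, downweight it otherwise) and all function names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style static advantage: standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sample_weight(rewards, tau=0.5, w_low=0.1, w_high=2.0):
    """HYPOTHETICAL rule, not taken from the paper.

    Treats a prompt as 'temporarily advantageous' when the fraction of
    successful rollouts (reward > 0) is at or below tau, and returns
    w_high for it; every other prompt gets w_low.
    """
    success_rate = mean(1.0 if r > 0 else 0.0 for r in rewards)
    return w_high if success_rate <= tau else w_low


def weighted_advantages(rewards, **weight_kwargs):
    """Scale every advantage in the group by the prompt-level weight."""
    w = sample_weight(rewards, **weight_kwargs)
    return [w * a for a in group_advantages(rewards)]


if __name__ == "__main__":
    hard_prompt = [1.0, 0.0, 0.0, 0.0]  # 25% success, upweighted under the assumed rule
    easy_prompt = [1.0, 1.0, 1.0, 0.0]  # 75% success, downweighted under the assumed rule
    print(weighted_advantages(hard_prompt))
    print(weighted_advantages(easy_prompt))
```

Because the weight multiplies the whole group, the advantages still sum to zero within a prompt; only the magnitude of that prompt's contribution to the policy gradient changes. This also shows why the reviewers ask for sensitivity analysis: the outcome depends directly on `tau`, `w_low`, and `w_high`.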
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning
