TL;DR
ReBel is a reinforcement learning algorithm that models structured belief states to improve long-horizon decision-making in partially observable environments, enhancing success rates and sample efficiency.
Contribution
ReBel introduces belief-consistency supervision and belief-aware grouping, enabling better credit assignment without external annotations in long-horizon RL tasks.
Findings
ReBel improves task success by up to 20.4 percentage points.
ReBel improves sample efficiency by a factor of 2.1.
ReBel outperforms episode-level baselines on challenging benchmarks.
Abstract
Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories…
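The belief-consistency idea described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the representation of a belief as a slot-to-value dictionary, the exact-match scoring rule, and the names `belief_consistency_reward` and `dense_rewards` are all assumptions made for illustration. The sketch only shows how a disagreement between a predicted belief and later observed feedback could be turned into a per-step signal without an external annotator.

```python
from typing import Dict, List

# Hypothetical belief representation: slot name -> believed value.
Belief = Dict[str, str]


def belief_consistency_reward(predicted: Belief, observed: Belief) -> float:
    """Fraction of observed slots on which the predicted belief agrees.

    Assumed scoring rule (not from the paper): exact match per slot,
    and 0.0 when the step yields no observed feedback to check against.
    """
    if not observed:
        return 0.0
    matches = sum(1 for slot, value in observed.items()
                  if predicted.get(slot) == value)
    return matches / len(observed)


def dense_rewards(predicted_beliefs: List[Belief],
                  observations: List[Belief]) -> List[float]:
    """One self-supervised reward per step of a trajectory."""
    if len(predicted_beliefs) != len(observations):
        raise ValueError("need one observation per predicted belief")
    return [belief_consistency_reward(p, o)
            for p, o in zip(predicted_beliefs, observations)]


if __name__ == "__main__":
    # Toy two-step trajectory in a hypothetical household task.
    predicted = [
        {"key_location": "drawer", "door": "locked"},
        {"key_location": "drawer", "door": "locked"},
    ]
    observed = [
        {"door": "locked"},                            # belief confirmed
        {"key_location": "table", "door": "locked"},   # key belief was wrong
    ]
    print(dense_rewards(predicted, observed))  # [1.0, 0.5]
```

In this toy run the second step earns a lower reward because the agent's belief about the key drifted from what the environment then revealed, which is the kind of intermediate signal a delayed episode-level reward alone would not provide.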
