FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning
Xikai Zhang, Yongzhi Li, Likang Xiao, Yingze Zhang, Yanhua Cheng, Quan Chen, Peng Jiang, Wenjun Wu, Liu Liu

TL;DR
FBOS-RL introduces a feedback-driven bi-objective reinforcement learning framework that enhances exploration and exploitation, leading to faster training and higher performance ceilings in large-scale models.
Contribution
The paper proposes a novel feedback-guided exploration method with two synergistic objectives, improving reinforcement learning efficiency and outcomes.
Findings
FBOS-RL learns faster than GRPO and feedback-based baselines.
It attains a higher performance ceiling.
It maintains higher policy entropy and lower gradient norms during training.
Abstract
Reinforcement learning has become a cornerstone for aligning and unlocking the reasoning capabilities of large-scale models. At its core, the training loop of GRPO and its variants alternates between rollout sampling and policy update. Unlike supervised learning, where each gradient step is anchored to an explicit ground-truth target, the optimal gradient direction for updating model parameters in this setting is not known a priori; the high-quality rollouts drawn during the sampling stage therefore act as the implicit "teacher" that guides every parameter update. However, GRPO adopts a simple sampling scheme that conditions all rollouts on the same original prompt. When a task lies beyond the policy model's current capability, this sampling scheme rarely yields a high-quality rollout, leaving the policy model without a meaningful gradient direction when updating its parameters, which…
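The sampling problem described in the abstract can be illustrated with a small sketch. It assumes the commonly used group-relative advantage, where each rollout's reward is standardized against the other rollouts sampled for the same prompt, and it assumes binary rewards. The function name `group_relative_advantages` and the `eps` constant are illustrative choices, not taken from the paper, and this is not the FBOS-RL method itself.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), the usual
    group-relative form; eps only guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Task within the policy's reach: some rollouts succeed, some fail.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Task beyond the policy's current capability: every rollout fails.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print("mixed group:   ", mixed)     # positive for successes, negative for failures
print("all-fail group:", all_fail)  # every advantage is zero
```

Under these assumptions, a group in which every rollout receives the same reward produces all-zero advantages, so that prompt contributes no learning signal to the policy update. This is one way to read the abstract's remark that the policy is left "without a meaningful gradient direction" when no high-quality rollout is sampled.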
