FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

Xikai Zhang; Yongzhi Li; Likang Xiao; Yingze Zhang; Yanhua Cheng; Quan Chen; Peng Jiang; Wenjun Wu; Liu Liu

arXiv:2605.20256·cs.LG·May 21, 2026

TL;DR

FBOS-RL introduces a feedback-driven bi-objective reinforcement learning framework that enhances exploration and exploitation, leading to faster training and higher performance ceilings in large-scale models.

Contribution

The paper proposes a novel feedback-guided exploration method with two synergistic objectives, improving reinforcement learning efficiency and outcomes.

Findings

01 FBOS-RL learns faster than GRPO and feedback-based baselines.

02 It attains a higher performance ceiling.

03 It maintains higher policy entropy and lower gradient norms during training.

Abstract

Reinforcement learning has become a cornerstone for aligning and unlocking the reasoning capabilities of large-scale models. At its core, the training loop of GRPO and its variants alternates between rollout sampling and policy update. Unlike supervised learning, where each gradient step is anchored to an explicit ground-truth target, the optimal gradient direction for updating model parameters in this setting is not known a priori; the high-quality rollouts drawn during the sampling stage therefore act as the implicit "teacher" that guides every parameter update. However, GRPO adopts a simple sampling scheme that conditions all rollouts on the same original prompt. When a task lies beyond the policy model's current capability, this sampling scheme rarely yields a high-quality rollout, leaving the policy model without a meaningful gradient direction when updating its parameters, which…
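
The failure mode described in the abstract can be illustrated with the group-relative advantage that GRPO is commonly described as using: each rollout's reward is normalized against the mean and standard deviation of its own group. The sketch below is only an illustration of that baseline behavior, not the FBOS-RL method; the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the 0/1 rewards are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style sketch)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; real implementations may differ
    # eps (an assumed constant) avoids division by zero for uniform groups
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts where one solves the task:
# the successful rollout gets a positive advantage, the others negative.
mixed = group_relative_advantages([0.0, 0.0, 1.0, 0.0])

# Hypothetical group for a task beyond the policy's capability:
# every rollout fails, so every advantage is zero.
failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In the second group no rollout stands out from the others, so the advantages carry no signal for the policy update. This is the situation the abstract refers to when it says the policy is left without a meaningful gradient direction.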
