Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards

Kirill Pavlenko; Alexander Golubev; Simon Karasik; Boris Yangel

arXiv:2602.10231·cs.LG·February 12, 2026

TL;DR

This paper introduces Blockwise Advantage Estimation, a method for multi-objective reinforcement learning that gives each objective its own advantage and applies it only to the tokens of the corresponding text block, reducing interference between objectives in structured generation tasks.

Contribution

The paper proposes a GRPO-compatible advantage estimation technique that computes a separate advantage for each objective and applies it only to that objective's text block, and it introduces an Outcome-Conditioned Baseline for estimating advantages of later blocks without nested rollouts.

Findings

01  Mitigates reward interference in structured generation tasks

02  Achieves competitive performance with reward-designed approaches

03  Preserves test-time gains from confidence-weighted ensembling

Abstract

Group Relative Policy Optimization (GRPO) assigns a single scalar advantage to all tokens in a completion. For structured generations with explicit segments and objectives, this couples unrelated reward signals across segments, leading to objective interference and misattributed credit. We propose Blockwise Advantage Estimation, a family of GRPO-compatible methods that assigns each objective its own advantage and applies it only to the tokens in the corresponding text block, reducing reliance on hand-designed scalar rewards and scaling naturally to additional objectives. A key challenge is estimating advantages for later blocks whose rewards are conditioned on sampled prefixes; standard unbiased approaches require expensive nested rollouts from intermediate states. Concretely, we introduce an Outcome-Conditioned Baseline that approximates intermediate state values using only…
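
To make the contrast in the abstract concrete, the following is a minimal sketch of how a single GRPO-style scalar advantage differs from per-block advantages. It is an illustration under stated assumptions, not the authors' implementation: the function names (group_relative, grpo_token_advantages, blockwise_token_advantages), the data layout, and the use of a plain sum to form the scalar reward are all hypothetical. In particular, the sketch normalizes every block's reward against the whole sampled group, which ignores the prefix conditioning that the abstract identifies as the key challenge for later blocks; it does not implement the Outcome-Conditioned Baseline.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_token_advantages(scalar_rewards, lengths):
    """Baseline GRPO: one scalar advantage broadcast to every token."""
    adv = group_relative(scalar_rewards)
    return [[a] * n for a, n in zip(adv, lengths)]


def blockwise_token_advantages(block_rewards, block_ids):
    """Per-block advantages (hypothetical layout, assumed for illustration).

    block_rewards[k][i]: reward of objective k for completion i.
    block_ids[i][t]:     index k of the block that token t of completion i
                         belongs to.
    """
    per_block = [group_relative(r) for r in block_rewards]
    return [[per_block[k][i] for k in ids] for i, ids in enumerate(block_ids)]


# Two completions, each with two tokens in block 0 and two tokens in block 1.
block_ids = [[0, 0, 1, 1], [0, 0, 1, 1]]
# Completion 0 succeeds on objective 0 only; completion 1 on objective 1 only.
block_rewards = [[1.0, 0.0], [0.0, 1.0]]

# Assumed scalarization: sum the objectives into one reward per completion.
summed = [sum(r[i] for r in block_rewards) for i in range(2)]
scalar_adv = grpo_token_advantages(summed, [len(ids) for ids in block_ids])
block_adv = blockwise_token_advantages(block_rewards, block_ids)
```

In this toy group the summed rewards are identical, so the scalar advantages are all zero and neither completion receives a learning signal, even though each one got a different block right. The blockwise version gives completion 0 a positive advantage on its block-0 tokens and a negative one on its block-1 tokens, and the reverse for completion 1.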

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Artificial Intelligence in Games