RVPO: Risk-Sensitive Alignment via Variance Regularization

Ivan Montero; Tomasz Jurczyk; Bhuwan Dhingra

arXiv:2605.05750·cs.LG·May 8, 2026

TL;DR

RVPO adds variance regularization to critic-less RLHF: it penalizes variance across reward signals so that a high score on one objective cannot mask failures on others (constraint neglect) during multi-objective alignment of large language models.

Contribution

The paper proposes Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation to improve multi-objective alignment of large language models.

Findings

01. RVPO improves scores on HealthBench compared to GDPO.

02. RVPO maintains competitive accuracy on GPQA-Diamond.

03. Variance regularization mitigates constraint neglect across model scales.

Abstract

Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing "bottleneck" rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, shifting the objective from "maximize sum" to "maximize consistency." We show via Taylor expansion that a LogSumExp (SoftMin) operator effectively acts as a smooth variance penalty. We evaluate RVPO on rubric-based medical and scientific reasoning with up to 17 concurrent LLM-judged reward signals (Qwen2.5-3B/7B/14B) and on tool-calling with rule-based constraints (Qwen2.5-1.5B/3B). By preventing…
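
The link between a SoftMin operator and a variance penalty can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes the normalized form SoftMin_beta(r) = -(1/beta) * log(mean(exp(-beta * r_k))), whose expansion for small beta is mean(r) - (beta/2) * Var(r) + O(beta^2). The function names, the temperature values, and the example reward vectors are hypothetical, and the abstract does not say where in the advantage pipeline RVPO applies the operator.

```python
import numpy as np


def softmin_aggregate(rewards, beta):
    """Smooth minimum of a reward vector (assumed normalized LogSumExp form).

    Computes -(1/beta) * log(mean(exp(-beta * r))) with a max-shift
    so the exponentials do not overflow.
    """
    r = np.asarray(rewards, dtype=float)
    z = -beta * r
    z_max = z.max()
    log_mean_exp = z_max + np.log(np.mean(np.exp(z - z_max)))
    return -log_mean_exp / beta


def mean_minus_variance(rewards, beta):
    """Second-order approximation: mean(r) - (beta / 2) * Var(r)."""
    r = np.asarray(rewards, dtype=float)
    return r.mean() - 0.5 * beta * r.var()  # population variance


# Two hypothetical reward vectors with the same arithmetic mean (0.6).
balanced = [0.6, 0.6, 0.6]   # every objective is met moderately well
lopsided = [1.0, 1.0, -0.2]  # two strong scores offset one failure

beta = 0.1  # small temperature: the variance approximation should be tight
for name, r in [("balanced", balanced), ("lopsided", lopsided)]:
    print(
        name,
        "mean:", float(np.mean(r)),
        "softmin:", softmin_aggregate(r, beta),
        "mean - (beta/2)*var:", mean_minus_variance(r, beta),
    )
```

Under these assumptions the arithmetic mean cannot tell the two vectors apart, while the SoftMin score is lower for the lopsided one, which is the constraint-neglect case the abstract describes. As beta grows, the operator moves from the mean toward the hard minimum, so beta acts as a risk-sensitivity knob in this sketch.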
