RVPO: Risk-Sensitive Alignment via Variance Regularization
Ivan Montero, Tomasz Jurczyk, Bhuwan Dhingra

TL;DR
RVPO adds a variance penalty to multi-objective reward aggregation in critic-less RLHF. By rewarding consistency across objectives rather than only their average, it targets constraint neglect, where a high score on one objective masks a failure on another.
Contribution
The paper proposes Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation to improve multi-objective alignment in large language models.
Findings
RVPO improves scores on HealthBench compared to GDPO.
RVPO maintains competitive accuracy on GPQA-Diamond.
Variance regularization mitigates constraint neglect across model scales.
Abstract
Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing "bottleneck" rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, shifting the objective from "maximize sum" to "maximize consistency." We show via Taylor expansion that a LogSumExp (SoftMin) operator effectively acts as a smooth variance penalty. We evaluate RVPO on rubric-based medical and scientific reasoning with up to 17 concurrent LLM-judged reward signals (Qwen2.5-3B/7B/14B) and on tool-calling with rule-based constraints (Qwen2.5-1.5B/3B). By preventing…
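The link between SoftMin and a variance penalty stated in the abstract can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes a normalized form SoftMin_beta(r) = -(1/beta) * log((1/K) * sum_k exp(-beta * r_k)), whose second-order Taylor expansion in beta is mean(r) - (beta/2) * Var(r). The function names, the temperature `beta`, and the reward values are hypothetical, and the paper's exact operator and its placement in advantage aggregation may differ.

```python
import math
from statistics import fmean, pvariance


def soft_min(rewards, beta):
    """Normalized LogSumExp smooth minimum (assumed form, not the paper's exact operator).

    Computes -(1/beta) * log(mean(exp(-beta * r))). It approaches the mean as
    beta -> 0 and the minimum as beta -> infinity.
    """
    m = min(rewards)  # shift by the minimum for numerical stability
    s = sum(math.exp(-beta * (r - m)) for r in rewards) / len(rewards)
    return m - math.log(s) / beta


def mean_minus_variance(rewards, beta):
    """Second-order Taylor approximation of soft_min: mean - (beta / 2) * variance."""
    return fmean(rewards) - 0.5 * beta * pvariance(rewards)


# Two hypothetical reward vectors with the same arithmetic mean.
balanced = [0.6, 0.6, 0.6]   # every objective moderately satisfied
lopsided = [1.0, 1.0, -0.2]  # two objectives maxed out, one failing

for name, r in [("balanced", balanced), ("lopsided", lopsided)]:
    print(name, "mean:", fmean(r),
          "soft_min:", soft_min(r, beta=0.5),
          "approx:", mean_minus_variance(r, beta=0.5))
```

Under an arithmetic mean the two vectors are indistinguishable, while the smooth minimum scores the lopsided one lower, which is the behavior the abstract describes as shifting from "maximize sum" to "maximize consistency." The approximation tightens as `beta` shrinks, because the neglected terms are of order beta squared.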
