Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning
Max Horwitz, Jake Gonzales, Eric Mazumdar, Lillian J. Ratliff

TL;DR
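This paper introduces risk-sensitive preference games for robust policy learning in large language models. Players optimize convex risk measures of their preference loss rather than expected pairwise payoffs, which targets tail behavior and systematic failures that average performance can hide.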
Contribution
The paper develops a risk-sensitive game framework that retains monotonicity, and with it fast convergence, even though risk sensitivity generally breaks the zero-sum structure. It also provides algorithms with stability and sample-complexity guarantees.
Findings
Risk-adjusted policies improve robustness across data strata.
The hierarchical game and bias correction method converge to the Stackelberg equilibrium.
Empirical results show risk-sensitive policies outperform risk-neutral ones in stability and robustness.
Abstract
A growing line of work reframes preference-based fine-tuning of large language models game-theoretically: Nash Learning from Human Feedback (NLHF) recasts the problem as a zero-sum game over policies. However, optimization is over expected pairwise payoffs, thereby conflating policies with similar win rates but different tail behavior. As such, these methods are agnostic to where in the data distribution they succeed or fail: strong average performance can mask systematic failure across prompts, annotators, or safety-critical strata. We introduce risk-sensitive preference games, in which players optimize convex risk measures of their preference loss, exploiting structure in preference uncertainty. While risk-sensitivity generally breaks the zero-sum structure, we show that translation invariance of many risk metrics ensures that we retain monotonicity, yielding fast convergence of…
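The point about translation invariance and tail behavior can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not say which risk measures are used, so the example assumes two standard convex, translation-invariant ones (entropic risk and CVaR). The function names, parameter values, and per-stratum loss numbers are all hypothetical.

```python
import math

def mean(losses):
    """Risk-neutral objective: the plain average loss."""
    return sum(losses) / len(losses)

def entropic_risk(losses, beta):
    """Entropic risk (1/beta) * log E[exp(beta * loss)] over the empirical distribution."""
    m = max(losses)  # shift by the max for numerical stability (log-sum-exp trick)
    mean_exp = sum(math.exp(beta * (x - m)) for x in losses) / len(losses)
    return m + math.log(mean_exp) / beta

def cvar(losses, alpha):
    """Empirical CVaR: average of the worst ceil((1 - alpha) * n) losses."""
    k = max(1, math.ceil((1.0 - alpha) * len(losses)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k

# Hypothetical preference losses on four data strata (assumed numbers).
# Both policies have the same average loss, but one fails badly on one stratum.
uniform_policy = [0.5, 0.5, 0.5, 0.5]
tail_heavy_policy = [0.3, 0.3, 0.4, 1.0]

print("mean:    ", mean(uniform_policy), mean(tail_heavy_policy))
print("CVaR:    ", cvar(uniform_policy, 0.75), cvar(tail_heavy_policy, 0.75))
print("entropic:", entropic_risk(uniform_policy, 5.0), entropic_risk(tail_heavy_policy, 5.0))

# Translation invariance: shifting every loss by a constant c shifts the risk by c.
c = 0.2
shifted = [x + c for x in tail_heavy_policy]
print("shift gap:", entropic_risk(shifted, 5.0) - entropic_risk(tail_heavy_policy, 5.0))
```

Under these assumed numbers the two policies are indistinguishable by their mean loss, while both risk measures penalize the policy that fails on a single stratum. The last line checks the translation-invariance property the abstract refers to, namely that adding a constant to the loss adds the same constant to the risk.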
