Generalized Distributional Alignment Games for Unbiased Answer-Level Fine-Tuning
Mehryar Mohri, Jon Schneider, Yutao Zhong

TL;DR
This paper introduces a bias-corrected, variance-reduced framework for Answer-Level Fine-Tuning (ALFT) based on generalized distributional alignment games, improving training stability and efficiency.
Contribution
It generalizes the alignment game to Bregman divergences, develops unbiased estimators, and proposes a variance-reduction technique for faster, more stable training.
Findings
Unbiased estimators for polynomial rewards using U-statistics.
Optimal minimax polynomial estimator for the KL divergence game.
Proposed AQP estimator accelerates convergence and reduces variance.
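As a rough illustration of the first finding, the sketch below shows why a U-statistic gives an exactly unbiased estimate of a polynomial quantity while the naive plug-in does not. It assumes a toy setting that is not taken from the paper: each sample is a Bernoulli(p) indicator (for example, whether a sampled answer matches a target), and the polynomial of interest is p**k. The names u_stat_power, plug_in_power, and exact_expectation are hypothetical helpers.

```python
from fractions import Fraction
from math import comb


def u_stat_power(successes: int, n: int, k: int) -> Fraction:
    """U-statistic estimate of p**k from `successes` hits in n Bernoulli(p) draws.

    Averages the kernel x_1 * ... * x_k over all size-k subsets of the batch,
    which reduces to C(successes, k) / C(n, k).
    """
    if k > n:
        raise ValueError("need k <= n for the U-statistic to exist")
    return Fraction(comb(successes, k), comb(n, k))


def plug_in_power(successes: int, n: int, k: int) -> Fraction:
    """Naive estimate: raise the empirical frequency to the k-th power."""
    return Fraction(successes, n) ** k


def exact_expectation(estimator, n: int, k: int, p: Fraction) -> Fraction:
    """Exact expectation of an estimator by summing over the Binomial(n, p) law."""
    return sum(
        comb(n, s) * p**s * (1 - p) ** (n - s) * estimator(s, n, k)
        for s in range(n + 1)
    )


# Toy parameters (assumed for illustration only).
p = Fraction(3, 10)
n, k = 8, 3

u_mean = exact_expectation(u_stat_power, n, k, p)
plug_mean = exact_expectation(plug_in_power, n, k, p)

print("true p**k        :", p**k)
print("E[U-statistic]   :", u_mean)     # matches p**k exactly
print("E[plug-in]       :", plug_mean)  # strictly larger, since x**k is convex
```

Exact rational arithmetic is used so that the unbiasedness can be checked as an equality rather than up to Monte Carlo noise.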
Abstract
The Distributional Alignment Game framework provides a powerful variational perspective on Answer-Level Fine-Tuning (ALFT). However, standard algorithms for these games rely on estimating logarithmic rewards from small batches, introducing a systematic bias due to Jensen's inequality that can destabilize training. In this paper, we systematically resolve this structural estimation bias. First, we generalize the alignment game to arbitrary Bregman divergences, showing that for a family of geometries inducing polynomial rewards, we can construct provably exact and unbiased estimators using U-statistics. Second, for the canonical KL divergence game where an exact solution is impossible, we derive a globally robust minimax polynomial estimator that is provably optimal, achieving the fundamental statistical error limit of , which we establish via the Ditzian-Totik theorem.…
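The Jensen bias that the abstract describes can be seen in a minimal sketch. It assumes a toy model that is not from the paper: each draw is a positive quantity taking one of two values, and the target is the log of its mean. Because log is concave, the log of a small-batch average underestimates that target in expectation, and the gap shrinks as the batch grows. The function expected_log_batch_mean and its parameters lo, hi, and q are hypothetical.

```python
import math
from math import comb


def expected_log_batch_mean(n: int, lo: float = 0.1, hi: float = 0.9, q: float = 0.5) -> float:
    """Exact E[log(batch mean)] for n i.i.d. draws equal to hi w.p. q, else lo."""
    total = 0.0
    for j in range(n + 1):  # j = number of draws equal to hi
        prob = comb(n, j) * q**j * (1 - q) ** (n - j)
        batch_mean = (j * hi + (n - j) * lo) / n
        total += prob * math.log(batch_mean)
    return total


# Target quantity: log of the true mean, here log(0.5).
true_value = math.log(0.5 * 0.9 + 0.5 * 0.1)

# Bias of the plug-in log estimate at several batch sizes.
biases = {n: expected_log_batch_mean(n) - true_value for n in (2, 4, 16, 64)}

for n, b in biases.items():
    print(f"n={n:3d}  bias={b:+.4f}")  # negative at every n, shrinking toward 0
```

The bias is systematic rather than random: averaging more batches of the same size does not remove it, which is why a corrected estimator is needed.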
