Uncertainty Quantification for Large Language Model Reward Learning under Heterogeneous Human Feedback
Pangpang Liu, Junwei Lu, Will Wei Sun

TL;DR
This paper develops a statistical framework for quantifying uncertainty in reward models used in aligning large language models, addressing heterogeneity in human feedback with theoretical guarantees and practical algorithms.
Contribution
It adopts a heterogeneous preference model that jointly captures the latent reward of answers and human rationality, fits it with an alternating gradient descent algorithm, and proves convergence and the asymptotic distribution of the resulting reward estimator.
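To make the alternating scheme concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation. The likelihood is an assumed form: a Bradley-Terry model in which annotator k has a rationality parameter gamma_k that scales the reward gap, so the objective is convex in the rewards theta for fixed gamma and convex in gamma for fixed theta. The function names, step sizes, normalization (centered theta, mean gamma equal to 1), and simulated data are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_log_lik(theta, gamma, a, b, k, y):
    # Assumed model: P(annotator k prefers a over b) = sigmoid(gamma_k * (theta_a - theta_b)).
    z = gamma[k] * (theta[a] - theta[b])
    return np.mean(np.logaddexp(0.0, z) - y * z)

def alternating_gd(a, b, k, y, n_items, n_raters, steps=1000, lr=1.0):
    theta = np.zeros(n_items)   # latent rewards
    gamma = np.ones(n_raters)   # annotator rationality
    n = len(y)
    for _ in range(steps):
        # Gradient step in theta with gamma held fixed.
        g = (sigmoid(gamma[k] * (theta[a] - theta[b])) - y) * gamma[k]
        grad_theta = (np.bincount(a, weights=g, minlength=n_items)
                      - np.bincount(b, weights=g, minlength=n_items)) / n
        theta = theta - lr * grad_theta
        theta -= theta.mean()   # rewards are only identified up to a shift

        # Gradient step in gamma with theta held fixed.
        d = theta[a] - theta[b]
        resid = sigmoid(gamma[k] * d) - y
        grad_gamma = np.bincount(k, weights=resid * d, minlength=n_raters) / n
        gamma = np.maximum(gamma - lr * grad_gamma, 1e-3)  # keep rationality positive

        # Remove the scale ambiguity: (gamma / s, theta * s) gives the same likelihood.
        s = gamma.mean()
        gamma = gamma / s
        theta = theta * s
    return theta, gamma

# Hypothetical simulated comparisons: 8 answers, 4 annotators.
rng = np.random.default_rng(0)
n_items, n_raters, n_obs = 8, 4, 4000
theta_true = np.linspace(-1.5, 1.5, n_items)
gamma_true = np.array([0.5, 1.0, 1.5, 1.0])
a = rng.integers(0, n_items, n_obs)
b = (a + rng.integers(1, n_items, n_obs)) % n_items   # guarantees b != a
k = rng.integers(0, n_raters, n_obs)
p = sigmoid(gamma_true[k] * (theta_true[a] - theta_true[b]))
y = (rng.random(n_obs) < p).astype(float)             # 1.0 if a was preferred

init_loss = neg_log_lik(np.zeros(n_items), np.ones(n_raters), a, b, k, y)
theta_hat, gamma_hat = alternating_gd(a, b, k, y, n_items, n_raters)
final_loss = neg_log_lik(theta_hat, gamma_hat, a, b, k, y)
```

Each half-step is a convex problem when the other block is held fixed, which is what makes alternating updates a natural fit. The rescaling at the end of each round only pins down the scale shared by gamma and theta and leaves the likelihood unchanged.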
Findings
The method provides valid confidence intervals for reward estimates.
Uncertainty quantification improves reward comparison and policy selection.
Simulations and real data demonstrate practical effectiveness.
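As an illustration of how a confidence interval can inform reward comparison, the sketch below builds a Wald interval for the difference between two estimated rewards and only declares a winner when the interval excludes zero. It assumes the estimates are approximately normal, in line with the asymptotic distribution the paper establishes, but the variance inputs and all numbers are hypothetical placeholders rather than the paper's variance formula or results.

```python
import math
from statistics import NormalDist

def wald_interval(estimate, std_error, level=0.95):
    # Two-sided normal-approximation interval.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * std_error, estimate + z * std_error

def compare_rewards(r_a, r_b, var_a, var_b, cov_ab=0.0, level=0.95):
    # Variance of a difference: Var(a) + Var(b) - 2 Cov(a, b).
    diff = r_a - r_b
    se = math.sqrt(var_a + var_b - 2.0 * cov_ab)
    lo, hi = wald_interval(diff, se, level)
    if lo > 0:
        verdict = "a"
    elif hi < 0:
        verdict = "b"
    else:
        verdict = "undecided"   # interval contains 0: no significant difference
    return (lo, hi), verdict

# Hypothetical estimates: same point estimates, different precision.
noisy_ci, noisy_verdict = compare_rewards(0.80, 0.55, var_a=0.01, var_b=0.01)
tight_ci, tight_verdict = compare_rewards(0.80, 0.55, var_a=0.004, var_b=0.004)
```

With the larger variances the interval straddles zero and the comparison stays undecided; with the smaller ones the same point estimates favor answer a. A point estimate alone would not distinguish the two cases.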
Abstract
We study estimation and statistical inference for reward models used in aligning large language models (LLMs). A key component of LLM alignment is reinforcement learning from human feedback (RLHF), where humans compare pairs of model-generated answers and their preferences are used to train a reward model. However, human feedback is inherently heterogeneous, creating significant challenges for reliable reward learning. To address this, we adopt a heterogeneous preference framework that jointly models the latent reward of answers and human rationality. This leads to a challenging biconvex optimization problem, which we solve via an alternating gradient descent algorithm. We establish theoretical guarantees for the resulting estimator, including its convergence and asymptotic distribution. These results enable the construction of confidence intervals for reward estimates. Leveraging these…
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
