Robust Reinforcement Learning from Corrupted Human Feedback

Alexander Bukharin; Ilgee Hong; Haoming Jiang; Zichong Li; Qingru Zhang; Zixuan Zhang; Tuo Zhao

arXiv:2406.15568 · cs.LG · July 10, 2024

TL;DR

This paper introduces R^3M, a robust method for reinforcement learning from human feedback (RLHF) that models corrupted preference labels as sparse outliers, so that reward learning stays reliable when some annotations are incorrect or inconsistent.

Contribution

The paper proposes R^3M, a novel robust RLHF approach that models corrupted labels as outliers and provides theoretical guarantees for reward recovery and outlier detection.

Findings

1. R^3M improves robustness in robotic control tasks.
2. R^3M enhances natural language generation with large language models.
3. The method effectively identifies outliers in preference data.

Abstract

Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data. For various reasons (e.g., personal bias, context ambiguity, or lack of training), human annotators may give incorrect or inconsistent preference labels. To tackle this challenge, we propose a robust RLHF approach, $R^{3}M$, which models the potentially corrupted preference labels as sparse outliers. Accordingly, we formulate robust reward learning as an $\ell_{1}$-regularized maximum likelihood estimation problem. Computationally, we develop an efficient alternating optimization algorithm, which incurs only negligible computational overhead compared with the standard RLHF approach. Theoretically, we prove that under proper regularity conditions, $R^{3}M$ can consistently learn the underlying reward and identify outliers, provided that the number of…
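The abstract's recipe (a sparse per-label outlier term, an $\ell_{1}$ penalty, and alternating updates) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a linear reward on hypothetical feature differences `diff[i] = features(preferred_i) - features(rejected_i)`, a Bradley-Terry style likelihood with a per-sample shift `delta[i]`, and a proximal-gradient (soft-thresholding) update for `delta`. The function names, the step size `lr`, and the penalty weight `lam` are all illustrative choices.

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximal operator of tau * |v|: shrink each entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def fit_robust_reward(diff, lam=0.7, lr=0.5, n_iters=500):
    """Alternating minimization sketch (assumed form, not the paper's code) of

        (1/n) * sum_i [ -log sigmoid(theta @ diff[i] + delta[i]) + lam * |delta[i]| ]

    theta : weights of a hypothetical linear reward model
    delta : per-sample shifts; a nonzero entry flags a suspected corrupted label
    """
    n, d = diff.shape
    theta = np.zeros(d)
    delta = np.zeros(n)
    for _ in range(n_iters):
        # Step 1: gradient step on theta with delta held fixed.
        margin = diff @ theta + delta
        # Derivative of -log sigmoid(m) w.r.t. m is -sigmoid(-m), computed stably.
        g = -np.exp(-np.logaddexp(0.0, margin))
        theta = theta - lr * (diff.T @ g) / n

        # Step 2: proximal gradient step on each per-sample term with theta fixed.
        margin = diff @ theta + delta
        g = -np.exp(-np.logaddexp(0.0, margin))
        delta = soft_threshold(delta - lr * g, lr * lam)
    return theta, delta


def suspected_outliers(delta):
    """Indices of samples whose shift was not shrunk to zero."""
    return np.flatnonzero(delta != 0)
```

In this sketch a sample receives a nonzero `delta[i]` only when the current reward model assigns the observed label a probability below roughly `1 - lam`, so `lam` trades off how readily labels are treated as outliers. Setting `lam >= 1` keeps every `delta[i]` at zero and reduces the procedure to ordinary logistic preference fitting.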

Taxonomy

Topics: Neural Networks and Applications