Robust Reinforcement Learning from Corrupted Human Feedback
Alexander Bukharin, Ilgee Hong, Haoming Jiang, Zichong Li, Qingru Zhang, Zixuan Zhang, Tuo Zhao

TL;DR
This paper introduces R^3M, a robust reinforcement learning from human feedback method that effectively handles corrupted preference data by modeling outliers, improving reward learning robustness in AI systems.
Contribution
The paper proposes R^3M, a novel robust RLHF approach that models corrupted labels as outliers and provides theoretical guarantees for reward recovery and outlier detection.
Findings
R^3M improves robustness in robotic control tasks.
R^3M enhances natural language generation with large language models.
The method effectively identifies outliers in preference data.
Abstract
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data. For various reasons, e.g., personal bias, context ambiguity, or lack of training, human annotators may give incorrect or inconsistent preference labels. To tackle this challenge, we propose a robust RLHF approach, R^3M, which models the potentially corrupted preference labels as sparse outliers. Accordingly, we formulate robust reward learning as an ℓ1-regularized maximum likelihood estimation problem. Computationally, we develop an efficient alternating optimization algorithm, which incurs only negligible computational overhead compared with the standard RLHF approach. Theoretically, we prove that under proper regularity conditions, R^3M can consistently learn the underlying reward and identify outliers, provided that the number of…
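The idea of treating corrupted labels as sparse outliers and alternating between the reward and the outlier terms can be illustrated with a small sketch. This is not the authors' implementation: it assumes a linear reward on hand-made features, a Bradley-Terry preference model, and a per-sample shift delta_i penalized by lam * |delta_i|. The names fit_robust_reward, x_diff, lam, and delta are hypothetical. Under these assumptions the delta step has a closed form, delta_i = max(0, log((1 - lam) / lam) - margin_i), which follows from setting the derivative of the per-sample objective to zero.

```python
import numpy as np


def sigmoid(z):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def fit_robust_reward(x_diff, y, lam=0.3, lr=0.5, n_outer=300, n_inner=1):
    """Alternating minimization of an l1-penalized Bradley-Terry objective.

    Illustrative objective (an assumption of this sketch):
        mean_i [ log(1 + exp(-(y_i * w.x_i + delta_i))) + lam * |delta_i| ]

    x_diff: (n, d) array of feature differences phi(a_i) - phi(b_i).
    y:      (n,) array of labels in {+1, -1}; +1 means a_i was preferred.
    Returns the reward weights w and the per-sample outlier terms delta.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    n, d = x_diff.shape
    w = np.zeros(d)
    delta = np.zeros(n)
    # Margin below which a sample receives a nonzero outlier term.
    threshold = np.log((1.0 - lam) / lam)

    for _ in range(n_outer):
        # Step 1: closed-form update of delta with w held fixed.
        margin = y * (x_diff @ w)
        delta = np.maximum(0.0, threshold - margin)

        # Step 2: gradient step(s) on w with delta held fixed.
        for _ in range(n_inner):
            margin = y * (x_diff @ w)
            weight = sigmoid(-(margin + delta))  # per-sample gradient weight
            grad = -(x_diff * (weight * y)[:, None]).mean(axis=0)
            w -= lr * grad

    # Final delta consistent with the returned w.
    margin = y * (x_diff @ w)
    delta = np.maximum(0.0, threshold - margin)
    return w, delta
```

In this sketch, samples with delta_i > 0 are the candidate outliers, and their influence on the gradient of w is capped at lam. The extra work per iteration is one elementwise operation over the batch, which is in the spirit of the low overhead the abstract describes, though the paper's actual update rule and reward model may differ.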
Taxonomy
Topics: Neural Networks and Applications
