A Survey of Reinforcement Learning from Human Feedback
Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier

TL;DR
This survey reviews reinforcement learning from human feedback (RLHF), covering its fundamentals, its applications in domains such as language models, control, and robotics, and recent research trends intended to guide future work.
Contribution
It provides a comprehensive overview of RLHF techniques, algorithms, and applications across multiple domains, emphasizing recent advances and research directions.
Findings
RLHF is a promising approach for improving the alignment of AI systems' objectives with human values.
Recent success in large language models demonstrates RLHF's potential.
Control and robotics are key application domains alongside language models, and the survey covers foundational RLHF techniques in these areas.
Abstract
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of preference-based reinforcement learning (PbRL), it stands at the intersection of artificial intelligence and human-computer interaction. This positioning provides a promising approach to enhance the performance and adaptability of intelligent systems while also improving the alignment of their objectives with human values. The success in training large language models (LLMs) has impressively demonstrated this potential in recent years, where RLHF has played a decisive role in directing the model's capabilities towards human objectives. This article provides an overview of the fundamentals of RLHF, exploring how RL agents interact with human feedback. While recent…
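To make the idea of learning from human feedback instead of an engineered reward function concrete, the following is a minimal sketch of one common setup: fitting a reward model to pairwise preferences. It assumes a Bradley-Terry preference model and a linear reward over hand-picked features, neither of which is stated in the excerpt above. All names (`fit_reward`, `true_w`, the synthetic labeller) are hypothetical and chosen only for illustration; this is not the survey's own algorithm.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def reward(w, features):
    # Assumed linear reward model: r(x) = w . x
    return sum(wi * xi for wi, xi in zip(w, features))


def fit_reward(preferences, dim, lr=0.5, epochs=200):
    """Fit reward weights by gradient ascent on the Bradley-Terry log-likelihood.

    preferences: list of (preferred_features, rejected_features) pairs.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for preferred, rejected in preferences:
            # Assumed model: P(preferred beats rejected) = sigmoid(r_pref - r_rej)
            p = sigmoid(reward(w, preferred) - reward(w, rejected))
            for i in range(dim):
                grad[i] += (1.0 - p) * (preferred[i] - rejected[i])
        w = [wi + lr * gi / len(preferences) for wi, gi in zip(w, grad)]
    return w


# Synthetic stand-in for a human labeller (an assumption for this sketch):
# a hidden "true" reward decides which of two options is preferred.
rng = random.Random(0)
true_w = [2.0, -1.0]
preferences = []
for _ in range(200):
    a = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    b = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    if reward(true_w, a) >= reward(true_w, b):
        preferences.append((a, b))
    else:
        preferences.append((b, a))

learned_w = fit_reward(preferences, dim=2)
print("learned reward weights:", learned_w)
```

In a full RLHF pipeline, a reward model learned this way would then be used as the training signal for an RL agent; that policy-optimization step is omitted here.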
Taxonomy
Topics: Software Engineering Research
Methods: Focus
