SuperHF: Supervised Iterative Learning from Human Feedback
Gabriel Mukobi, Peter Chatain, Su Fong, Robert Windesheim, Gitta Kutyniok, Kush Bhatia, Silas Alberti

TL;DR
SuperHF is a new method for aligning language models that combines supervised learning with iterative human feedback, improving stability, efficiency, and safety over traditional reinforcement learning approaches.
Contribution
The paper introduces SuperHF, which replaces PPO with a supervised loss plus a KL divergence term, aiming for better alignment, more stable training, and a simpler implementation than existing RLHF methods.
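To make the idea of "a supervised loss plus a KL divergence term" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the `kl_coef` weight, the per-token averaging, and the direction of the KL term (policy relative to a frozen base model) are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(logp, logq):
    """KL(p || q) for two log-probability vectors over the same vocabulary."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))


def supervised_plus_kl_loss(policy_logits, base_logits, target_ids, kl_coef=0.1):
    """Hypothetical loss: mean token cross-entropy + kl_coef * mean token KL.

    policy_logits, base_logits: one list of vocabulary logits per token position.
    target_ids: the token id to imitate at each position.
    kl_coef: assumed weight on the KL term (not a value from the paper).
    """
    ce_total, kl_total = 0.0, 0.0
    for pol, base, tgt in zip(policy_logits, base_logits, target_ids):
        logp = log_softmax(pol)   # trainable policy distribution
        logq = log_softmax(base)  # frozen reference distribution
        ce_total += -logp[tgt]    # supervised (cross-entropy) part
        kl_total += kl_divergence(logp, logq)  # penalty for drifting from base
    n = len(target_ids)
    return ce_total / n + kl_coef * kl_total / n


# Usage: two token positions over a 4-token vocabulary.
policy = [[2.0, 0.5, 0.1, -1.0], [0.0, 1.5, 0.3, 0.2]]
base = [[1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.5, 0.5]]
loss = supervised_plus_kl_loss(policy, base, target_ids=[0, 1])
```

In this sketch the cross-entropy term pulls the policy toward the target tokens while the KL term discourages it from moving far from the base model. Raising `kl_coef` trades imitation of the targets for staying close to the base model.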
Findings
SuperHF outperforms PPO-based RLHF on training objectives.
It reduces reward hacking and improves downstream calibration.
SuperHF is simpler to implement than PPO-based RLHF while remaining effective for language model alignment.
Abstract
While large language models demonstrate remarkable capabilities, they often present challenges in terms of safety, alignment with human values, and stability during training. Here, we focus on two prevalent methods used to align these models, Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). SFT is simple and robust, powering a host of open-source models, while RLHF is a more sophisticated method used in top-tier models like ChatGPT but also suffers from instability and susceptibility to reward hacking. We propose a novel approach, Supervised Iterative Learning from Human Feedback (SuperHF), which seeks to leverage the strengths of both methods. Our hypothesis is two-fold: that the reward model used in RLHF is critical for efficient data use and model generalization and that the use of Proximal Policy Optimization (PPO) in RLHF may not be necessary and…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Position-Wise Feed-Forward Layer · Entropy Regularization · Label Smoothing · Residual Connection · Byte Pair Encoding · Dropout
