SuperHF: Supervised Iterative Learning from Human Feedback

Gabriel Mukobi; Peter Chatain; Su Fong; Robert Windesheim; Gitta Kutyniok; Kush Bhatia; Silas Alberti

arXiv:2310.16763 · cs.CL · October 26, 2023 · 2 cites

TL;DR

SuperHF is a new method for aligning language models that combines supervised learning with iterative human feedback, improving stability, efficiency, and safety over traditional reinforcement learning approaches.

Contribution

It introduces SuperHF, which replaces PPO with a simple supervised loss and a Kullback-Leibler (KL) divergence prior, aiming for better alignment, stability, and simplicity than existing RLHF methods.
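
As a rough illustration of the contribution above, the following is a minimal sketch of a supervised loss combined with a KL penalty toward a frozen reference model. It is not the authors' implementation: the function names, the `kl_coef` weight, the per-token averaging, and the direction of the KL term are all assumptions made for illustration, and plain Python lists stand in for model outputs.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(p_logits, q_logits):
    """KL(p || q) between two next-token distributions given as logits."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def superhf_style_loss(policy_logits, reference_logits, target_ids, kl_coef=0.1):
    """Cross-entropy on a chosen completion plus a KL penalty (illustrative only).

    policy_logits, reference_logits: one list of vocabulary logits per token position.
    target_ids: token ids of the completion the policy is trained to imitate.
    kl_coef: hypothetical weight on the KL term (an assumption, not from the paper).
    """
    n = len(target_ids)
    cross_entropy = 0.0
    kl_penalty = 0.0
    for pol, ref, tok in zip(policy_logits, reference_logits, target_ids):
        # Supervised term: negative log-likelihood of the target token.
        cross_entropy += -log_softmax(pol)[tok]
        # Regularizer: keep the policy's distribution close to the reference model's.
        kl_penalty += kl_divergence(pol, ref)
    return cross_entropy / n + kl_coef * kl_penalty / n
```

When the policy and the reference model agree exactly, the KL term vanishes and the loss reduces to ordinary supervised cross-entropy; the penalty grows as the policy drifts away from the reference.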

Findings

01. SuperHF outperforms PPO-based RLHF on the training objective.

02. It reduces reward hacking and improves downstream calibration.

03. SuperHF is simpler to implement than PPO-based RLHF and remains effective for language model alignment.
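
To make the calibration finding above more concrete, here is a minimal sketch of expected calibration error (ECE), one common way to quantify how well a model's stated confidence matches its accuracy. Whether the paper uses this exact metric is not stated on this page; the function name, the equal-width binning, and the default of 10 bins are assumptions chosen for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width confidence bins.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct: booleans, True where the chosen answer was right.
    n_bins: number of equal-width bins (an illustrative default).
    """
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to the share of predictions it holds.
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

A perfectly calibrated model scores 0; a model that answers with 90% confidence but is right only half the time scores about 0.4 on those predictions.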

Abstract

While large language models demonstrate remarkable capabilities, they often present challenges in terms of safety, alignment with human values, and stability during training. Here, we focus on two prevalent methods used to align these models, Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). SFT is simple and robust, powering a host of open-source models, while RLHF is a more sophisticated method used in top-tier models like ChatGPT but also suffers from instability and susceptibility to reward hacking. We propose a novel approach, Supervised Iterative Learning from Human Feedback (SuperHF), which seeks to leverage the strengths of both methods. Our hypothesis is two-fold: that the reward model used in RLHF is critical for efficient data use and model generalization and that the use of Proximal Policy Optimization (PPO) in RLHF may not be necessary and…

Code & Models

Repositories

openfeedback/superhf (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Position-Wise Feed-Forward Layer · Entropy Regularization · Label Smoothing · Residual Connection · Byte Pair Encoding · Dropout