Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment

Jiaxiang Li; Siliang Zeng; Hoi-To Wai; Chenliang Li; Alfredo Garcia; Mingyi Hong

arXiv:2405.17888 · cs.AI · October 29, 2024 · 1 citation

TL;DR

This paper introduces a novel IRL-based method for supervised fine-tuning of large language models, improving alignment by leveraging reward learning from human demonstrations throughout the training process.

Contribution

The paper proposes an IRL-based approach to SFT that improves robustness and efficiency, connects this approach to Self-Play Fine-Tuning (SPIN) methods, and reports stronger empirical performance than standard SFT.

Findings

01. Significant performance improvements over existing SFT methods.
02. Effective alignment of 1B and 7B models on benchmark tasks.
03. Robustness to low-quality supervised data.

Abstract

Aligning with human preferences and values is an important requirement for contemporary foundation models. State-of-the-art techniques such as Reinforcement Learning from Human Feedback (RLHF) often consist of two stages: 1) supervised fine-tuning (SFT), where the model is fine-tuned by learning from human demonstration data; 2) preference learning, where preference data is used to learn a reward model, which is in turn used by a reinforcement learning (RL) step to fine-tune the model. Such a reward model serves as a proxy for human preference, and it is critical for guiding the RL step towards improving the model quality. In this work, we argue that the SFT stage significantly benefits from learning a reward model as well. Instead of using the human demonstration data directly via supervised learning, we propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build…
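
For readers unfamiliar with the two RLHF stages the abstract refers to, the standard objectives can be sketched as below. This is a minimal illustration, not the paper's IRL-based method: the function names `sft_loss` and `preference_loss` are hypothetical, and the Bradley-Terry form of the preference loss is an assumption (a common choice for reward-model training that the abstract itself does not specify).

```python
import math


def sft_loss(token_logprobs):
    """Stage 1 (SFT): average negative log-likelihood of the demonstration tokens.

    token_logprobs: log-probabilities the model assigns to each token
    of a human-written demonstration.
    """
    return -sum(token_logprobs) / len(token_logprobs)


def preference_loss(reward_chosen, reward_rejected):
    """Stage 2 (reward learning): Bradley-Terry loss, assumed here for illustration.

    Computes -log(sigmoid(reward_chosen - reward_rejected)) in a
    numerically stable way.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite to avoid overflow in exp().
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # A demonstration whose tokens each get probability 0.5 under the model.
    print(sft_loss([math.log(0.5)] * 4))
    # The loss shrinks as the reward model ranks the preferred response higher.
    print(preference_loss(2.0, 0.0), preference_loss(0.0, 2.0))
```

In this conventional setup the reward model appears only in the second stage; the abstract's argument is that the SFT stage also benefits from learning a reward model from the demonstration data.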

Code & Models

Repositories

jasonjiaxiangli/reward_learning_sft
PyTorch · Official

Videos

Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment · SlidesLive

Taxonomy

Topics: Artificial Intelligence in Law · Natural Language Processing Techniques

Methods: ALIGN · Shrink and Fine-Tune