Chain of Hindsight Aligns Language Models with Feedback

Hao Liu; Carmelo Sferrazza; Pieter Abbeel

arXiv:2302.02676 · cs.LG · October 19, 2023 · 27 cites

TL;DR

The paper introduces Chain of Hindsight, a new method for aligning language models with human preferences by converting feedback into language sequences for fine-tuning, improving efficiency and effectiveness over prior techniques.

Contribution

It proposes a novel, easy-to-optimize feedback learning method that leverages language sequences, enabling models to learn from any feedback type and improve alignment.

Findings

01. CoH yields significant improvements on summarization and dialogue benchmarks.

02. Its outputs are markedly preferred in human evaluations.

03. It outperforms previous alignment methods.

Abstract

Learning from human preferences is important for language models to match human needs and to align with human and social values. Prior works have achieved remarkable successes by learning from human feedback to understand and follow instructions. Nonetheless, these methods are either founded on hand-picked model generations that are favored by human annotators, rendering them inefficient in terms of data utilization and challenging to apply in general, or they depend on reinforcement learning, which often suffers from imperfect reward functions and relies on extremely challenging optimizations. In this work, we propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity. Our idea is inspired by how humans learn from extensive feedback presented in the form of languages. We convert all types of feedback into…
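
As a rough illustration of what converting feedback into a language sequence can look like, the sketch below chains a prompt with rated model outputs and prefixes each output with a natural-language feedback phrase. The `RatedOutput` type, the `build_hindsight_sequence` helper, the exact phrases, and the worst-to-best ordering are assumptions made for this example, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class RatedOutput:
    text: str
    score: float  # higher means more preferred by annotators


# Hypothetical feedback phrases chosen for this sketch; other wordings
# (for example 'Good:' / 'Bad:') would work the same way.
POSITIVE = "A helpful answer:"
NEGATIVE = "An unhelpful answer:"


def build_hindsight_sequence(prompt: str, outputs: list[RatedOutput]) -> str:
    """Chain outputs from least to most preferred, each prefixed by feedback."""
    if not outputs:
        raise ValueError("need at least one rated output")
    # Assumption: order from worst to best so the preferred output comes last.
    ranked = sorted(outputs, key=lambda o: o.score)
    best = ranked[-1].score
    parts = [prompt]
    for out in ranked:
        # Only the top-scored output(s) get the positive phrase.
        label = POSITIVE if out.score == best else NEGATIVE
        parts.append(f"{label} {out.text}")
    return "\n".join(parts)


if __name__ == "__main__":
    example = build_hindsight_sequence(
        "How do I boil an egg?",
        [
            RatedOutput("Put it in boiling water for about ten minutes.", 1.0),
            RatedOutput("Eggs are food.", 0.0),
        ],
    )
    print(example)
```

A model fine-tuned on strings of this shape with an ordinary next-token objective could then, in principle, be prompted with the positive phrase alone at inference time.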

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 4

Strengths

(1) The paper presents a novel technique, Chain of Hindsight (CoH), which addresses the challenge of aligning language models with human preferences and values by leveraging human feedback. (2) The approach is easy to optimize and can learn from any form of feedback, regardless of its polarity. (3) The paper provides a well-structured review of relevant literature, including prior works on learning from human feedback and language modeling.

Weaknesses

This is a good paper. I see no reasons to reject it. Only a few comments: 1) I am confused by the illustrated examples. In Figure 1, the prompt template uses 'a helpful answer' / 'an unhelpful answer', while Section 1 uses 'Good' / 'Bad'. It would be better to be consistent. 2) Some important studies [1,2] are missing. It would be better to include and discuss them. [1] RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment [2] Direct Preferen

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 3

Strengths

1. The Chain of Hindsight (CoH) method is a novel approach, addressing the limitations of previous methods like supervised fine-tuning and Reinforcement Learning with Human Feedback. It's innovative in using both positive and negative feedback for model training. 2. Simplicity and Scalability: The CoH method maintains the same training objective as pretraining, which simplifies the training process and enhances scalability. This is a significant advantage over more complex systems like RLHF. Th

Weaknesses

1. Limited Scope of Testing: The paper considers only two kinds of evaluation benchmarks, dialogue and summarization. It is not clear how the model performs on standard academic benchmarks. Broader testing across diverse datasets and real-world scenarios would be necessary to fully validate the approach.

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

- CoH is well-motivated, interesting, novel, and performs well. The training methodology is simple, yields strong benefits and is easily extendible.
- The experiments are thorough, several datasets/settings are explored, there are strong baselines, and human evaluation is conducted.
- The analyses (e.g., w/o lang ablation, scaling) are meaningful, sound, and an impactful contribution.

Weaknesses

1. More details needed about RLHF. At present, it's unclear why CoH outperforms RLHF. Is the trained RM ineffective at modeling preference (add RM performance to Fig3)? Or is the learning algorithm unable to leverage the RM effectively (could it be the choice of prompts used for RLHF?), in which case an alternate RM-based baseline could be considered (e.g. rejection sampling, reinforced self-training)? 2. Why do you only prompt with 'Good:' at inference time? It seems that an advantage of the

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: ALIGN