Chain of Hindsight Aligns Language Models with Feedback
Hao Liu, Carmelo Sferrazza, Pieter Abbeel

TL;DR
The paper introduces Chain of Hindsight, a new method for aligning language models with human preferences by converting feedback into language sequences for fine-tuning, improving efficiency and effectiveness over prior techniques.
Contribution
It proposes a novel, easy-to-optimize feedback learning method that leverages language sequences, enabling models to learn from any feedback type and improve alignment.
Findings
Significant improvements on summarization and dialogue benchmarks.
Markedly preferred in human evaluations.
Outperforms previous alignment methods.
Abstract
Learning from human preferences is important for language models to match human needs and to align with human and social values. Prior works have achieved remarkable successes by learning from human feedback to understand and follow instructions. Nonetheless, these methods are either founded on hand-picked model generations that are favored by human annotators, rendering them inefficient in terms of data utilization and challenging to apply in general, or they depend on reinforcement learning, which often suffers from imperfect reward functions and relies on extremely challenging optimizations. In this work, we propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity. Our idea is inspired by how humans learn from extensive feedback presented in the form of languages. We convert all types of feedback into…
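As a minimal sketch of the idea of turning feedback into language sequences, the snippet below joins a preferred and a rejected output into one training string, each labelled with a natural-language feedback phrase. The phrases are borrowed loosely from the 'a helpful answer' / 'an unhelpful answer' template that the reviews mention. The names `PreferencePair`, `build_training_sequence`, and `build_inference_prompt`, and the exact layout of the string, are assumptions made for illustration and are not taken from the authors' code.

```python
from dataclasses import dataclass

# Hypothetical feedback phrases; the paper's actual templates may differ.
POSITIVE_TAG = "A helpful answer:"
NEGATIVE_TAG = "An unhelpful answer:"


@dataclass
class PreferencePair:
    """One prompt with a human-preferred and a human-rejected output."""
    prompt: str
    preferred: str
    rejected: str


def build_training_sequence(pair: PreferencePair, positive_first: bool = True) -> str:
    """Join both outputs, each labelled with its feedback phrase, into one string."""
    good = f"{POSITIVE_TAG} {pair.preferred}"
    bad = f"{NEGATIVE_TAG} {pair.rejected}"
    ordered = [good, bad] if positive_first else [bad, good]
    return "\n".join([pair.prompt, *ordered])


def build_inference_prompt(prompt: str) -> str:
    """Condition generation on the positive feedback phrase only."""
    return f"{prompt}\n{POSITIVE_TAG}"


pair = PreferencePair(
    prompt="Summarize: The cat sat on the mat.",
    preferred="A cat sat on a mat.",
    rejected="Dogs are great.",
)
print(build_training_sequence(pair))
print(build_inference_prompt(pair.prompt))
```

A string built this way can be fed to an ordinary next-token-prediction fine-tuning loop, which is consistent with the reviewers' remark that the method keeps the same training objective as pretraining. At inference time only the positive phrase is appended, mirroring the 'Good:' prompting that one review asks about.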
Peer Reviews
Decision: ICLR 2024 poster
(1) The paper presents a novel technique, Chain of Hindsight (CoH), which addresses the challenge of aligning language models with human preferences and values by leveraging human feedback. (2) The approach is easy to optimize and can learn from any form of feedback, regardless of its polarity. (3) The paper provides a well-structured review of relevant literature, including prior works on learning from human feedback and language modeling.
This is a good paper. I see no reasons to reject it. Only a few comments: 1) I am confused by the illustrated examples. In Figure 1, the prompt template uses 'a helpful answer' / 'an unhelpful answer', while in Section 1 the examples use 'Good' / 'Bad'. It would be better to be consistent. 2) Some important studies [1,2] are missing. It would be better to include and discuss them. [1] RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment [2] Direct Preferen
1. The Chain of Hindsight (CoH) method is a novel approach, addressing the limitations of previous methods like supervised fine-tuning and Reinforcement Learning from Human Feedback (RLHF). It's innovative in using both positive and negative feedback for model training. 2. Simplicity and Scalability: The CoH method maintains the same training objective as pretraining, which simplifies the training process and enhances scalability. This is a significant advantage over more complex systems like RLHF. Th
1. Limited Scope of Testing: The paper evaluates on only two kinds of benchmarks, dialogue and summarization. It is not clear how the model performs on standard academic benchmarks. Broader testing across diverse datasets and real-world scenarios would be necessary to fully validate the approach.
- CoH is well-motivated, interesting, novel, and performs well. The training methodology is simple, yields strong benefits and is easily extendible. - The experiments are thorough, several datasets/settings are explored, there are strong baselines, and human evaluation is conducted. - The analyses (e.g., w/o lang ablation, scaling) are meaningful, sound, and an impactful contribution.
1. More details needed about RLHF. At present, it's unclear why CoH outperforms RLHF. Is the trained RM ineffective at modeling preference (add RM performance to Fig3)? Or is the learning algorithm unable to leverage the RM effectively (could it be the choice of prompts used for RLHF?), in which case an alternate RM-based baseline could be considered (e.g. rejection sampling, reinforced self-training)? 2. Why do you only prompt with 'Good:' at inference time? It seems that an advantage of the
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: ALIGN
