Reinforcement Learning without Human Feedback for Last Mile Fine-Tuning of Large Language Models

Alec Solway

arXiv:2408.16753·cs.CL·August 30, 2024

Open Access

TL;DR

This paper introduces a reinforcement learning framework for last-mile fine-tuning of large language models, improving performance over traditional likelihood maximization, especially in tasks like abstractive summarization.

Contribution

It develops a novel reinforcement learning-based approach for last-mile fine-tuning, demonstrating its effectiveness beyond likelihood maximization in language model optimization.

Findings

01. Reinforcement learning outperforms likelihood maximization in raw prediction quality.

02. Post-processing the likelihood-maximization outputs narrows the performance gap.

03. The framework can be adapted to penalize undesirable outputs such as hallucinations.

Abstract

Reinforcement learning is used to align language models with human preference signals after first pre-training the model to predict the next token of text within a large corpus using likelihood maximization. Before being deployed in a specific domain, models are often further fine-tuned on task-specific data. Since human preferences are often unavailable for this last step, it is performed using likelihood maximization, the typical default method. However, reinforcement learning has other advantages besides facilitating alignment to a human-derived reward function. For one, whereas likelihood maximization is a form of imitation learning in which the model is trained on what to do under ideal conditions, reinforcement learning is not limited to demonstrating actions just for optimally reached states and teaches a model what to do under a range of scenarios as it explores the…
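The contrast the abstract draws between imitation via likelihood maximization and learning from explored outcomes can be illustrated with a toy sketch. This is not the paper's method or experimental setup: the three-output "policy", the reward values in `REWARDS`, and the function names below are all hypothetical, and plain REINFORCE with a batch-mean baseline is used only as a familiar stand-in for a reinforcement learning update.

```python
import math
import random

# Hypothetical toy setup (not from the paper): three candidate outputs.
# Output 0 matches the reference, output 1 is an acceptable paraphrase,
# output 2 stands in for a hallucinated answer.
REWARDS = [1.0, 0.5, -1.0]   # assumed reward values, chosen for illustration
REFERENCE = 0                # the single demonstrated output


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mle_step(logits, lr):
    """One gradient step on -log p(REFERENCE); only the demonstration matters."""
    probs = softmax(logits)
    grad = [p - (1.0 if i == REFERENCE else 0.0) for i, p in enumerate(probs)]
    return [x - lr * g for x, g in zip(logits, grad)]


def reinforce_step(logits, lr, rng, n_samples=500):
    """One REINFORCE step: sample outputs, weight log-prob gradients by advantage."""
    probs = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=n_samples)
    rewards = [REWARDS[a] for a in actions]
    baseline = sum(rewards) / n_samples  # batch-mean baseline to reduce variance
    grad = [0.0] * len(logits)
    for a, r in zip(actions, rewards):
        advantage = r - baseline
        for i, p in enumerate(probs):
            # d log p(a) / d logit_i = 1[i == a] - p_i
            grad[i] += advantage * ((1.0 if i == a else 0.0) - p) / n_samples
    return [x + lr * g for x, g in zip(logits, grad)]  # ascend expected reward


def train(mode, steps=50, lr=0.5, seed=0):
    """Train from uniform logits with either 'mle' or 'reinforce' updates."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        if mode == "mle":
            logits = mle_step(logits, lr)
        else:
            logits = reinforce_step(logits, lr, rng)
    return softmax(logits)


if __name__ == "__main__":
    print("MLE      :", [round(p, 3) for p in train("mle")])
    print("REINFORCE:", [round(p, 3) for p in train("reinforce")])
```

In this toy, the likelihood update sees only the reference output, so it treats the paraphrase and the hallucination identically. The reward-driven update samples all three outputs and pushes the penalized one down faster, which is the kind of signal the third finding refers to.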

Taxonomy

Topics: Topic Modeling · Robotics and Automated Systems · Speech and dialogue systems

Methods: ALIGN