Reinforcement Learning without Human Feedback for Last Mile Fine-Tuning of Large Language Models
Alec Solway

TL;DR
This paper introduces a reinforcement learning framework for last-mile fine-tuning of large language models, improving performance over traditional likelihood maximization, especially in tasks like abstractive summarization.
Contribution
It develops a novel reinforcement learning-based approach for last-mile fine-tuning, demonstrating its effectiveness beyond likelihood maximization in language model optimization.
Findings
Reinforcement learning outperforms likelihood maximization in raw prediction quality.
The performance gap can be reduced with post-processing of likelihood outputs.
The framework can be adapted to penalize undesirable outputs such as hallucinations.
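The contrast between the two objectives in the findings above can be illustrated with a toy, single-step example. This is only a minimal sketch under stated assumptions, not the paper's method: it uses a plain REINFORCE-style policy gradient with a mean-reward baseline (the summary does not say which algorithm the paper uses), a three-token "vocabulary", and a hypothetical reward function `toy_reward` in which token 0 is the reference output, token 1 is an acceptable alternative, and token 2 stands in for a hallucination.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def likelihood_grad(logits, target):
    """Gradient of -log p(target) w.r.t. the logits: p - onehot(target)."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def reinforce_grad(logits, reward_fn, n_samples, rng):
    """Monte Carlo estimate of the gradient of -E[reward] w.r.t. the logits.

    Outputs are sampled from the current policy, so the model also receives
    a learning signal for outputs it would never see in the reference data.
    """
    probs = softmax(logits)
    samples = rng.choices(range(len(logits)), weights=probs, k=n_samples)
    rewards = [reward_fn(a) for a in samples]
    baseline = sum(rewards) / n_samples  # mean-reward baseline to cut variance
    grad = [0.0] * len(logits)
    for a, r in zip(samples, rewards):
        advantage = r - baseline
        for i, p in enumerate(probs):
            # d/dlogit_i of (-advantage * log p(a)) = advantage * (p_i - 1[i == a])
            grad[i] += advantage * (p - (1.0 if i == a else 0.0)) / n_samples
    return grad


def toy_reward(token):
    """Hypothetical reward: reference, acceptable alternative, hallucination."""
    return {0: 1.0, 1: 0.5, 2: -1.0}[token]


def train(grad_fn, steps=200, lr=0.5):
    """Plain gradient descent on three logits, starting from a uniform policy."""
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        g = grad_fn(logits)
        logits = [x - lr * gi for x, gi in zip(logits, g)]
    return softmax(logits)


rng = random.Random(0)
mle_probs = train(lambda lg: likelihood_grad(lg, target=0))
rl_probs = train(lambda lg: reinforce_grad(lg, toy_reward, 64, rng))
print("likelihood maximization:", [round(p, 3) for p in mle_probs])
print("reinforcement learning: ", [round(p, 3) for p in rl_probs])
```

In this sketch, the likelihood objective only ever sees the reference token, while the reward function is where a penalty for undesirable outputs would be expressed; changing the value assigned to token 2 changes how strongly such outputs are discouraged.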
Abstract
Reinforcement learning is used to align language models with human preference signals after first pre-training the model to predict the next token of text within a large corpus using likelihood maximization. Before being deployed in a specific domain, models are often further fine-tuned on task-specific data. Since human preferences are often unavailable for the last step, it is performed using likelihood maximization, as that is the typical default method. However, reinforcement learning has other advantages besides facilitating alignment to a human-derived reward function. For one, whereas likelihood maximization is a form of imitation learning in which the model is trained on what to do under ideal conditions, reinforcement learning is not limited to demonstrating actions just for optimally reached states and teaches a model what to do under a range of scenarios as it explores the…
Taxonomy
Topics: Topic Modeling · Robotics and Automated Systems · Speech and dialogue systems
Methods: ALIGN
