Efficiently Aligning Language Models with Online Natural Language Feedback
Christine Ye, Joe Benton

TL;DR
This paper introduces methods for aligning language models in complex, hard-to-supervise domains using online natural language feedback, reducing the number of expert-labeled samples needed by roughly 3-50x in the reported experiments.
Contribution
It develops techniques to optimize language models with minimal expert supervision by iteratively updating proxy reward models using in-context learning and fine-tuning.
Findings
ICL methods recover up to 35% of performance with 50x fewer samples.
Fine-tuning recovers 80-100% of performance with 3-30x fewer samples.
Online natural language feedback improves data efficiency of supervision.
Abstract
Reinforcement learning with verifiable rewards has been used to elicit impressive performance from language models in many domains. But broadly beneficial deployments of AI may require us to train models with strong capabilities in "fuzzy", hard-to-supervise domains. In this paper, we develop methods to align language models in fuzzy domains where human experts are still able to provide a high-quality supervision signal, but only for a small number of model outputs, using online natural language feedback. Specifically, we train models by iteratively optimizing against proxy reward signals, stopping at the point of over-optimization, collecting fresh expert supervision, and updating the proxy reward. We construct proxy reward models from language models using in-context learning (ICL) and fine-tuning. We test our methods by eliciting creative writing and alignment research capabilities in…
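The loop described in the abstract (optimize against a proxy, stop before over-optimizing, collect fresh expert labels, update the proxy) can be illustrated with a toy sketch. This is not the authors' implementation. Everything here is a hypothetical stand-in: the "policy" is a single number, `expert_score` plays the role of the scarce expert, `fit_proxy` is a least-squares line rather than a language-model reward model, and a fixed step budget (`max_steps`) is used as a crude substitute for detecting the point of over-optimization.

```python
import random
from typing import Callable, List, Tuple

Labeled = List[Tuple[float, float]]  # (model output, expert score) pairs


def expert_score(x: float) -> float:
    """Hypothetical expert: expensive to query, prefers outputs near x = 3."""
    return -(x - 3.0) ** 2


def fit_proxy(data: Labeled) -> Callable[[float], float]:
    """Fit a linear proxy reward to the labeled pairs by least squares.

    Assumption: a line stands in for the ICL or fine-tuned reward model.
    It is only trustworthy close to the labeled samples.
    """
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in data)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = cov_xy / var_x if var_x > 0 else 0.0
    return lambda x: mean_y + slope * (x - mean_x)


def train(rounds: int = 10, labels_per_round: int = 4,
          max_steps: int = 5, lr: float = 0.05,
          seed: int = 0) -> Tuple[float, int]:
    """Run the iterative loop; return the final policy and labels used."""
    rng = random.Random(seed)
    policy = 0.0
    labels_used = 0
    for _ in range(rounds):
        # 1. Collect fresh expert supervision on outputs near the current policy.
        samples = [policy + rng.uniform(-0.25, 0.25)
                   for _ in range(labels_per_round)]
        data = [(x, expert_score(x)) for x in samples]
        labels_used += len(data)

        # 2. Update the proxy reward from the new labels.
        proxy = fit_proxy(data)

        # 3. Optimize against the proxy for a capped number of steps.
        #    Assumption: the cap replaces a real over-optimization check.
        for _ in range(max_steps):
            grad = (proxy(policy + 1e-3) - proxy(policy - 1e-3)) / 2e-3
            policy += lr * grad
    return policy, labels_used


if __name__ == "__main__":
    final_policy, labels_used = train()
    print(f"final policy: {final_policy:.3f}, expert labels used: {labels_used}")
```

The design point the sketch tries to capture is that the proxy is refit on samples drawn from the current policy each round, so it is never asked to extrapolate far from where it was supervised. Without the step cap, a linear proxy would keep rewarding movement in one direction long after the true expert score had started to fall.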
