WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano; Jacob Hilton; Suchir Balaji; Jeff Wu; Long Ouyang; Christina Kim; Christopher Hesse; Shantanu Jain; Vineet Kosaraju; William Saunders; Xu Jiang; Karl Cobbe; Tyna Eloundou; Gretchen Krueger; Kevin Button; Matthew Knight; Benjamin Chess; John Schulman

arXiv:2112.09332 · cs.CL · June 3, 2022 · 33 citations

TL;DR

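WebGPT improves long-form question answering by fine-tuning GPT-3 to search and navigate the web in a text-based browsing environment, collect references that support its answers, and learn from human feedback. On ELI5, a dataset of questions asked by Reddit users, its answers are preferred by humans over those of human demonstrators and over the highest-voted Reddit answers.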

Contribution

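The paper combines a text-based web-browsing environment, imitation learning (behavior cloning), and human feedback to improve long-form question answering, and requires the model to collect references while browsing so that humans can more easily judge the factual accuracy of its answers.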

Findings

01

Model answers are preferred 56% of the time over those written by human demonstrators.

02

Model answers are preferred 69% of the time over the highest-voted Reddit answers.

03

Web browsing improves factual accuracy and answer quality.
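Findings 01 and 02 are preference rates computed from pairwise human comparisons. The sketch below shows one way to compute such a rate; the label names (`"model"`, `"baseline"`, `"tie"`) and the choice to count a tie as half a win are assumptions made for illustration, not details taken from this page.

```python
from collections import Counter


def preference_rate(judgments):
    """Fraction of pairwise comparisons won by the model's answer.

    Each judgment is "model", "baseline", or "tie". Counting a tie as half
    a win is an assumption of this sketch.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    unknown = set(counts) - {"model", "baseline", "tie"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    wins = counts["model"] + 0.5 * counts["tie"]
    return wins / len(judgments)


# Hypothetical toy judgments, not data from the paper.
toy = ["model"] * 7 + ["baseline"] * 2 + ["tie"]
print(f"{preference_rate(toy):.0%}")  # prints "75%"
```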

Abstract

We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
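The abstract's final training step, rejection sampling against a reward model trained on human preferences, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the reward model is trained on pairwise comparisons with a Bradley-Terry style loss (a common formulation, assumed here), and `toy_policy`, `toy_reward`, and `ANSWERS` are hypothetical stand-ins for the behavior-cloned GPT-3 policy and the learned reward model.

```python
import math
import random
from typing import Callable


def pairwise_preference_loss(r_preferred: float, r_rejected: float) -> float:
    """Loss for one human comparison under an assumed Bradley-Terry model:
    P(preferred answer wins) = sigmoid(r_preferred - r_rejected).
    Returns -log of that probability, computed stably for large margins."""
    margin = r_preferred - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def best_of_n(
    question: str,
    sample_answer: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int,
) -> str:
    """Rejection sampling (best-of-n): draw n answers from the policy and
    keep the one the reward model scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(question, answer))


# Hypothetical stand-ins for the behavior-cloned policy and the reward model.
ANSWERS = [
    "No idea.",
    "Clouds release rain when droplets grow heavy [1].",
    "Rising air cools, vapor condenses, and droplets fall [1][2].",
]


def toy_policy(question: str, rng: random.Random) -> str:
    # Samples one canned answer at random, ignoring the question.
    return rng.choice(ANSWERS)


def toy_reward(question: str, answer: str) -> float:
    # Illustrative only: score an answer by how many references it cites.
    return float(answer.count("["))


rng = random.Random(0)
print(best_of_n("Why does it rain?", lambda q: toy_policy(q, rng), toy_reward, n=8))
print(pairwise_preference_loss(1.0, 0.0))
```

Best-of-n leaves the policy's weights unchanged and spends extra compute at answer time instead: each question costs n policy samples plus n reward-model evaluations.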

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Attention Dropout · Dense Connections · Adam · Linear Warmup With Cosine Annealing