Improving alignment of dialogue agents via targeted human judgements

Amelia Glaese; Nat McAleese; Maja Trębacz; John Aslanides; Vlad Firoiu; Timo Ewalds; Maribeth Rauh; Laura Weidinger; Martin Chadwick; Phoebe Thacker; Lucy Campbell-Gillingham; Jonathan Uesato; Po-Sen Huang; Ramona Comanescu; Fan Yang; Abigail See; Sumanth Dathathri; Rory Greig; Charlie Chen; Doug Fritz; Jaume Sanchez Elias; Richard Green; Soňa Mokrá; Nicholas Fernando; Boxi Wu; Rachel Foley; Susannah Young; Iason Gabriel; William Isaac; John Mellor; Demis Hassabis; Koray Kavukcuoglu; Lisa Anne Hendricks; Geoffrey Irving

arXiv:2209.14375 · cs.LG · September 30, 2022 · 131 cites

TL;DR

This paper introduces Sparrow, a dialogue agent trained with human feedback and rule-based guidance to be more helpful, correct, and harmless, demonstrating improved factual accuracy and resilience to adversarial probing.

Contribution

The paper presents a novel approach combining rule-based guidance and evidence provision in reinforcement learning from human feedback for dialogue agents.

Findings

01. For factual questions, the evidence Sparrow provides supports the sampled response 78% of the time.

02. Human raters prefer Sparrow's responses over those of prompted language model baselines.

03. Under adversarial probing, Sparrow violates its rules only 8% of the time.

Abstract

We present Sparrow, an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines. We use reinforcement learning from human feedback to train our models with two new additions to help human raters judge agent behaviour. First, to make our agent more helpful and harmless, we break down the requirements for good dialogue into natural language rules the agent should follow, and ask raters about each rule separately. We demonstrate that this breakdown enables us to collect more targeted human judgements of agent behaviour and allows for more efficient rule-conditional reward models. Second, our agent provides evidence from sources supporting factual claims when collecting preference judgements over model statements. For factual questions, evidence provided by Sparrow supports the sampled response 78% of the time.…
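To make the idea of asking raters about each rule separately more concrete, here is a minimal sketch of how per-rule judgements could be tallied into violation rates. It is an illustration only, not the paper's pipeline: the `RuleJudgement` record, the two function names, the rule labels, and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleJudgement:
    """One rater's answer to 'did this response break this rule?' (hypothetical schema)."""
    dialogue_id: str
    rule: str        # hypothetical rule label, not taken from the paper
    violated: bool


def per_rule_violation_rate(judgements):
    """Return {rule: fraction of that rule's judgements marked as violations}."""
    counts = defaultdict(lambda: [0, 0])  # rule -> [violations, total]
    for j in judgements:
        counts[j.rule][0] += int(j.violated)
        counts[j.rule][1] += 1
    return {rule: v / n for rule, (v, n) in counts.items()}


def dialogue_violation_rate(judgements):
    """Fraction of dialogues in which at least one rule was judged violated."""
    violated_by_dialogue = defaultdict(bool)
    for j in judgements:
        violated_by_dialogue[j.dialogue_id] |= j.violated
    if not violated_by_dialogue:
        return 0.0
    return sum(violated_by_dialogue.values()) / len(violated_by_dialogue)


# Made-up sample data: four dialogues, two illustrative rules.
sample = [
    RuleJudgement("d1", "no_stereotypes", False),
    RuleJudgement("d1", "no_medical_advice", False),
    RuleJudgement("d2", "no_stereotypes", True),
    RuleJudgement("d2", "no_medical_advice", False),
    RuleJudgement("d3", "no_stereotypes", False),
    RuleJudgement("d4", "no_medical_advice", False),
]

rule_rates = per_rule_violation_rate(sample)
overall_rate = dialogue_violation_rate(sample)
```

With this sample, one of four dialogues contains a violation, so `overall_rate` is 0.25. Keeping one record per (dialogue, rule) pair is what lets the same data feed both a per-rule breakdown and an overall rate; a single "was this response bad?" label would not support the per-rule view.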

Videos

ChatGPT vs Sparrow - Battle of Chatbots · YouTube

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems