Recipes for Safety in Open-domain Chatbots

Jing Xu; Da Ju; Margaret Li; Y-Lan Boureau; Jason Weston; Emily Dinan

arXiv:2010.07079·cs.CL·August 6, 2021·98 cites

TL;DR

This paper presents new methods for training and evaluating open-domain generative dialogue models that reduce offensive behavior and unwanted biases while maintaining engagingness. The methods are a human-and-model-in-the-loop framework and a technique that distills safety considerations into the generative model itself.

Contribution

Introduces a human-and-model-in-the-loop framework for both training and evaluating safer dialogue models, and a method that distills safety considerations into the generative model so that no external classifier is needed at deployment time.

Findings

01 The new techniques are safer than existing models, as measured by both automatic and human evaluations.

02 The safety methods maintain usability metrics such as engagingness relative to the state of the art.

03 An analysis of the models' failure cases highlights the limitations of the work.

Abstract

Models trained on large unlabeled corpora of human interactions will learn patterns and mimic behaviors therein, which include offensive or otherwise toxic behavior and unwanted biases. We investigate a variety of methods to mitigate these issues in the context of open-domain generative dialogue models. We introduce a new human-and-model-in-the-loop framework for both training safer models and for evaluating them, as well as a novel method to distill safety considerations inside generative models without the use of an external classifier at deployment time. We conduct experiments comparing these methods and find our new techniques are (i) safer than existing models as measured by automatic and human evaluations while (ii) maintaining usability metrics such as engagingness relative to the state of the art. We then discuss the limitations of this work by analyzing failure cases of our…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · AI in Service Interactions