Recipes for Safety in Open-domain Chatbots
Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, Emily Dinan

TL;DR
This paper presents new methods for training and evaluating open-domain chatbots that reduce offensive behavior and unwanted biases while maintaining engagingness, using a human-and-model-in-the-loop framework and a method that distills safety considerations into the generative model itself.
Contribution
Introduces a novel human-and-model-in-the-loop framework for both training and evaluating safer dialogue models, together with a safety distillation method that does not rely on an external classifier at deployment time.
Findings
Models trained with the new techniques are safer than existing models according to automatic and human evaluations.
The safety methods maintain engagingness comparable to the state of the art.
Analysis of failure cases highlights current limitations.
Abstract
Models trained on large unlabeled corpora of human interactions will learn patterns and mimic behaviors therein, which include offensive or otherwise toxic behavior and unwanted biases. We investigate a variety of methods to mitigate these issues in the context of open-domain generative dialogue models. We introduce a new human-and-model-in-the-loop framework for both training safer models and for evaluating them, as well as a novel method to distill safety considerations inside generative models without the use of an external classifier at deployment time. We conduct experiments comparing these methods and find our new techniques are (i) safer than existing models as measured by automatic and human evaluations while (ii) maintaining usability metrics such as engagingness relative to the state of the art. We then discuss the limitations of this work by analyzing failure cases of our…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · AI in Service Interactions
