Learn What NOT to Learn: Towards Generative Safety in Chatbots

Leila Khalatbari; Yejin Bang; Dan Su; Willy Chung; Saeed Ghadimi; Hossein Sameti; Pascale Fung

arXiv:2304.11220 · cs.CL · April 26, 2023 · 1 citation

Open Access

TL;DR

This paper introduces LOT, a contrastive learning framework that improves chatbot safety by reducing toxic outputs without compromising conversational quality, using learned safe and unsafe language distributions.

Contribution

The paper presents a novel contrastive learning approach that automatically leverages safe and unsafe language signals to enhance safety in generative chatbots while maintaining engagement.

Findings

01

Reduces toxicity up to fourfold

02

Achieves four- to sixfold higher engagingness and fluency than baseline models

03

Human evaluation confirms effectiveness

Abstract

Conversational models that are generative and open-domain are particularly susceptible to generating unsafe content since they are trained on web-based social data. Prior approaches to mitigating this issue have drawbacks, such as disrupting the flow of conversation, limited generalization to unseen toxic input contexts, and sacrificing the quality of the dialogue for the sake of safety. In this paper, we present a novel framework, named "LOT" (Learn NOT to), that employs a contrastive loss to enhance generalization by learning from both positive and negative training signals. Our approach differs from the standard contrastive learning framework in that it automatically obtains positive and negative signals from the safe and unsafe language distributions that have been learned beforehand. The LOT framework utilizes divergence to steer the generations away from the unsafe subspace and…
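The abstract describes a contrastive loss that uses divergence from previously learned safe and unsafe language distributions. The sketch below illustrates that general idea on toy next-token distributions. It is not the paper's actual objective: the hinge form, the `margin` parameter, the use of KL divergence, and all function names are assumptions made for illustration only.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def contrastive_divergence_loss(model_logits, safe_probs, unsafe_probs, margin=1.0):
    """Hypothetical hinge-style loss (an assumption, not the paper's formula).

    It is zero once the model distribution is at least `margin` closer
    (in KL divergence) to the safe distribution than to the unsafe one,
    and grows as the model drifts towards the unsafe distribution.
    """
    p = softmax(model_logits)
    d_safe = kl_divergence(p, safe_probs)      # positive signal: stay close
    d_unsafe = kl_divergence(p, unsafe_probs)  # negative signal: move away
    return max(0.0, margin + d_safe - d_unsafe)

# Toy three-token vocabulary; both distributions are made-up values.
safe = [0.7, 0.2, 0.1]
unsafe = [0.1, 0.2, 0.7]

loss_near_safe = contrastive_divergence_loss([math.log(x) for x in safe], safe, unsafe)
loss_near_unsafe = contrastive_divergence_loss([math.log(x) for x in unsafe], safe, unsafe)
```

In this toy setup, a model whose output matches the safe distribution incurs no penalty, while one matching the unsafe distribution is penalized, which is the qualitative behavior the abstract attributes to learning from both positive and negative signals.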

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Contrastive Learning