Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation

Zhexin Zhang; Jiale Cheng; Hao Sun; Jiawen Deng; Fei Mi; Yasheng Wang; Lifeng Shang; Minlie Huang

arXiv:2212.01810·cs.CL·December 6, 2022

TL;DR

This paper introduces a reverse generation method to create diverse, category-controlled adversarial contexts for dialogue safety testing, significantly improving safety evaluation and enhancement of pretrained dialogue models.

Contribution

The paper proposes a novel reverse generation technique for constructing highly inductive, category-controlled adversarial contexts, augmenting datasets and improving safety in dialogue models.

Findings

01. BAD+ dataset contains over 120K diverse contexts

02. BAD+ exposes safety issues in popular dialogue models

03. Using BAD+ improves safety of generated responses

Abstract

Large pretrained language models can easily produce toxic or biased content, which prohibits their practical use. To detect such toxic generations, existing methods rely on templates, real-world data extraction, crowdsourcing workers, or automatic generation to construct adversarial contexts that are likely to induce toxic generations. However, which types of context are more likely to induce unsafe responses remains under-explored. In this paper, we identify context toxicity and context category (e.g., profanity, insult, drugs) as two important factors that cause safety issues in response generation. Hence, we propose a method called reverse generation to construct adversarial contexts conditioned on a given response, with the flexibility to control the category, toxicity level, and inductivity of the generated contexts. Via reverse…
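The three control dimensions named in the abstract (category, toxicity level, inductivity) can be illustrated with a small selection step over candidate contexts. This is only a minimal sketch under stated assumptions, not the authors' implementation: the `Candidate` record, the `select_contexts` function, the score ranges, and the reading of inductivity as "how often a context elicits an unsafe response" are all hypothetical, and the scores are assumed to come from external classifiers that are not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A candidate context with precomputed scores (all fields hypothetical)."""
    context: str
    category: str       # e.g. "insult", "profanity", "drugs"
    toxicity: float     # assumed in [0, 1], from some toxicity scorer
    inductivity: float  # assumed in [0, 1], share of sampled responses judged unsafe


def select_contexts(
    candidates: List[Candidate],
    category: str,
    max_toxicity: float,
    min_inductivity: float,
) -> List[Candidate]:
    """Keep candidates of the requested category whose toxicity is at most
    max_toxicity and whose inductivity is at least min_inductivity,
    ordered from most to least inductive."""
    kept = [
        c for c in candidates
        if c.category == category
        and c.toxicity <= max_toxicity
        and c.inductivity >= min_inductivity
    ]
    return sorted(kept, key=lambda c: c.inductivity, reverse=True)
```

Calling `select_contexts(pool, "insult", max_toxicity=0.5, min_inductivity=0.5)` on a pool of scored candidates would keep only insult-category contexts that are themselves mildly toxic yet frequently induce unsafe responses. The thresholds are illustrative values, not figures from the paper.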

Code & Models

Repositories

thu-coai/reverse_generation
PyTorch · Official

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Software Engineering Research

Methods: Test