Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation
Zhexin Zhang, Jiale Cheng, Hao Sun, Jiawen Deng, Fei Mi, Yasheng Wang, Lifeng Shang, Minlie Huang

TL;DR
This paper introduces a reverse generation method to create diverse, category-controlled adversarial contexts for dialogue safety testing, significantly improving safety evaluation and enhancement of pretrained dialogue models.
Contribution
The paper proposes a novel reverse generation technique for constructing highly inductive, category-controlled adversarial contexts, augmenting datasets and improving safety in dialogue models.
Findings
BAD+ dataset contains over 120K diverse contexts
BAD+ exposes safety issues in popular dialogue models
Using BAD+ improves safety of generated responses
Abstract
Large pretrained language models can easily produce toxic or biased content, which is prohibitive for practical use. In order to detect such toxic generations, existing methods rely on templates, real-world data extraction, crowdsourcing workers, or automatic generation to construct adversarial contexts that are likely to induce toxic generations. However, what type of context is more likely to induce unsafe responses is still under-explored. In this paper, we identify that context toxicity and context category (e.g., profanity, insult, drugs, etc.) are two important factors to cause safety issues in response generation. Hence, we propose a method called reverse generation to construct adversarial contexts conditioned on a given response, with the flexibility to control category, toxicity level, and inductivity of the generated contexts. Via reverse…
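The data flow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the control-code prompt format, the function names (build_reverse_prompt, estimate_inductivity, select_inductive), and the working definition of inductivity as the fraction of sampled responses that a safety classifier flags as unsafe. The dialogue model and the classifier are passed in as plain callables so the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    context: str
    inductivity: float  # assumed: fraction of sampled responses judged unsafe


def build_reverse_prompt(response: str, category: str, toxicity: str) -> str:
    """Hypothetical input format for a reverse generator.

    The response comes first because the generator is asked to produce
    the context that would precede it; category and toxicity are passed
    as control codes.
    """
    return f"[category={category}] [toxicity={toxicity}] response: {response} context:"


def estimate_inductivity(
    context: str,
    respond: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
    n_samples: int = 4,
) -> float:
    """Sample several responses to a context and return the unsafe fraction."""
    responses = [respond(context) for _ in range(n_samples)]
    return sum(is_unsafe(r) for r in responses) / n_samples


def select_inductive(
    contexts: List[str],
    respond: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
    threshold: float = 0.5,
    n_samples: int = 4,
) -> List[Candidate]:
    """Keep only contexts whose estimated inductivity meets the threshold."""
    scored = [
        Candidate(c, estimate_inductivity(c, respond, is_unsafe, n_samples))
        for c in contexts
    ]
    return [c for c in scored if c.inductivity >= threshold]
```

In practice, respond would wrap the dialogue model under test and is_unsafe would wrap a safety classifier; the threshold and sample count are placeholder values, not figures from the paper.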
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Software Engineering Research
Methods: Test
