SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures
Megan Ung, Jing Xu, Y-Lan Boureau

TL;DR
SaFeRDialogues introduces a task and dataset for training conversational models to respond gracefully to feedback about safety failures, improving civility without losing engagingness.
Contribution
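The paper presents a dataset of 10k dialogues, each containing a safety failure, feedback signaling it, and a response acknowledging the feedback. It shows that fine-tuning on this data leads models to take safety feedback gracefully rather than reacting defensively or obliviously.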
Findings
Models fine-tuned on SaFeRDialogues produce more civil responses.
Fine-tuning does not reduce engagingness or overall conversational quality.
Human raters judge the fine-tuned model's responses considerably more likely to lead to a civil conversation.
Abstract
Current open-domain conversational models can easily be made to talk in inadequate ways. Online learning from conversational feedback given by the conversation partner is a promising avenue for a model to improve and adapt, so as to generate fewer of these safety failures. However, current state-of-the-art models tend to react to feedback with defensive or oblivious responses. This makes for an unpleasant experience and may discourage conversation partners from giving feedback in the future. This work proposes SaFeRDialogues, a task and dataset of graceful responses to conversational feedback about safety failures. We collect a dataset of 10k dialogues demonstrating safety failures, feedback signaling them, and a response acknowledging the feedback. We show how fine-tuning on this dataset results in conversations that human raters deem considerably more likely to lead to a civil…
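The three-part structure described in the abstract (a safety failure, feedback signaling it, and a response acknowledging the feedback) could be represented roughly as follows. This is a minimal sketch, not the authors' data format: the class name `FeedbackDialogue`, its field names, the helper `to_training_pair`, and the newline separator are all hypothetical, and the strings are placeholders rather than real dataset entries.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FeedbackDialogue:
    """Hypothetical record layout; field names are illustrative assumptions."""
    context: List[str]  # earlier turns, ending with the unsafe utterance
    feedback: str       # partner's message signaling the safety failure
    response: str       # graceful acknowledgement, used as the training target


def to_training_pair(d: FeedbackDialogue, sep: str = "\n") -> Tuple[str, str]:
    """Join context and feedback into the model input; the response is the label."""
    source = sep.join(d.context + [d.feedback])
    return source, d.response


# Placeholder text only; no real dataset content is reproduced here.
example = FeedbackDialogue(
    context=["<earlier turn>", "<utterance containing a safety failure>"],
    feedback="<feedback signaling the failure>",
    response="<response acknowledging the feedback>",
)
source, target = to_training_pair(example)
```

Under these assumptions, fine-tuning would condition the model on `source` and train it to produce `target`.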
Taxonomy
Topics: Topic Modeling · Multi-Agent Systems and Negotiation · Software Engineering Research
