CoPrUS: Consistency Preserving Utterance Synthesis towards more realistic benchmark dialogues

Sebastian Steindl; Ulrich Schäfer; Bernd Ludwig

arXiv:2412.07515·cs.CL·December 11, 2024

TL;DR

This paper introduces CoPrUS, a method using large language models to generate realistic miscommunications in dialogue datasets, enhancing their diversity and realism for training more robust dialogue systems.

Contribution

It proposes a novel two-step LLM-based pipeline to create and repair miscommunications in dialogue datasets, addressing a gap in existing benchmark data.

Findings

01. LLMs can effectively generate realistic miscommunications.

02. The augmented dataset improves dialogue system robustness.

03. Nearly 1,900 dialogues were modified and published as CoPrUS-MultiWOZ.

Abstract

Large-scale Wizard-Of-Oz dialogue datasets have enabled the training of deep learning-based dialogue systems. While they are successful as benchmark datasets, they lack certain types of utterances, which would make them more realistic. In this work, we investigate the creation of synthetic communication errors in an automatic pipeline. Based on linguistic theory, we propose and follow a simple error taxonomy. We focus on three types of miscommunications that could happen in real-world dialogues but are underrepresented in the benchmark dataset: misunderstandings, non-understandings and vaguely related questions. Our two-step approach uses a state-of-the-art Large Language Model (LLM) to first create the error and secondly the repairing utterance. We perform Language Model-based evaluation to ensure the quality of the generated utterances. We apply the method to the MultiWOZ dataset and…
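
The two-step approach described in the abstract (first generate the error, then generate the repairing utterance) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the prompt wording, the turn format, and the choice of which speaker makes the error are all assumptions made for illustration, and the `llm` argument is a placeholder for any text-in, text-out model call. The Language Model-based evaluation step is not covered here.

```python
from typing import Callable, Dict, List

# Hypothetical turn format: {"speaker": "user" | "system", "text": "..."}
Turn = Dict[str, str]

# The three miscommunication types named in the abstract.
ERROR_TYPES = ("misunderstanding", "non-understanding", "vaguely related question")


def format_history(turns: List[Turn]) -> str:
    """Render the dialogue turns as plain text for a prompt."""
    return "\n".join(f"{t['speaker']}: {t['text']}" for t in turns)


def add_miscommunication(
    dialogue: List[Turn],
    index: int,
    error_type: str,
    llm: Callable[[str], str],
    error_speaker: str = "system",  # assumption: who produces the error
    repair_speaker: str = "user",   # assumption: who repairs it
) -> List[Turn]:
    """Insert an error turn and a repair turn after dialogue[index].

    All original turns are kept unchanged, so the rest of the
    dialogue stays as it was in the source data.
    """
    if error_type not in ERROR_TYPES:
        raise ValueError(f"unknown error type: {error_type}")
    if not 0 <= index < len(dialogue):
        raise IndexError("index out of range")

    history = format_history(dialogue[: index + 1])

    # Step 1: ask the model for an utterance containing the error.
    error_text = llm(
        f"Continue the dialogue with a {error_speaker} turn "
        f"that contains a {error_type}.\n{history}"
    )

    # Step 2: ask the model for the utterance that repairs it.
    repair_text = llm(
        f"Write a {repair_speaker} turn that repairs the miscommunication.\n"
        f"{history}\n{error_speaker}: {error_text}"
    )

    new_turns = [
        {"speaker": error_speaker, "text": error_text},
        {"speaker": repair_speaker, "text": repair_text},
    ]
    return dialogue[: index + 1] + new_turns + dialogue[index + 1 :]


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    if prompt.startswith("Write"):
        return "I meant a cheap hotel in the centre."
    return "Sorry, I did not understand that."


dialogue = [
    {"speaker": "user", "text": "I need a cheap hotel in the centre."},
    {"speaker": "system", "text": "I found three options for you."},
]
augmented = add_miscommunication(dialogue, 0, "non-understanding", stub_llm)
```

With the stub model, `augmented` contains the two original turns plus one error turn and one repair turn inserted after the first user utterance. In practice `stub_llm` would be replaced by a call to an actual LLM.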

Code & Models

Repositories

sebastian-steindl/CoPrUS_data
none · Official

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques · Topic Modeling