Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation
Olga Majewska, Evgeniia Razumovskaia, Edoardo Maria Ponti, Ivan Vulić, Anna Korhonen

TL;DR
This paper introduces a novel outline-based annotation method for creating large-scale, high-quality multilingual dialogue datasets, improving naturalness and cultural relevance over translation-based datasets, and benchmarks state-of-the-art systems on this new dataset.
Contribution
The paper presents a new outline-based annotation process for multilingual dialogue datasets, resulting in COD, a large-scale, culturally relevant dataset for cross-lingual dialogue modeling.
Findings
COD outperforms translation-based datasets in quality and naturalness
State-of-the-art systems achieve more realistic performance scores on COD
Outline-based annotation enhances dataset diversity and cultural specificity
Abstract
Multilingual task-oriented dialogue (ToD) facilitates access to services and information for many (communities of) speakers. Nevertheless, the potential of this technology is not fully realised, as current datasets for multilingual ToD - both for modular and end-to-end modelling - suffer from severe limitations. 1) When created from scratch, they are usually small in scale and fail to cover many possible dialogue flows. 2) Translation-based ToD datasets might lack naturalness and cultural specificity in the target language. In this work, to tackle these limitations we propose a novel outline-based annotation process for multilingual ToD datasets, where domain-specific abstract schemata of dialogue are mapped into natural language outlines. These in turn guide the target language annotators in writing a dialogue by providing instructions about each turn's intents and slots. Through this…
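As a rough illustration of the outline idea described in the abstract, the sketch below maps an abstract dialogue schema (a list of turns, each with an intent and slots) into natural language instructions that an annotator could follow when writing a dialogue. Every name here (`TurnSchema`, `TEMPLATES`, `schema_to_outline`), the intent labels, and the template wording are hypothetical and chosen only for illustration; this is not the paper's actual schema format or tooling.

```python
from dataclasses import dataclass, field


@dataclass
class TurnSchema:
    """One abstract dialogue turn (hypothetical structure)."""
    speaker: str                      # e.g. "USER" or "SYSTEM"
    intent: str                       # e.g. "INFORM" or "REQUEST"
    slots: dict = field(default_factory=dict)


# Hypothetical instruction templates, one per intent.
TEMPLATES = {
    "INFORM": "Tell the other speaker that {slots}.",
    "REQUEST": "Ask the other speaker for {slots}.",
}


def render_slots(intent, slots):
    """Turn slot names (and values, for INFORM) into a readable phrase."""
    if intent == "REQUEST":
        # Requested slots have no value yet, so only the names are listed.
        return " and ".join(name.replace("_", " ") for name in slots)
    return " and ".join(
        f"the {name.replace('_', ' ')} is {value}" for name, value in slots.items()
    )


def turn_to_outline(turn):
    """Render a single turn as one outline instruction."""
    template = TEMPLATES[turn.intent]
    return f"{turn.speaker}: " + template.format(
        slots=render_slots(turn.intent, turn.slots)
    )


def schema_to_outline(schema):
    """Map a whole dialogue schema to a list of outline instructions."""
    return [turn_to_outline(turn) for turn in schema]


schema = [
    TurnSchema("USER", "INFORM", {"cuisine": "Italian"}),
    TurnSchema("SYSTEM", "REQUEST", {"party_size": None}),
]
outline = schema_to_outline(schema)
for line in outline:
    print(line)
```

In this sketch the annotator would see one instruction per turn and write a free-form utterance in the target language that realises it, rather than translating a source-language sentence.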
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques
