Multi-Document Grounded Multi-Turn Synthetic Dialog Generation
Young-Suk Lee, Chulaka Gunasekara, Danish Contractor, Ramón Fernandez Astudillo, Radu Florian

TL;DR
This paper presents a novel method for generating multi-document grounded multi-turn synthetic dialogs using taxonomy-driven queries, retriever updates, and LLM-based filtering, improving model performance on benchmark datasets.
Contribution
It introduces a comprehensive synthetic dialog generation framework that enhances training data quality for multi-document grounded dialog systems.
Findings
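Human evaluation suggests the synthetic dialogs are diverse, coherent, and mostly correct in their answers.
On answerable queries, models fine-tuned on the synthetic dialogs consistently outperform models fine-tuned on existing human-generated training data.
The improvement holds in both human and automatic evaluations across four publicly available benchmarks.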
Abstract
We introduce a technique for multi-document grounded multi-turn synthetic dialog generation that incorporates three main ideas. First, we control the overall dialog flow using taxonomy-driven user queries that are generated with Chain-of-Thought (CoT) prompting. Second, we support the generation of multi-document grounded dialogs by mimicking real-world use of retrievers to update the grounding documents after every user turn in the dialog. Third, we apply LLM-as-a-Judge to filter out queries with incorrect answers. Human evaluation of the synthetic dialog data suggests that the data is diverse, coherent, and includes mostly correct answers. Both human and automatic evaluations of answerable queries indicate that models fine-tuned on synthetic dialogs consistently outperform those fine-tuned on existing human-generated training data across four publicly available multi-turn document…
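The three ideas in the abstract can be pictured as a single generation loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the taxonomy labels, the word-overlap retriever, the seeding of the first turn, and the stub_query, stub_answer, and stub_judge functions are all hypothetical stand-ins for the CoT-prompted query generator, the real retriever, the answer generator, and the LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical query-type taxonomy; the paper's actual taxonomy is not shown here.
TAXONOMY = ["factoid", "follow_up", "comparison"]


@dataclass
class Turn:
    query_type: str
    query: str
    documents: List[str]
    answer: str


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever (assumption): rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def generate_dialog(
    corpus: List[str],
    query_fn: Callable[[str, List[Turn], List[str]], str],
    answer_fn: Callable[[str, List[str]], str],
    judge_fn: Callable[[str, List[str], str], bool],
    num_turns: int = 3,
) -> List[Turn]:
    history: List[Turn] = []
    documents = corpus[:1]  # assumption: seed the first query with one document
    for i in range(num_turns):
        # 1. The taxonomy decides what kind of user query comes next.
        query_type = TAXONOMY[i % len(TAXONOMY)]
        query = query_fn(query_type, history, documents)
        # 2. Grounding documents are refreshed after every user turn.
        documents = retrieve(query, corpus)
        answer = answer_fn(query, documents)
        # 3. A judge filters out turns whose answer it rejects.
        if judge_fn(query, documents, answer):
            history.append(Turn(query_type, query, documents, answer))
    return history


# Deterministic stubs standing in for LLM calls (all hypothetical).
def stub_query(query_type: str, history: List[Turn], documents: List[str]) -> str:
    topic = documents[0].split()[0]
    return f"{query_type} question about {topic}"


def stub_answer(query: str, documents: List[str]) -> str:
    return documents[0]


def stub_judge(query: str, documents: List[str], answer: str) -> bool:
    return answer in documents


CORPUS = [
    "Retrievers rank passages for a query.",
    "Dialogs contain several user turns.",
    "Judges score answers for correctness.",
]

dialog = generate_dialog(CORPUS, stub_query, stub_answer, stub_judge)
for turn in dialog:
    print(turn.query_type, "|", turn.query, "|", len(turn.documents), "docs")
```

In this sketch a rejected turn is simply dropped from the history; whether the original pipeline discards the turn, regenerates it, or discards the whole dialog is not stated in the abstract.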
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems
