Multi-Document Grounded Multi-Turn Synthetic Dialog Generation

Young-Suk Lee; Chulaka Gunasekara; Danish Contractor; Ramón Fernandez Astudillo; Radu Florian

arXiv:2409.11500 · cs.CL · September 19, 2024 · 2 cites

TL;DR

This paper presents a novel method for generating multi-document grounded multi-turn synthetic dialogs using taxonomy-driven queries, retriever updates, and LLM-based filtering, improving model performance on benchmark datasets.

Contribution

It introduces a comprehensive synthetic dialog generation framework that enhances training data quality for multi-document grounded dialog systems.

Findings

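01 Human evaluation suggests the synthetic dialogs are diverse, coherent, and mostly correct.

02 On answerable queries, models fine-tuned on the synthetic dialogs outperform those fine-tuned on existing human-generated training data.

03 The gains hold across the four publicly available benchmark datasets evaluated.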

Abstract

We introduce a technique for multi-document grounded multi-turn synthetic dialog generation that incorporates three main ideas. First, we control the overall dialog flow using taxonomy-driven user queries that are generated with Chain-of-Thought (CoT) prompting. Second, we support the generation of multi-document grounded dialogs by mimicking real-world use of retrievers to update the grounding documents after every user turn in the dialog. Third, we apply LLM-as-a-Judge to filter out queries with incorrect answers. Human evaluation of the synthetic dialog data suggests that the data is diverse, coherent, and includes mostly correct answers. Both human and automatic evaluations of answerable queries indicate that models fine-tuned on synthetic dialogs consistently outperform those fine-tuned on existing human-generated training data across four publicly available multi-turn document…
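
The three ideas in the abstract can be read as a single generation loop. The following is a minimal sketch of that loop, not the authors' implementation: the names `Turn`, `retrieve`, `generate_dialog`, and the `stub_*` functions are all hypothetical, the word-overlap retriever is a toy stand-in for a real retriever, and the stubs stand in for the CoT-prompted query generator, the answer generator, and the LLM judge. Dropping a rejected turn from the dialog history is likewise an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    query_type: str       # taxonomy label that drove this user query
    query: str
    documents: List[str]  # grounding passages retrieved for this turn
    answer: str


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever (assumption): rank passages by word overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def generate_dialog(
    corpus: List[str],
    query_types: List[str],
    gen_query: Callable[[str, List[Turn]], str],
    gen_answer: Callable[[str, List[str], List[Turn]], str],
    judge: Callable[[str, List[str], str], bool],
    k: int = 2,
) -> List[Turn]:
    """Build one dialog: one user turn per taxonomy label, filtered by the judge."""
    history: List[Turn] = []
    for query_type in query_types:
        # 1. Taxonomy-driven user query (an LLM with CoT prompting in the paper).
        query = gen_query(query_type, history)
        # 2. Grounding documents are refreshed after every user turn.
        documents = retrieve(query, corpus, k)
        answer = gen_answer(query, documents, history)
        # 3. Judge filter: keep the turn only if the answer is accepted.
        if judge(query, documents, answer):
            history.append(Turn(query_type, query, documents, answer))
    return history


# Stubs standing in for LLM calls (hypothetical, for illustration only).
def stub_gen_query(query_type: str, history: List[Turn]) -> str:
    return f"{query_type} question about retrievers"


def stub_gen_answer(query: str, documents: List[str], history: List[Turn]) -> str:
    return documents[0]


def stub_judge(query: str, documents: List[str], answer: str) -> bool:
    return bool(answer) and answer in documents
```

Passing the generator, answerer, and judge as callables keeps the loop independent of any particular model API, so each stub could be swapped for a real LLM call without changing the control flow.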

Code & Models

Repositories

ibm/mt-rag-benchmark

Models

ibm-granite/granite-3.2-8b-lora-rag-citation-generation (model · 11 downloads · 4 likes)

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems