DiaSynth: Synthetic Dialogue Generation Framework for Low Resource Dialogue Applications
Sathya Krishnan Suresh, Wu Mengjun, Tushar Pranav, Eng Siong Chng

TL;DR
DiaSynth is a novel framework that leverages Large Language Models and Chain of Thought reasoning to generate high-quality synthetic dialogues across various domains, addressing data scarcity in dialogue system development.
Contribution
It introduces a synthetic dialogue generation framework using LLMs and CoT reasoning. Models fine-tuned on its synthetic data outperform their base models and capture most of the performance obtained with in-domain data.
Findings
Pretrained language models fine-tuned on the synthetic data outperform their base models by 16.47% on dialogue summarization.
Synthetic data captures 90.48% of in-domain data performance.
Larger LLMs (8B) produce higher-quality synthetic dialogues.
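The two percentages above can be read as simple ratios of an evaluation score. The sketch below shows one plausible reading, not the paper's stated definition: the function names are hypothetical, and the scores are made-up placeholders chosen only to make the arithmetic easy to follow.

```python
def relative_improvement(base: float, tuned: float) -> float:
    """Percent gain of the fine-tuned score over the base-model score."""
    return (tuned - base) / base * 100.0


def fraction_captured(synthetic: float, in_domain: float) -> float:
    """Percent of the in-domain score reached by the synthetic-data model."""
    return synthetic / in_domain * 100.0


# Hypothetical scores (NOT taken from the paper), on any single scalar metric.
base_score = 20.0       # base model, no fine-tuning
synthetic_score = 25.0  # model fine-tuned on synthetic dialogues
in_domain_score = 30.0  # model fine-tuned on real in-domain dialogues

gain = relative_improvement(base_score, synthetic_score)
captured = fraction_captured(synthetic_score, in_domain_score)

print(f"gain over base: {gain:.2f}%")
print(f"in-domain performance captured: {captured:.2f}%")
```

Under this reading, the first figure compares the synthetic-data model with its own base model, while the second compares it with a model trained on real in-domain data.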
Abstract
The scarcity of domain-specific dialogue datasets limits the development of dialogue systems across applications. Existing research is constrained by general or niche datasets that lack sufficient scale for training dialogue systems. To address this gap, we introduce DiaSynth - a synthetic dialogue generation framework capable of generating high-quality, contextually rich dialogues across a wide range of domains. Unlike existing frameworks, DiaSynth uses Large Language Models (LLMs) and Chain of Thought (CoT) reasoning to generate dynamic, domain-specific dialogues with simulated personas and diverse conversational features. We perform our experiments by generating synthetic data using different LLMs and few-shot examples from DialogSum and SAMSum. The pretrained language models fine-tuned on the synthetic data outperform the base models by 16.47% on dialogue summarization, while the…
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · AI in Service Interactions
Methods: Balanced Selection
