DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities

Jing Yang Lee; Hamed Bonab; Nasser Zalmout; Ming Zeng; Sanket Lokegaonkar; Colin Lockard; Binxuan Huang; Ritesh Sarkhel; Haodong Wang

arXiv:2507.05750·cs.CL·July 9, 2025

DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities

Jing Yang Lee, Hamed Bonab, Nasser Zalmout, Ming Zeng, Sanket Lokegaonkar, Colin Lockard, Binxuan Huang, Ritesh Sarkhel, Haodong Wang

PDF

Open Access 1 Datasets

TL;DR

This paper introduces DocTalk, a large-scale, graph-based dialogue dataset synthesized from Wikipedia articles, which improves LLMs' multi-turn conversational abilities by up to 40% during pre-training.

Contribution

The paper presents a novel pipeline for creating extensive multi-turn dialogue data from existing texts, enhancing LLMs' conversational skills during pre-training.

Findings

01

Up to 40% improvement in context memory and understanding

02

Synthesized dialogues do not compromise base performance

03

Effective pre-training data for multi-turn capabilities

Abstract

Large Language Models (LLMs) are increasingly employed in multi-turn conversational tasks, yet their pre-training data predominantly consists of continuous prose, creating a potential mismatch between required capabilities and training paradigms. We introduce a novel approach to address this discrepancy by synthesizing conversational data from existing text corpora. We present a pipeline that transforms a cluster of multiple related documents into an extended multi-turn, multi-topic information-seeking dialogue. Applying our pipeline to Wikipedia articles, we curate DocTalk, a multi-turn pre-training dialogue corpus consisting of over 730k long conversations. We hypothesize that exposure to such synthesized conversational structures during pre-training can enhance the fundamental multi-turn capabilities of LLMs, such as context memory and understanding. Empirically, we show that…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Datasets

AmazonScience/DocTalk
dataset· 238 dl
238 dl

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Advanced Graph Neural Networks · Speech and dialogue systems