Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions

Christos Vlachos; Nikolaos Stylianou; Alexandra Fiotaki; Spiros Methenitis; Elisavet Palogiannidi; Themos Stafylakis; Ion Androutsopoulos

arXiv:2507.04884 · cs.CL · July 8, 2025

TL;DR

This paper presents a method for automatically generating synthetic open-retrieval conversational QA datasets from plain text documents. The datasets can be used to train dialog systems that take dialog history into account and ground their responses in retrieved documents, without extensive manual annotation.

Contribution

The authors introduce a pipeline for creating realistic synthetic OR-CONVQA data from organizational documents, facilitating training of question rewriters and retrieval-based systems.

Findings

01 Synthetic dialogs improve training efficiency for question rewriters.

02 Decontextualized questions enable the use of existing retrievers.

03 Generated datasets mimic real-world human-annotated dialogs.

Abstract

We consider open-retrieval conversational question answering (OR-CONVQA), an extension of question answering where system responses need to be (i) aware of dialog history and (ii) grounded in documents (or document fragments) retrieved per question. Domain-specific OR-CONVQA training datasets are crucial for real-world applications, but hard to obtain. We propose a pipeline that capitalizes on the abundance of plain text documents in organizations (e.g., product documentation) to automatically produce realistic OR-CONVQA dialogs with annotations. Similarly to real-world human-annotated OR-CONVQA datasets, we generate in-dialog question-answer pairs, self-contained (decontextualized, e.g., no referring expressions) versions of user questions, and propositions (sentences expressing prominent information from the documents) the system responses are grounded in. We show how the synthetic…
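To make the annotation types listed in the abstract concrete, here is a minimal Python sketch of what one annotated dialog turn might look like, and why a decontextualized question can be passed to an ordinary retriever. Everything in it is an illustrative assumption rather than the authors' pipeline: the `Turn` class, its field names, the product names, and the toy token-overlap `retrieve` function (a stand-in for any existing retriever) are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Turn:
    """One annotated dialog turn (hypothetical schema, not from the paper)."""
    question: str            # in-dialog user question, may contain referring expressions
    decontextualized: str    # self-contained rewrite of the same question
    answer: str              # system response
    grounding: List[str] = field(default_factory=list)  # propositions the answer rests on


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, propositions: List[str]) -> str:
    """Toy retriever: return the proposition sharing the most tokens with the query.

    Ties are broken in favor of the earliest proposition in the list.
    """
    q = tokens(query)
    return max(propositions, key=lambda p: len(q & tokens(p)))


# Made-up propositions, standing in for sentences extracted from product documentation.
propositions = [
    "The Z5 camera has a one-year warranty.",
    "The X200 router supports WPA3 encryption.",
    "The X200 router has a two-year warranty.",
]

turn = Turn(
    question="How long is its warranty?",
    decontextualized="How long is the warranty of the X200 router?",
    answer="The X200 router has a two-year warranty.",
    grounding=[propositions[2]],
)

# The in-dialog question does not name the product, so the retriever cannot
# tell the two warranty propositions apart and falls back to the first one.
hit_in_dialog = retrieve(turn.question, propositions)

# The decontextualized question names the product and reaches the grounding proposition.
hit_rewritten = retrieve(turn.decontextualized, propositions)
```

In this toy setup, `hit_rewritten` equals the grounding proposition while `hit_in_dialog` does not, because "its" carries no information without the dialog history. A real system would replace `retrieve` with a proper sparse or dense retriever and produce `decontextualized` with a trained question rewriter.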
