Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization

Yanis Labrak; David Grünert; Séverin Baroudi; Jiyun Chun; Pawel Cyrta; Sergio Burdisso; Ahmed Hassoon; David Liu; Adam Rothschild; Reed Van Deusen; Petr Motlicek; Andrew Perrault; Ricard Marxer; Thomas Schaaf

arXiv:2604.06138 · cs.SD · April 8, 2026

TL;DR

This paper introduces a synthetic data pipeline for long-form audio reasoning, focusing on doctor-patient conversations and SOAP note generation, to improve training and evaluation of audio understanding systems.
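
A SOAP note has four fixed sections: Subjective, Objective, Assessment, and Plan. Generated notes can therefore be checked for structure before they are scored for content. The sketch below assumes each section starts on its own line with a header such as `Assessment:`. That header format is hypothetical, since the paper's note format is not described here. The helper names `split_soap_note` and `missing_sections` are likewise illustrative, and the sample note is made up.

```python
import re

# The four SOAP sections, in their conventional order.
SOAP_SECTIONS = ("Subjective", "Objective", "Assessment", "Plan")

# Hypothetical header format: section name at the start of a line, then a colon.
_HEADER = re.compile(r"^(Subjective|Objective|Assessment|Plan)\s*:", re.MULTILINE)


def split_soap_note(note: str) -> dict[str, str]:
    """Split a note into its SOAP sections, keyed by section name."""
    matches = list(_HEADER.finditer(note))
    sections: dict[str, str] = {}
    for i, m in enumerate(matches):
        # A section's body runs until the next header (or the end of the note).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
        sections[m.group(1)] = note[m.end():end].strip()
    return sections


def missing_sections(note: str) -> list[str]:
    """Return SOAP sections that are absent or empty, in canonical order."""
    sections = split_soap_note(note)
    return [name for name in SOAP_SECTIONS if not sections.get(name)]


# Made-up example note with an empty Plan section.
note = """Subjective: Two days of headache.
Objective: Afebrile; neurological exam unremarkable.
Assessment: Likely tension-type headache.
Plan:"""
print(missing_sections(note))  # ['Plan']
```

A structural check like this only catches missing or empty sections. It says nothing about whether the content of each section is faithful to the conversation.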

Contribution

It presents a novel multi-stage pipeline, built on open-weight models, for generating synthetic long-context audio conversations and reference notes. The pipeline addresses both the scarcity of long-context training data and the difficulty of automatically evaluating open-ended generation.
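
The abstract names three stages: dialogue generation, audio synthesis, and reference-note production. One way to wire them together is sketched below. This is a hypothetical skeleton, not the authors' code. Several pieces are assumptions:

- the function names `generate_dialogue`, `write_reference_note` and `run_pipeline`;
- the prompts and the `DOCTOR:`/`PATIENT:` turn format;
- the choice to write the reference note from the dialogue text rather than from the audio.

`fake_llm` stands in for an open-weight model so the sketch runs offline. Stage 2 is left as a placeholder comment.

```python
from dataclasses import dataclass
from typing import Callable

# Any text-in/text-out model; in practice an open-weight LLM, here a stand-in.
Generate = Callable[[str], str]


@dataclass
class Persona:
    role: str         # "doctor" or "patient"
    description: str  # free-text persona (hypothetical: age, occupation, complaint)


@dataclass
class Sample:
    dialogue: list[tuple[str, str]]  # (speaker, utterance) pairs
    note: str                        # reference SOAP note


def generate_dialogue(doctor: Persona, patient: Persona, llm: Generate) -> list[tuple[str, str]]:
    """Stage 1: persona-conditioned dialogue generation (hypothetical prompt)."""
    prompt = (
        "Write a first-visit conversation between a doctor and a patient.\n"
        f"Doctor: {doctor.description}\n"
        f"Patient: {patient.description}\n"
        "Prefix each turn with 'DOCTOR:' or 'PATIENT:'."
    )
    turns = []
    for line in llm(prompt).splitlines():
        speaker, sep, text = line.partition(":")
        # Keep only well-formed turns; drop stray lines the model may emit.
        if sep and speaker.strip() in ("DOCTOR", "PATIENT"):
            turns.append((speaker.strip().lower(), text.strip()))
    return turns


def write_reference_note(dialogue: list[tuple[str, str]], llm: Generate) -> str:
    """Stage 3: reference SOAP note from the dialogue text (assumed input)."""
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in dialogue)
    return llm("Write a SOAP note for this visit:\n" + transcript)


def run_pipeline(doctor: Persona, patient: Persona, llm: Generate) -> Sample:
    dialogue = generate_dialogue(doctor, patient, llm)
    # Stage 2 (multi-speaker audio synthesis) would consume `dialogue` here.
    note = write_reference_note(dialogue, llm)
    return Sample(dialogue=dialogue, note=note)


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in so the sketch runs without a model."""
    if prompt.startswith("Write a SOAP note"):
        return "Subjective: ...\nObjective: ...\nAssessment: ...\nPlan: ..."
    return "DOCTOR: What brings you in today?\nPATIENT: I've had a cough for a week."


sample = run_pipeline(
    Persona("doctor", "family physician"),
    Persona("patient", "45-year-old teacher"),
    fake_llm,
)
print(sample.dialogue)
```

Passing the model in as a plain callable keeps each stage testable in isolation. It also makes it easy to swap in different open-weight models per stage.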

Findings

1. Cascaded systems outperform end-to-end models on synthetic data.
2. 8,800 synthetic conversations and 1.3k hours of audio are released.
3. The pipeline enables controlled evaluation of long-context audio reasoning.
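
The release figures allow a rough consistency check on conversation length. Let $\bar{T}$ denote the mean audio duration per conversation. Because "1.3k hours" is a rounded figure, the result below is only approximate:

```latex
\[
\bar{T} \approx \frac{1.3\times 10^{3}\,\text{h} \times 60\,\text{min/h}}{8{,}800}
        = \frac{78{,}000\,\text{min}}{8{,}800}
        \approx 8.9\,\text{min per conversation}
\]
```

That is roughly nine minutes of multi-speaker audio per visit on average. The spread of conversation lengths is not given here.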

Abstract

Long-context audio reasoning is underserved in both training data and evaluation. Existing benchmarks target short-context tasks, and the open-ended generation tasks most relevant to long-context reasoning pose well-known challenges for automatic evaluation. We propose a synthetic data generation pipeline designed to serve both as a training resource and as a controlled evaluation environment. We instantiate it for first-visit doctor-patient conversations, with SOAP note generation as the task. The pipeline is built entirely on open-weight models and has three stages:

- persona-driven dialogue generation;
- multi-speaker audio synthesis with overlap/pause modeling, room acoustics, and sound events;
- LLM-based reference SOAP note production.

We release 8,800 synthetic conversations with 1.3k hours of corresponding audio and reference notes. Evaluating current open-weight systems, we find that…
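
Multi-speaker audio synthesis needs a timeline that decides when each synthesized turn starts, including pauses and overlapping speech. The sketch below shows one simple way to build such a timeline. Its gap model, the parameters `p_overlap`, `max_overlap` and `max_pause`, and the names `layout_turns` and `PlacedTurn` are hypothetical, not taken from the paper. Turn durations are assumed to come from the TTS output.

```python
import random
from dataclasses import dataclass


@dataclass
class PlacedTurn:
    speaker: str
    start: float  # seconds from the start of the conversation
    end: float


def layout_turns(turns, p_overlap=0.2, max_overlap=0.8, max_pause=1.0, seed=0):
    """Place (speaker, duration_s) turns on a single timeline.

    Hypothetical gap model: with probability p_overlap a turn starts up to
    max_overlap s before the latest end time so far (overlapping speech still
    in progress). The overlap is capped at the previous turn's duration, so
    start times never go backwards. Otherwise the turn starts after a uniform
    pause in [0, max_pause] s.
    """
    rng = random.Random(seed)  # seeded for reproducible layouts
    placed: list[PlacedTurn] = []
    latest_end = 0.0  # latest end time of any turn placed so far
    for speaker, duration in turns:
        if not placed:
            start = 0.0
        elif rng.random() < p_overlap:
            prev = placed[-1]
            overlap = rng.uniform(0.0, min(max_overlap, prev.end - prev.start))
            start = latest_end - overlap
        else:
            start = latest_end + rng.uniform(0.0, max_pause)
        placed.append(PlacedTurn(speaker, start, start + duration))
        latest_end = max(latest_end, start + duration)
    return placed


turns = [("doctor", 3.2), ("patient", 5.0), ("doctor", 2.1), ("patient", 4.4)]
for t in layout_turns(turns, seed=42):
    print(f"{t.speaker:<8} {t.start:6.2f} -> {t.end:6.2f}")
```

Room acoustics (for example, convolving each speaker's track with a room impulse response) and sound events would be applied after the turns are rendered onto this timeline. Neither is modeled in this sketch.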

Code & Models

Datasets

BeTraC/betrac-2026