DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio

Wataru Nakata; Yuki Saito; Kazuki Yamauchi; Emiru Tsunoo; Hiroshi Saruwatari

arXiv:2604.09344 · cs.SD · April 14, 2026

TL;DR

DialogueSidon jointly restores and separates degraded monaural two-speaker dialogue audio into full-duplex (one track per speaker) recordings, improving intelligibility and separation quality while running faster than baseline methods.

Contribution

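A joint restoration and separation approach for in-the-wild dialogue audio: a VAE compresses speech SSL model features into a compact latent space, and a diffusion-based latent predictor recovers speaker-wise latents from the degraded mixture.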

Findings

01 Substantially improves intelligibility and separation quality.

02 Achieves faster inference than baseline methods.

03 Effective across English, multilingual, and in-the-wild datasets.

Abstract

Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, making it unsuitable for systems requiring clean speaker-wise signals. We propose DialogueSidon, a model for joint restoration and separation of degraded monaural two-speaker dialogue audio. DialogueSidon combines a variational autoencoder (VAE) that operates on speech self-supervised learning (SSL) model features, compressing them into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from the degraded mixture. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon substantially improves intelligibility and…
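
The data flow described in the abstract can be illustrated at the shape level. The following is a toy sketch only, not the authors' implementation: every name, dimension, and weight here is hypothetical, the random matrices stand in for trained networks, and the simple interpolation loop is a placeholder for real diffusion sampling. Waveform synthesis from the recovered features is not shown.

```python
import random

SSL_DIM = 8       # hypothetical SSL feature size (real models are far larger)
LATENT_DIM = 2    # hypothetical compact VAE latent size
NUM_SPEAKERS = 2  # the abstract targets two-speaker dialogue
NUM_STEPS = 4     # hypothetical number of iterative refinement steps


def make_matrix(rows, cols, rng):
    """Random matrix standing in for trained weights (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def apply(matrix, vec):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def predict_latents(mix_frames, cond_weights, rng):
    """Placeholder latent predictor: start each latent frame from Gaussian
    noise and move it step by step toward a mixture-conditioned estimate.
    This is NOT a real diffusion sampler, only a stand-in for one."""
    tracks = []
    for spk in range(NUM_SPEAKERS):
        frames = []
        for frame in mix_frames:
            z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
            target = apply(cond_weights[spk], frame)
            for step in range(NUM_STEPS):
                frac = 1.0 / (NUM_STEPS - step)
                z = [zi + frac * (ti - zi) for zi, ti in zip(z, target)]
            frames.append(z)
        tracks.append(frames)
    return tracks


def decode(latent_tracks, dec_weights):
    """Placeholder VAE decoder: map latent frames back to SSL-feature size."""
    return [[apply(dec_weights, z) for z in track] for track in latent_tracks]


def separate(mix_frames, seed=0):
    """Mixture SSL features in, per-speaker latents and features out."""
    rng = random.Random(seed)
    cond = [make_matrix(LATENT_DIM, SSL_DIM, rng) for _ in range(NUM_SPEAKERS)]
    dec = make_matrix(SSL_DIM, LATENT_DIM, rng)
    latents = predict_latents(mix_frames, cond, rng)
    return latents, decode(latents, dec)
```

Calling `separate` on a list of mixture feature frames returns one latent sequence and one decoded feature sequence per speaker, each with the same number of frames as the input. The point of the sketch is the ordering of stages (predict speaker-wise latents from the mixture, then decode), not the numerical behaviour.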
