Multi-Utterance Speech Separation and Association Trained on Short Segments

Yuzhu Wang; Archontis Politis; Konstantinos Drossos; Tuomas Virtanen

arXiv:2507.02562·eess.AS·July 4, 2025

TL;DR

This paper introduces a frequency-temporal recurrent neural network (FTRNN) that separates and associates multiple speakers in long recordings. It is trained on short 10 s segments but processes much longer audio at inference without splitting it into segments.

Contribution

The proposed FTRNN model bridges the gap between training on short segments and processing long recordings, enabling robust multi-utterance speech separation and speaker association.

Findings

01

Trained on 10 s segments, FTRNN generalizes to much longer recordings (21-121 s)

02

Model maintains speaker association across gaps exceeding training conditions

03

Lightweight model (0.9 M parameters) performs inference without segmentation

Abstract

Current deep neural network (DNN) based speech separation faces a fundamental challenge: while the models need to be trained on short segments due to computational constraints, real-world applications typically require processing recordings that are significantly longer than those seen during training and that contain multiple utterances per speaker. In this paper, we investigate how existing approaches perform in this challenging scenario and propose a frequency-temporal recurrent neural network (FTRNN) that effectively bridges this gap. Our FTRNN employs a full-band module to model frequency dependencies within each time frame and a sub-band module that models temporal patterns in each frequency band. Despite being trained on short fixed-length segments of 10 s, our model demonstrates robust separation when processing signals significantly longer than training segments (21-121 s) and preserves speaker…
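
The full-band/sub-band split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the scalar tanh recurrence, its fixed weights (`w_in`, `w_rec`), the unidirectional scan, and the helper names (`scan_rnn`, `full_band`, `sub_band`, `ft_block`, `toy_spec`) are all assumptions made for illustration. It only shows which axis each module scans, and why a recurrence along time can run on inputs of any length without segmentation.

```python
import math

def scan_rnn(seq, w_in=0.5, w_rec=0.9):
    """Scalar tanh recurrence; a hypothetical stand-in for a learned RNN layer."""
    h, out = 0.0, []
    for x in seq:
        h = math.tanh(w_in * x + w_rec * h)
        out.append(h)
    return out

def full_band(spec):
    """Scan along frequency, independently for each time frame."""
    return [scan_rnn(frame) for frame in spec]

def sub_band(spec):
    """Scan along time, independently for each frequency band."""
    n_frames, n_bands = len(spec), len(spec[0])
    bands = [scan_rnn([spec[t][f] for t in range(n_frames)])
             for f in range(n_bands)]
    # Transpose back to the [time][frequency] layout.
    return [[bands[f][t] for f in range(n_bands)] for t in range(n_frames)]

def ft_block(spec):
    """One full-band pass followed by one sub-band pass."""
    return sub_band(full_band(spec))

def toy_spec(n_frames, n_bands=4):
    """Synthetic [time][frequency] input; values are arbitrary."""
    return [[math.sin(0.3 * t + f) for f in range(n_bands)]
            for t in range(n_frames)]

# The same block runs on a short and a much longer input, with no chunking.
short_out = ft_block(toy_spec(10))
long_out = ft_block(toy_spec(121))
print(len(short_out), len(long_out), len(long_out[0]))
```

Because the toy recurrence is causal along time, the first 10 output frames of the long input match the output for the short input exactly; nothing in the block depends on the total number of frames. Whether the paper's modules are unidirectional or bidirectional is not stated in the text above.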
