Separator-Transducer-Segmenter: Streaming Recognition and Segmentation of Multi-party Speech

Ilya Sklyar; Anna Piunova; Christian Osendorfer

arXiv:2205.05199·eess.AS·May 12, 2022

TL;DR

This paper introduces a novel streaming model for multi-party speech recognition and segmentation, integrating speech separation, recognition, and segmentation to improve accuracy and latency in multi-turn conversations.

Contribution

The work presents a separator-transducer-segmenter (STS) model that combines three ideas: segmentation modeled through start-of-turn and end-of-turn tokens, FastEmit emission regularization with multi-task training on speech activity information, and an end-of-turn emission latency penalty.

Findings

01 Achieved a 4.6% improvement in turn counting accuracy

02 Reduced word error rate by 17% on the LibriCSS dataset

03 Improved segmentation without degrading recognition accuracy
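
For readers unfamiliar with the metric behind the second finding, word error rate is commonly computed as the word-level edit distance between a reference and a hypothesis, divided by the reference length. The sketch below illustrates only that general definition; it is not the paper's evaluation pipeline, and the function name `word_error_rate` is a hypothetical choice made for this example.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(reference)


reference = "the cat sat on the mat".split()
hypothesis = "the cat sit on mat".split()
# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(word_error_rate(reference, hypothesis))
```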

Abstract

Streaming recognition and segmentation of multi-party conversations with overlapping speech is crucial for the next generation of voice assistant applications. In this work, we address the challenges identified in previous work on the multi-turn recurrent neural network transducer (MT-RNN-T) with a novel approach, the separator-transducer-segmenter (STS), which enables tighter integration of speech separation, recognition, and segmentation in a single model. First, we propose a new segmentation modeling strategy based on start-of-turn and end-of-turn tokens that improves segmentation without degrading recognition accuracy. Second, we further improve both speech recognition and segmentation accuracy through an emission regularization method, FastEmit, and multi-task training with speech activity information as an additional training signal. Third, we experiment with end-of-turn emission…
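
The idea of marking segment boundaries with start-of-turn and end-of-turn tokens can be illustrated with a small post-processing sketch. It assumes a single decoded token stream and hypothetical token strings `<sot>` and `<eot>`; the abstract does not specify the actual token names or how the model's outputs are organized, so the helper `split_turns` is an illustration rather than the authors' method.

```python
from typing import Iterable, List, Optional

SOT = "<sot>"  # hypothetical start-of-turn token string (assumption)
EOT = "<eot>"  # hypothetical end-of-turn token string (assumption)


def split_turns(tokens: Iterable[str]) -> List[List[str]]:
    """Group a decoded token stream into turns delimited by SOT and EOT."""
    turns: List[List[str]] = []
    current: Optional[List[str]] = None  # None means no turn is open
    for tok in tokens:
        if tok == SOT:
            # A new SOT while a turn is still open closes that turn first.
            if current is not None:
                turns.append(current)
            current = []
        elif tok == EOT:
            if current is not None:
                turns.append(current)
                current = None
        elif current is not None:
            current.append(tok)
        # Words seen outside any open turn are ignored in this sketch.
    if current is not None:
        turns.append(current)  # stream ended before an EOT arrived
    return turns


stream = ["<sot>", "hello", "there", "<eot>", "<sot>", "hi", "<eot>"]
turns = split_turns(stream)
print(len(turns), turns)
```

Counting the returned lists gives a turn count for the stream, which is the kind of quantity a turn counting accuracy metric would compare against a reference segmentation.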

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems