Alignment-Free Training for Transducer-based Multi-Talker ASR

Takafumi Moriya; Shota Horiguchi; Marc Delcroix; Ryo Masumura; Takanori Ashihara; Hiroshi Sato; Kohei Matsuura; Masato Mimura

arXiv:2409.20301·eess.AS·October 1, 2024

TL;DR

This paper introduces an alignment-free training method for multi-talker speech recognition using RNN Transducer models, simplifying training while maintaining competitive performance.

Contribution

It proposes a novel alignment-free training scheme for MT-RNNT that uses prompt tokens to indicate speakers, eliminating the need for complex label alignments or multiple encoders.
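
The prompt-token idea can be illustrated with a minimal sketch. This is not the authors' implementation: the token format `<spk1>`, `<spk2>`, the helper names `speaker_prompt` and `build_targets`, and the choice to prepend the token to each speaker's transcription are all assumptions made for illustration. The speaker order is taken as given by the caller.

```python
from typing import List


def speaker_prompt(index: int) -> str:
    """Return a hypothetical prompt token for the index-th speaker (1-based)."""
    return f"<spk{index}>"


def build_targets(transcripts: List[List[str]]) -> List[List[str]]:
    """Build one target sequence per speaker, with no word-level timestamps.

    Assumption: each target is the speaker's prompt token followed by that
    speaker's tokens. The input list order defines the speaker indices.
    """
    return [
        [speaker_prompt(i + 1)] + list(tokens)
        for i, tokens in enumerate(transcripts)
    ]


# Two overlapping speakers; only their transcriptions are needed.
targets = build_targets([["hello", "world"], ["good", "morning"]])
for target in targets:
    print(" ".join(target))
```

In this sketch each target sequence depends only on a speaker's own transcription, which is why no timestamps from an external ASR system are needed to build it.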

Findings

01. Achieves performance comparable to state-of-the-art methods.

02. Simplifies the training process by removing the need for accurate alignments.

03. Uses only one encoder pass to recognize all speakers.

Abstract

Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for wider automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation. MT-RNNT is conventionally implemented using architectures with multiple encoders or decoders, or by serializing all speakers' transcriptions into a single output stream. The first approach is computationally expensive, particularly due to the need for multiple encoder processing. In contrast, the second approach involves a complex label generation process, requiring accurate timestamps of all words spoken by all speakers in the mixture, obtained from an external ASR system. In this paper, we propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture. The target labels are created…

Taxonomy

Topics: Ultrasonics and Acoustic Wave Propagation · Fault Detection and Control Systems · Speech Recognition and Synthesis