Alignment-Free Training for Transducer-based Multi-Talker ASR
Takafumi Moriya, Shota Horiguchi, Marc Delcroix, Ryo Masumura, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, Masato Mimura

TL;DR
This paper introduces an alignment-free training method for multi-talker speech recognition using RNN Transducer models, simplifying training while maintaining competitive performance.
Contribution
It proposes a novel alignment-free training scheme for MT-RNNT that uses prompt tokens to indicate speakers, eliminating the need for complex label alignments or multiple encoders.
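As a rough illustration of the prompt-token idea, the sketch below builds one target sequence per speaker and prepends a speaker prompt token to each. The token names (`<spk1>`, `<spk2>`), the function name `build_prompted_targets`, and the rule of numbering speakers by when they first speak are all assumptions made for this example, not details confirmed by the summary above. The sketch needs only one onset time per speaker, with no word-level timestamps.

```python
from typing import Dict, List


def build_prompted_targets(
    transcripts: Dict[str, List[str]],
    onsets: Dict[str, float],
) -> List[List[str]]:
    """Return one target token sequence per speaker, each led by a prompt token.

    transcripts: speaker id -> list of transcript tokens (hypothetical format).
    onsets: speaker id -> time in seconds at which that speaker first talks.
    """
    # Assumption: speakers are numbered by order of first appearance.
    ordered = sorted(transcripts, key=lambda spk: onsets[spk])

    targets = []
    for idx, spk in enumerate(ordered, start=1):
        prompt = f"<spk{idx}>"  # hypothetical prompt token name
        targets.append([prompt] + transcripts[spk])
    return targets


if __name__ == "__main__":
    transcripts = {"a": ["hello", "world"], "b": ["good", "morning"]}
    onsets = {"a": 1.2, "b": 0.0}
    for target in build_prompted_targets(transcripts, onsets):
        print(" ".join(target))
```

Under this scheme, a single standard transducer could be trained on each prompted sequence in turn, with the prompt token telling the decoder which speaker to transcribe.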
Findings
Achieves comparable performance to state-of-the-art methods.
Simplifies training process by removing the need for accurate alignments.
Uses only one encoder pass for recognizing all speakers.
Abstract
Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for wider automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation. MT-RNNT is conventionally implemented using architectures with multiple encoders or decoders, or by serializing all speakers' transcriptions into a single output stream. The first approach is computationally expensive, particularly due to the need for multiple encoder processing. In contrast, the second approach involves a complex label generation process, requiring accurate timestamps of all words spoken by all speakers in the mixture, obtained from an external ASR system. In this paper, we propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture. The target labels are created…
Taxonomy
Topics: Ultrasonics and Acoustic Wave Propagation · Fault Detection and Control Systems · Speech Recognition and Synthesis
