A Purely End-to-end System for Multi-speaker Speech Recognition
Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey

TL;DR
This paper introduces an end-to-end sequence-to-sequence system for multi-speaker speech recognition that unifies source separation and recognition, achieving significant improvements without needing additional training data.
Contribution
It presents a novel unified framework and a new objective function for directly decoding multiple speaker sequences from speech mixtures, eliminating the need for isolated source signals.
Findings
Achieves an 83.1% relative improvement over the baseline model.
Comparable performance to previous explicit separation-recognition systems.
Effectively learns from speech mixtures without extra training data.
Abstract
Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained…
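The abstract mentions an objective that improves the contrast between hidden vectors so that the model avoids generating similar hypotheses. As a rough illustration only, and not necessarily the authors' exact formulation, one way to picture such a term is a negative symmetric KL divergence between softmax-normalized hidden vectors of two speaker streams. The function names, the `weight` parameter, and the choice of symmetric KL are assumptions made for this sketch.

```python
import math

def softmax(v):
    """Normalize a real-valued vector into a probability distribution."""
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def contrast_penalty(h1, h2, weight=0.1):
    """Hypothetical contrast term: negative symmetric KL, summed over frames.

    h1, h2: equal-length lists; each element is one frame's hidden vector
    for stream 1 / stream 2. Minimizing the returned value pushes the two
    streams apart. `weight` is an assumed scaling factor, not a value
    taken from the paper.
    """
    total = 0.0
    for a, b in zip(h1, h2):
        p, q = softmax(a), softmax(b)
        total += kl(p, q) + kl(q, p)
    return -weight * total

# Identical streams give no contrast; differing streams give a negative penalty.
same = contrast_penalty([[1.0, 0.0]], [[1.0, 0.0]])
diff = contrast_penalty([[1.0, 0.0]], [[0.0, 1.0]])
```

In a training loop, a term like this would typically be added to the main recognition loss, so that lowering the total loss also rewards hidden representations that differ between the two output streams.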
