Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation
Ryo Masumura, Daiki Okamura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi

TL;DR
This paper introduces a unified autoregressive model that simultaneously performs multi-talker overlapped speech recognition and estimates speaker attributes like gender and age, improving performance by integrating speaker information.
Contribution
It proposes a novel transformer-based autoregressive approach that jointly models speech transcription and speaker attributes in an end-to-end manner, addressing the inability of conventional serialized-output models to take speaker attributes such as gender and age into account explicitly.
Findings
Improved recognition accuracy in overlapped speech scenarios.
Effective joint modeling of speech and speaker attributes.
Demonstrated benefits on Japanese multi-talker ASR tasks.
Abstract
In this paper, we present a novel modeling method for single-channel multi-talker overlapped automatic speech recognition (ASR) systems. Fully neural network based end-to-end models have dramatically improved the performance of multi-talker overlapped ASR tasks. One promising approach for end-to-end modeling is autoregressive modeling with serialized output training in which transcriptions of multiple speakers are recursively generated one after another. This enables us to naturally capture relationships between speakers. However, the conventional modeling method cannot explicitly take into account the speaker attributes of individual utterances such as gender and age information. In fact, the performance deteriorates when each speaker is the same gender or is close in age. To address this problem, we propose unified autoregressive modeling for joint end-to-end multi-talker overlapped…
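To illustrate the serialized-output idea described in the abstract, the following is a minimal sketch of how one target token sequence might be built from several overlapped utterances, with speaker-attribute tokens placed before each transcription. It is not the authors' implementation: the token names (`<sc>`, `<eos>`, `<gender:…>`, `<age:…>`), the `Utterance` type, the ordering by start time, and the placement of attributes ahead of the words are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

SC = "<sc>"    # hypothetical speaker-change token
EOS = "<eos>"  # hypothetical end-of-sequence token


@dataclass
class Utterance:
    start: float      # start time in seconds, used only for ordering
    gender: str       # e.g. "female"
    age: str          # e.g. "20s"
    words: List[str]  # transcription tokens


def serialize(utts: List[Utterance], with_attributes: bool = True) -> List[str]:
    """Flatten overlapped utterances into a single target token sequence.

    Speakers are emitted one after another (ordering by start time is an
    assumption), separated by a speaker-change token. When with_attributes
    is True, each speaker's gender and age tokens precede that speaker's words.
    """
    tokens: List[str] = []
    ordered = sorted(utts, key=lambda item: item.start)
    for index, utt in enumerate(ordered):
        if index > 0:
            tokens.append(SC)
        if with_attributes:
            tokens += [f"<gender:{utt.gender}>", f"<age:{utt.age}>"]
        tokens += utt.words
    tokens.append(EOS)
    return tokens


if __name__ == "__main__":
    mixture = [
        Utterance(1.2, "male", "40s", ["good", "morning"]),
        Utterance(0.0, "female", "20s", ["hello"]),
    ]
    print(serialize(mixture))
    print(serialize(mixture, with_attributes=False))
```

With `with_attributes=False` the sketch reduces to a plain serialized transcription target, which corresponds to the conventional setup the abstract contrasts against. An autoregressive decoder trained on the attribute-augmented sequence would condition each transcription on the attribute tokens generated before it.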
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
