Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR
Xuankai Chang, Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux

TL;DR
This paper introduces an extended Graph Temporal Classification method for multi-speaker end-to-end automatic speech recognition, enabling more effective modeling of speaker and label transitions, with promising experimental results.
Contribution
The paper extends GTC to model both label and transition posteriors, applying it to multi-speaker ASR to unify multi- and single-speaker modeling approaches.
Findings
Achieved performance close to classical benchmarks on simulated multi-speaker data.
Extended GTC effectively models speaker transitions and label sequences.
Demonstrated the applicability of GTC-e to complex multi-speaker scenarios.
Abstract
Graph-based temporal classification (GTC), a generalized form of the connectionist temporal classification loss, was recently proposed to improve automatic speech recognition (ASR) systems using graph-based supervision. For example, GTC was first used to encode an N-best list of pseudo-label sequences into a graph for semi-supervised learning. In this paper, we propose an extension of GTC to model the posteriors of both labels and label transitions by a neural network, which can be applied to a wider range of tasks. As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task. The transcriptions and speaker information of multi-speaker speech are represented by a graph, where the speaker information is associated with the transitions and ASR outputs with the nodes. Using GTC-e, multi-speaker ASR modelling becomes very similar to single-speaker…
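The idea of scoring a supervision graph with both label and transition posteriors can be illustrated with a small forward recursion. What follows is a minimal sketch under stated assumptions, not the authors' implementation: the function name `gtc_e_forward`, the graph encoding (a node list, an edge list, a `starts` map), and the choice to apply a transition posterior when entering the graph at the first frame are all hypothetical. It works in the probability domain for readability, whereas a practical loss would work in the log domain, and it omits blank handling.

```python
from collections import defaultdict


def gtc_e_forward(nodes, edges, starts, finals, label_post, trans_post):
    """Sum the probability of all length-T paths through a labelled graph.

    nodes:      list of label ids, one per graph node (ASR outputs)
    edges:      list of (src, dst, transition_id); self-loops are explicit
    starts:     dict node -> transition_id used to enter the graph at t = 0
                (an assumption of this sketch)
    finals:     set of nodes at which a path may end
    label_post: label_post[t][k] = posterior of label k at frame t
    trans_post: trans_post[t][i] = posterior of transition i at frame t
                (e.g. a speaker index attached to the edge)
    """
    num_frames = len(label_post)

    # Index edges by destination so each node can sum over its predecessors.
    incoming = defaultdict(list)
    for src, dst, tr in edges:
        incoming[dst].append((src, tr))

    # alpha[n] = total probability of all partial paths ending in node n.
    alpha = [0.0] * len(nodes)
    for n, tr in starts.items():
        alpha[n] = trans_post[0][tr] * label_post[0][nodes[n]]

    for t in range(1, num_frames):
        new_alpha = [0.0] * len(nodes)
        for dst in range(len(nodes)):
            reach = sum(alpha[src] * trans_post[t][tr] for src, tr in incoming[dst])
            new_alpha[dst] = reach * label_post[t][nodes[dst]]
        alpha = new_alpha

    return sum(alpha[n] for n in finals)


# Toy graph: node 0 emits label 0, node 1 emits label 1.
toy_nodes = [0, 1]
toy_edges = [(0, 0, 0), (0, 1, 1), (1, 1, 1)]  # (src, dst, transition_id)
uniform = [[0.5, 0.5]] * 3                     # 3 frames, 2 classes each
print(gtc_e_forward(toy_nodes, toy_edges, {0: 0}, {1}, uniform, uniform))
# prints 0.03125 (two valid paths, each with probability 0.5 ** 6)
```

In this toy setup the only difference from a plain graph-based forward pass is the extra `trans_post` factor on every edge, which is where speaker information would enter.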
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling
