Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR

Xuankai Chang; Niko Moritz; Takaaki Hori; Shinji Watanabe; Jonathan Le Roux

arXiv:2203.00232 · cs.SD · March 2, 2022

TL;DR

This paper introduces an extended Graph Temporal Classification method for multi-speaker end-to-end automatic speech recognition, enabling more effective modeling of speaker and label transitions, with promising experimental results.

Contribution

The paper extends GTC to model both label and transition posteriors, applying it to multi-speaker ASR to unify multi- and single-speaker modeling approaches.

Findings

01. Achieved performance close to classical benchmarks on simulated multi-speaker data.

02. Extended GTC effectively models speaker transitions and label sequences.

03. Demonstrated the applicability of GTC-e to complex multi-speaker scenarios.

Abstract

Graph-based temporal classification (GTC), a generalized form of the connectionist temporal classification loss, was recently proposed to improve automatic speech recognition (ASR) systems using graph-based supervision. For example, GTC was first used to encode an N-best list of pseudo-label sequences into a graph for semi-supervised learning. In this paper, we propose an extension of GTC to model the posteriors of both labels and label transitions by a neural network, which can be applied to a wider range of tasks. As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task. The transcriptions and speaker information of multi-speaker speech are represented by a graph, where the speaker information is associated with the transitions and ASR outputs with the nodes. Using GTC-e, multi-speaker ASR modelling becomes very similar to single-speaker…
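
The graph described in the abstract, with ASR outputs on the nodes and speaker information on the transitions, can be illustrated with a toy forward computation. This is a minimal sketch and not the paper's implementation. The graph, the names `NODE_LABEL`, `EDGE_SPEAKER`, and `graph_likelihood`, and all posterior values are hypothetical. Blank symbols are omitted. The scoring rule (one label posterior times one transition posterior per frame, summed over all valid paths) is an assumption about how the two posteriors could be combined.

```python
# Hypothetical toy supervision graph (not taken from the paper).
# Nodes carry ASR tokens; edges carry the speaker index of the token they enter.
NODE_LABEL = {1: "a", 2: "b"}          # node id -> token emitted at that node
EDGE_SPEAKER = {
    (0, 1): 0,  # start -> "a", spoken by speaker 0
    (1, 1): 0,  # self-loop: "a" spans another frame
    (1, 2): 1,  # "a" -> "b", spoken by speaker 1
    (2, 2): 1,  # self-loop: "b" spans another frame
}
START, FINAL = 0, {2}                  # non-emitting start node, accepting nodes


def graph_likelihood(label_post, trans_post):
    """Sum, over all paths of length T through the graph, the product of
    per-frame label posteriors (nodes) and transition posteriors (edges).

    label_post[t][token]   : assumed network posterior of `token` at frame t
    trans_post[t][speaker] : assumed network posterior of `speaker` at frame t
    """
    alpha = {START: 1.0}               # forward probabilities before frame 0
    for t in range(len(label_post)):
        nxt = {}
        for (src, dst), spk in EDGE_SPEAKER.items():
            if src in alpha:
                score = (alpha[src]
                         * trans_post[t][spk]
                         * label_post[t][NODE_LABEL[dst]])
                nxt[dst] = nxt.get(dst, 0.0) + score
        alpha = nxt
    return sum(p for node, p in alpha.items() if node in FINAL)


# Three frames of made-up posteriors for tokens {"a", "b"} and speakers {0, 1}.
label_post = [{"a": 0.9, "b": 0.1}, {"a": 0.5, "b": 0.5}, {"a": 0.2, "b": 0.8}]
trans_post = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]

likelihood = graph_likelihood(label_post, trans_post)  # approximately 0.2016
```

Two alignments reach the accepting node in three frames here (a, a, b and a, b, b), and the forward pass adds their probabilities. A training loss in this style would typically be the negative logarithm of that sum.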

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling