Revisiting joint decoding based multi-talker speech recognition with DNN   acoustic model

Martin Kocour; Kate\v{r}ina \v{Z}mol\'ikov\'a; Lucas Ondel; J\'an; \v{S}vec; Marc Delcroix; Tsubasa Ochiai; Luk\'a\v{s} Burget; Jan; \v{C}ernock\'y

arXiv:2111.00009·eess.AS·April 18, 2022

Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model

Martin Kocour, Kate\v{r}ina \v{Z}mol\'ikov\'a, Lucas Ondel, J\'an, \v{S}vec, Marc Delcroix, Tsubasa Ochiai, Luk\'a\v{s} Burget, Jan, \v{C}ernock\'y

PDF

Open Access 1 Repo

TL;DR

This paper proposes a joint decoding approach for multi-talker speech recognition using DNN acoustic models that predict joint state posteriors, enabling more accurate and uncertainty-aware recognition of overlapping speech signals.

Contribution

It introduces a novel joint decoding framework with DNN acoustic models predicting joint posteriors, improving multi-talker speech recognition performance over traditional separate decoding methods.

Findings

01

Joint decoding outperforms separate decoding in mixed speech recognition.

02

DNN-based joint acoustic modeling enhances modeling power and simplifies inference.

03

Proof of concept experiments demonstrate the effectiveness of the proposed approach.

Abstract

In typical multi-talker speech recognition systems, a neural network-based acoustic model predicts senone state posteriors for each speaker. These are later used by a single-talker decoder which is applied on each speaker-specific output stream separately. In this work, we argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly. We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers. We employ a joint decoder that can make use of this uncertainty together with higher-level language information. For this, we revisit decoding algorithms used in factorial generative models in early multi-talker speech recognition systems. In contrast with these early works, we replace the GMM acoustic model with DNN,…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

lucasondel/markovmodels.jl
pytorchOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing