Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data

Takafumi Moriya; Hiroshi Sato; Tsubasa Ochiai; Marc Delcroix; Takanori Ashihara; Kohei Matsuura; Tomohiro Tanaka; Ryo Masumura; Atsunori Ogawa; Taichi Asami

arXiv:2305.15971 · eess.AS · May 26, 2023 · Interspeech · 1 cites

Open Access

TL;DR

This paper introduces a knowledge distillation method for neural transducer-based target-speaker ASR that leverages parallel mixture and single-talker speech data, improving recognition accuracy without additional computational costs.

Contribution

It proposes a KD scheme in which an RNNT pretrained on the target single-talker speech generates pseudo labels for TS-RNNT training, so the single-talker signals used to simulate the training mixtures also contribute to training.

Findings

01 KD scheme improves TS-RNNT performance

02 Outperforms baseline TS-RNNT models

03 Utilizes parallel mixture and single-talker data effectively

Abstract

Neural transducer (RNNT)-based target-speaker speech recognition (TS-RNNT) directly transcribes a target speaker's voice from a multi-talker mixture. It is a promising approach for streaming applications because it does not incur the extra computation costs of a target speech extraction frontend, which is a critical barrier to quick response. TS-RNNT is trained end-to-end given the input speech (i.e., mixtures and enrollment speech) and reference transcriptions. The training mixtures are generally simulated by mixing single-talker signals, but conventional TS-RNNT training does not utilize single-speaker signals. This paper proposes using knowledge distillation (KD) to exploit the parallel mixture/single-talker speech data. Our proposed KD scheme uses an RNNT system pretrained with the target single-talker speech input to generate pseudo labels for the TS-RNNT training. Experimental…
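The distillation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the form of the loss, so the sketch assumes the pseudo labels are the teacher's output posteriors, that the KD term is a KL divergence between teacher and student posteriors at every (frame, label-position) node, and that it is interpolated with the usual hard-label loss. The names `kd_loss`, `combined_loss`, and `kd_weight` are hypothetical, the logits are random stand-ins for real model outputs, and the teacher and student are assumed to see time-aligned inputs because the mixture and the single-talker signal are parallel.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def kd_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over all (frame, label-position) nodes.

    Both arrays have shape (T, U, V): frames, label positions, vocabulary.
    The teacher is treated as fixed, so only the student would be updated.
    """
    teacher_logp = log_softmax(teacher_logits)
    student_logp = log_softmax(student_logits)
    kl = (np.exp(teacher_logp) * (teacher_logp - student_logp)).sum(axis=-1)
    return float(kl.mean())


def combined_loss(hard_loss, teacher_logits, student_logits, kd_weight=0.5):
    """Interpolate a hard-label loss with the KD term (weighting is assumed)."""
    kd = kd_loss(teacher_logits, student_logits)
    return (1.0 - kd_weight) * hard_loss + kd_weight * kd


# Toy stand-ins for model outputs (random, for illustration only).
rng = np.random.default_rng(0)
T, U, V = 4, 3, 5
# Teacher: RNNT fed the target speaker's single-talker signal.
teacher_logits = rng.normal(size=(T, U, V))
# Student: TS-RNNT fed the mixture plus enrollment speech.
student_logits = rng.normal(size=(T, U, V))
hard_loss = 2.0  # placeholder for the transducer loss on reference text

print(combined_loss(hard_loss, teacher_logits, student_logits))
```

Because the teacher is only used during training, a scheme of this shape leaves the student's inference cost unchanged, which matches the abstract's emphasis on avoiding extra computation at recognition time.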

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Neural Networks and Applications

Methods: Knowledge Distillation