Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio

Yang Zhang; Krishna C. Puvvada; Vitaly Lavrukhin; Boris Ginsburg

arXiv:2308.05218 · cs.SD · August 11, 2023

TL;DR

This paper introduces CONF-TSASR, a non-autoregressive end-to-end model for single-channel target-speaker speech recognition. It reaches a state-of-the-art 4.2% TS-WER on WSJ0-2mix-extr and reports the first TS-WER results on WSJ0-3mix-extr, LibriSpeech2Mix, and LibriSpeech3Mix.

Contribution

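The paper presents CONF-TSASR, a novel time-frequency domain architecture that combines a TitaNet-based speaker embedding module with Conformer-based masking and ASR modules. The modules are jointly optimized with a CTC loss and a scale-invariant spectrogram reconstruction loss.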

Findings

01 Achieves a state-of-the-art 4.2% TS-WER on WSJ0-2mix-extr

02 Reports the first TS-WER results on WSJ0-3mix-extr (12.4%), LibriSpeech2Mix (4.2%), and LibriSpeech3Mix (7.6%)

03 Establishes new benchmarks for target-speaker ASR
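
The TS-WER figures above are word error rates measured against the target speaker's reference transcript. Word error rate itself is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below shows that standard computation; the function name is hypothetical, and any text normalization or scoring details specific to the paper's evaluation are not covered here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; not the paper's scoring tool.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For a target-speaker system, `hypothesis` would be the model's transcript of the mixture and `reference` the ground-truth transcript of the target speaker only, so words leaked from interfering speakers count as insertions or substitutions.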

Abstract

We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR). The model consists of a TitaNet-based speaker embedding module and Conformer-based masking and ASR modules. These modules are jointly optimized to transcribe a target speaker while ignoring speech from other speakers. For training we use Connectionist Temporal Classification (CTC) loss and introduce a scale-invariant spectrogram reconstruction loss to encourage the model to better separate the target speaker's spectrogram from the mixture. We obtain state-of-the-art target-speaker word error rate (TS-WER) on WSJ0-2mix-extr (4.2%). Further, we report for the first time TS-WER on WSJ0-3mix-extr (12.4%), LibriSpeech2Mix (4.2%) and LibriSpeech3Mix (7.6%) datasets, establishing new benchmarks for TS-ASR. The proposed model will be…
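
The abstract names a scale-invariant spectrogram reconstruction loss but this page does not give its formula. One plausible reading, by analogy with the SI-SNR objective common in speech separation, is to project the estimated spectrogram onto the target and penalize the residual energy. The sketch below illustrates that assumed form on plain Python lists to stay dependency-free. The function name, the `eps` term, and the choice of magnitude spectrograms are assumptions for illustration, not the authors' definition.

```python
import math


def si_spectrogram_loss(estimate, target, eps=1e-8):
    """Negative scale-invariant SNR (dB) between two spectrograms.

    Assumed form, modeled on SI-SNR; not taken from the paper.
    Each input is a list of frames, each frame a list of floats.
    """
    est = [v for frame in estimate for v in frame]
    tgt = [v for frame in target for v in frame]
    if len(est) != len(tgt):
        raise ValueError("spectrograms must have the same shape")

    # Optimal scaling of the target that best explains the estimate.
    dot = sum(e * t for e, t in zip(est, tgt))
    target_energy = sum(t * t for t in tgt)
    alpha = dot / (target_energy + eps)

    # Energy of the projected (explained) part and of the residual.
    projected_energy = sum((alpha * t) ** 2 for t in tgt)
    residual_energy = sum((e - alpha * t) ** 2 for e, t in zip(est, tgt))

    # Lower is better: a close estimate gives a large ratio, hence a
    # strongly negative loss.
    return -10.0 * math.log10((projected_energy + eps) / (residual_energy + eps))
```

Because the target is rescaled by `alpha` before comparison, multiplying the estimate by a constant leaves the loss essentially unchanged, which is the property the term "scale-invariant" refers to. In a joint training setup of the kind the abstract describes, a term like this would be added to the CTC loss with some weighting; that weighting is not stated here.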
