Joint Encoder-Decoder Self-Supervised Pre-training for ASR

Arunkumar A; Umesh S

arXiv:2206.04465·cs.CL·June 10, 2022

TL;DR

This paper introduces a joint encoder-decoder self-supervised pre-training approach for ASR, leveraging a multitask SSL setup to improve speech recognition accuracy by learning an acoustic unit-based language model.

Contribution

It proposes integrating a decoder into the SSL framework for ASR, which jointly optimizes encoder and decoder losses to enhance downstream speech recognition performance.
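As a rough illustration of what jointly optimizing encoder and decoder losses can look like, the sketch below combines a masked-prediction cross-entropy over encoder frames with a cross-entropy over decoder outputs. This is a minimal sketch under stated assumptions, not the authors' implementation: the weighted-sum form, the names `joint_ssl_loss` and `dec_weight`, and the toy inputs are all hypothetical, and the summary above does not say how the two losses are actually combined.

```python
import math


def cross_entropy(logits, targets):
    """Mean cross-entropy over rows of `logits` given integer `targets`."""
    if not targets:
        raise ValueError("cross_entropy needs at least one target")
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
    return total / len(targets)


def joint_ssl_loss(enc_logits, frame_targets, mask,
                   dec_logits, dec_targets, dec_weight=1.0):
    """Hypothetical multitask loss: encoder loss + dec_weight * decoder loss.

    enc_logits:    per-frame scores over acoustic units (list of lists)
    frame_targets: per-frame acoustic-unit labels
    mask:          True where the frame was masked; only these frames
                   contribute to the encoder loss (HuBERT-style assumption)
    dec_logits:    per-step decoder scores over acoustic units
    dec_targets:   decoder target unit sequence
    dec_weight:    assumed scalar weight on the decoder loss
    """
    masked = [(row, t) for row, t, m in zip(enc_logits, frame_targets, mask) if m]
    enc_loss = cross_entropy([row for row, _ in masked], [t for _, t in masked])
    dec_loss = cross_entropy(dec_logits, dec_targets)
    return enc_loss + dec_weight * dec_loss


# Toy usage: 5 frames, 4 acoustic units, 2 masked frames, 3 decoder steps.
enc_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(5)]
frame_targets = [0, 1, 2, 3, 0]
mask = [True, False, True, False, False]
dec_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
dec_targets = [1, 2, 3]
total = joint_ssl_loss(enc_logits, frame_targets, mask,
                       dec_logits, dec_targets, dec_weight=0.5)
```

With uniform scores each cross-entropy term equals log(4), so the toy total is 1.5 * log(4); in a real setup both terms would be computed from model outputs and backpropagated together.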

Findings

01 Up to 25% relative improvement on LibriSpeech ASR tasks

02 Joint encoder-decoder SSL outperforms HuBERT baseline

03 Decoder inclusion helps learn an acoustic unit-based language model
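To make "relative improvement" concrete, the sketch below computes a relative error reduction. It assumes the metric is an error rate such as word error rate (WER), which this summary does not state, and the numbers are hypothetical placeholders rather than results from the paper.

```python
def relative_improvement(baseline_error, new_error):
    """Fraction by which the error rate drops relative to the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical example: an error rate falling from 10.0 to 7.5
# is a 25% relative improvement (and a 2.5-point absolute one).
example = relative_improvement(10.0, 7.5)
```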

Abstract

Self-supervised learning (SSL) has shown tremendous success in various speech-related downstream tasks, including Automatic Speech Recognition (ASR). The output embeddings of the SSL model are treated as powerful short-time representations of the speech signal. However, in the ASR task, the main objective is to get the correct sequence of acoustic units, characters, or byte-pair encodings (BPEs). Usually, an encoder-decoder architecture works exceptionally well for a sequence-to-sequence task like ASR. Therefore, in this paper, we propose a new paradigm that exploits the power of a decoder during self-supervised learning. We use the Hidden Unit BERT (HuBERT) SSL framework to compute the conventional masked prediction loss for the encoder. In addition, we introduce a decoder into the SSL framework and propose a target preparation strategy for the decoder. Finally, we use a multitask SSL…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Linear Warmup With Linear Decay · Weight Decay · Layer Normalization · Softmax · Attention Dropout · WordPiece