Joint Encoder-Decoder Self-Supervised Pre-training for ASR
Arunkumar A, Umesh S

TL;DR
This paper introduces a joint encoder-decoder self-supervised pre-training approach for ASR, leveraging a multitask SSL setup to improve speech recognition accuracy by learning an acoustic unit-based language model.
Contribution
It proposes integrating a decoder into the SSL framework for ASR, which jointly optimizes encoder and decoder losses to enhance downstream speech recognition performance.
Findings
Up to 25% relative improvement on LibriSpeech ASR tasks
Joint encoder-decoder SSL outperforms HuBERT baseline
Decoder inclusion helps learn an acoustic unit-based language model
Abstract
Self-supervised learning (SSL) has shown tremendous success in various speech-related downstream tasks, including Automatic Speech Recognition (ASR). The output embeddings of the SSL model are treated as powerful short-time representations of the speech signal. However, in the ASR task, the main objective is to get the correct sequence of acoustic units, characters, or byte-pair encodings (BPEs). An encoder-decoder architecture usually works exceptionally well for a sequence-to-sequence task like ASR. Therefore, in this paper, we propose a new paradigm that exploits the power of a decoder during self-supervised learning. We use the Hidden Unit BERT (HuBERT) SSL framework to compute the conventional masked prediction loss for the encoder. In addition, we introduce a decoder into the SSL framework and propose a target preparation strategy for the decoder. Finally, we use a multitask SSL…
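The multitask objective described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the weighted-sum form with `dec_weight`, and the `collapse_repeats` target preparation (merging consecutive duplicate unit IDs) are all hypothetical, since the text above does not specify how the two losses are combined or how decoder targets are built.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def masked_prediction_loss(frame_logits, unit_ids, mask):
    """Encoder loss: mean cross-entropy over masked frames only."""
    losses = [
        cross_entropy(logits, unit)
        for logits, unit, is_masked in zip(frame_logits, unit_ids, mask)
        if is_masked
    ]
    return sum(losses) / len(losses) if losses else 0.0


def collapse_repeats(unit_ids):
    """ASSUMPTION: one plausible decoder-target preparation.

    Merges runs of identical frame-level unit IDs into a single unit,
    giving a shorter sequence for the decoder to predict.
    """
    collapsed = []
    for unit in unit_ids:
        if not collapsed or collapsed[-1] != unit:
            collapsed.append(unit)
    return collapsed


def decoder_loss(step_logits, target_units):
    """Decoder loss: mean cross-entropy over the target unit sequence."""
    losses = [cross_entropy(l, t) for l, t in zip(step_logits, target_units)]
    return sum(losses) / len(losses) if losses else 0.0


def joint_loss(enc_loss, dec_loss, dec_weight=1.0):
    """ASSUMPTION: multitask loss as a weighted sum of both terms."""
    return enc_loss + dec_weight * dec_loss
```

In this sketch, the frame-level unit IDs would serve as encoder targets at the masked positions, while the collapsed sequence would serve as the decoder target; the two losses are then summed and backpropagated through the shared encoder.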
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Linear Warmup With Linear Decay · Weight Decay · Layer Normalization · Softmax · Attention Dropout · WordPiece
