Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages

Felix Wu; Kwangyoun Kim; Shinji Watanabe; Kyu Han; Ryan McDonald; Kilian Q. Weinberger; Yoav Artzi

arXiv:2205.01086 · cs.CL · May 3, 2022 · 5 cites

TL;DR

Wav2Seq introduces a self-supervised pre-training method for speech-to-text models using pseudo languages, improving performance across ASR, translation, and named entity recognition without relying on extensive labeled data.

Contribution

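Wav2Seq is the first self-supervised approach to pre-train both the encoder and the decoder of speech encoder-decoder models. It does so with an induced pseudo language, and the pre-training improves ASR, spoken named entity recognition, and speech-to-text translation.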

Findings

01 State-of-the-art results in spoken named entity recognition.

02 Consistent improvements in speech-to-text translation across 20 language pairs.

03 Comparable performance to recent optimized methods in ASR.

Abstract

We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task -- transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. Finally, on ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and shows…
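The abstract does not spell out how the pseudo language is induced, so the following is only a minimal sketch under stated assumptions: frame-level features are mapped to their nearest cluster centroid, consecutive repeats are collapsed, and frequent adjacent units are merged into pseudo subwords with a toy byte-pair-encoding loop. All function names, the precomputed centroids, and the toy merge procedure are hypothetical illustrations, not the authors' implementation or the API of the asappresearch/wav2seq repository.

```python
from collections import Counter


def assign_clusters(frames, centroids):
    """Map each feature frame to the index of its nearest centroid.

    Assumption: centroids were fitted beforehand (e.g. by k-means).
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sqdist(frame, centroids[k]))
        for frame in frames
    ]


def collapse_repeats(ids):
    """Drop consecutive duplicate unit IDs to shorten the sequence."""
    out = []
    for unit in ids:
        if not out or out[-1] != unit:
            out.append(unit)
    return out


def apply_merge(seq, a, b):
    """Replace every adjacent (a, b) pair in seq with the merged token a + b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(sequences, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair.

    Tokens are tuples of unit IDs, so a merged token is a longer tuple.
    """
    seqs = [[(unit,) for unit in s] for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so further merges add no compression
            break
        merges.append((a, b))
        seqs = [apply_merge(s, a, b) for s in seqs]
    return merges, seqs
```

Under these assumptions, the pseudo subword sequences returned by `learn_merges` would serve as the transcription targets for the pseudo speech recognition task, with the raw audio as input.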

Code & Models

Repositories

asappresearch/wav2seq (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing