token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text

Xianghu Yue; Junyi Ao; Xiaoxue Gao; Haizhou Li

arXiv:2210.16755·cs.CL·November 1, 2022

TL;DR

token2vec is a joint self-supervised pre-training framework for unpaired speech and text. It uses discrete speech tokens and phoneme sequences to improve speech recognition and to transfer to other spoken language tasks.

Contribution

The paper proposes a method for speech-text joint pre-training on unpaired data, using discrete speech tokens and a modality-agnostic Transformer.

Findings

01. Up to 17.7% relative WER reduction over speech-only baselines.

02. Effective transfer to spoken intent classification.

03. Demonstrates the feasibility of joint pre-training on unpaired speech and text.

Abstract

Self-supervised pre-training has been successful in both text and speech processing. Speech and text offer different but complementary information. The question is whether speech-text joint pre-training can be performed on unpaired speech and text. In this paper, we take the idea of self-supervised pre-training one step further and propose token2vec, a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech. Firstly, because the two modalities differ in nature, with speech being continuous and text discrete, we discretize speech into a sequence of discrete speech tokens to solve the modality mismatch problem. Secondly, to solve the length mismatch problem, where the speech sequence is usually much longer than the text sequence, we convert the words of text into phoneme sequences and randomly…
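
The discretization step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which quantizer token2vec uses, so the nearest-centroid assignment below is an assumption chosen for illustration, and the function name `discretize`, the toy centroids, and the toy frames are all hypothetical.

```python
import numpy as np


def discretize(frames: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each continuous frame to the index of its nearest centroid.

    frames:    array of shape (T, D), one feature vector per speech frame
    centroids: array of shape (K, D), a hypothetical pre-computed codebook
    returns:   array of shape (T,), one discrete token ID per frame
    """
    # Squared Euclidean distance between every frame and every centroid: (T, K)
    dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    # The token ID is the index of the closest centroid
    return dists.argmin(axis=1)


# Toy values for illustration only (not taken from the paper)
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.2], [0.8, 0.7], [0.2, 0.1]])

tokens = discretize(frames, centroids)
print(tokens.tolist())
```

Here each continuous frame becomes an integer ID, so a speech utterance can be handled as a discrete sequence in the same way as text. The sequence still has one token per frame, which is why the abstract treats the length mismatch as a separate problem.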

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Layer Normalization