token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text
Xianghu Yue, Junyi Ao, Xiaoxue Gao, Haizhou Li

TL;DR
token2vec is a joint self-supervised pre-training framework for unpaired speech and text. It uses discrete speech tokens and phoneme sequences to improve speech recognition and transfer to downstream tasks such as spoken intent classification.
Contribution
It proposes a novel method to perform speech-text joint pre-training on unpaired data using discrete speech tokens and a modality-agnostic Transformer.
Findings
Up to 17.7% relative WER reduction over speech-only baselines.
Effective transfer to spoken intent classification.
Demonstrates the feasibility of joint pre-training on unpaired speech and text.
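For readers unfamiliar with the metric, the sketch below shows how a relative WER reduction is conventionally computed from a baseline WER and an improved WER. The function name and the numbers in the example are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates on the same scale (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only, NOT figures from the paper.
print(relative_wer_reduction(10.0, 8.0))  # prints 20.0
```

A relative reduction is always measured against the baseline, so the same absolute gain counts for more when the baseline WER is already low.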
Abstract
Self-supervised pre-training has been successful in both text and speech processing. Speech and text offer different but complementary information. The question is whether we can perform speech-text joint pre-training on unpaired speech and text. In this paper, we take the idea of self-supervised pre-training one step further and propose token2vec, a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech. Firstly, because the two modalities differ in nature (speech is continuous while text is discrete), we discretize speech into a sequence of discrete speech tokens to solve the modality mismatch problem. Secondly, to solve the length mismatch problem, where the speech sequence is usually much longer than the text sequence, we convert the words of text into phoneme sequences and randomly…
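The two preprocessing ideas named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how speech is discretized, so nearest-centroid quantization against a fixed codebook is assumed here, and the toy codebook, the two-dimensional frames, and the lexicon are all hypothetical. The step that the truncated abstract cuts off after "randomly" is intentionally left out.

```python
from typing import Dict, List, Sequence


def quantize_frames(frames: Sequence[Sequence[float]],
                    centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature frame to the index of its nearest centroid.

    Assumption: a codebook of centroids already exists (for example from
    clustering); the source does not specify the tokenizer.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))
        for frame in frames
    ]


def words_to_phonemes(words: Sequence[str],
                      lexicon: Dict[str, List[str]]) -> List[str]:
    """Expand each word into its phoneme sequence using a pronunciation lexicon."""
    phonemes: List[str] = []
    for word in words:
        phonemes.extend(lexicon[word.lower()])  # raises KeyError for unknown words
    return phonemes


# Hypothetical toy data, for illustration only.
toy_centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
toy_frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 4.1], [0.2, 0.1]]
toy_lexicon = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

print(quantize_frames(toy_frames, toy_centroids))       # prints [0, 1, 2, 0]
print(words_to_phonemes(["Hello", "world"], toy_lexicon))
# prints ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```

After both steps, speech and text are each a sequence of discrete symbols, which is the property that lets a single modality-agnostic Transformer consume either one.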
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Layer Normalization
