Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data
Yu Kang, Tianqiao Liu, Hang Li, Yang Hao, Wenbiao Ding

TL;DR
This paper introduces a novel self-supervised pre-training framework for audio-and-text models that effectively utilizes extremely low-resource parallel data combined with abundant unimodal data, achieving competitive downstream task performance.
Contribution
It proposes a new pre-training approach with intra-modal and cross-modal denoising auto-encoding and an iterative denoising process, enabling effective multimodal learning with minimal parallel data.
Findings
Achieves performance comparable to pre-training on fully parallel data across multiple downstream tasks.
Demonstrates the effectiveness of low-resource pre-training methods.
Shows potential for multilingual and low-resource language applications.
Abstract
Multimodal pre-training for audio-and-text has recently proven effective and has significantly improved performance on many downstream speech understanding tasks. However, these state-of-the-art audio-text pre-training models work well only when given large amounts of parallel audio-and-text data, which poses a challenge for the many languages that are rich in unimodal corpora but lack parallel cross-modal corpora. In this paper, we investigate whether it is possible to pre-train an audio-text multimodal model with extremely low-resource parallel data and extra non-parallel unimodal data. Our pre-training framework consists of the following components: (1) Intra-modal Denoising Auto-Encoding (IDAE), which reconstructs input text (audio) representations from a noisy version of the input. (2) Cross-modal Denoising Auto-Encoding (CDAE), which is pre-trained to…
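The denoising auto-encoding idea named in the abstract (corrupt an input, then train a model to recover the original) can be illustrated with a toy sketch. This is not the authors' implementation: the masking noise, the `MASK` symbol, the `mask_prob` value, and the function names `add_noise` and `reconstruction_error` are all assumptions made for illustration, and the abstract does not say which noise function the paper uses.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder symbol; an assumption, not from the paper


def add_noise(tokens, mask_prob=0.15, seed=None):
    """Corrupt a sequence by replacing each item with MASK with probability mask_prob.

    Returns the noisy sequence and the list of corrupted positions.
    Masking is only one possible noise function; it is assumed here.
    """
    rng = random.Random(seed)
    noisy, masked_positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            noisy.append(MASK)
            masked_positions.append(i)
        else:
            noisy.append(tok)
    return noisy, masked_positions


def reconstruction_error(original, reconstructed):
    """Fraction of positions where the reconstruction differs from the original."""
    if len(original) != len(reconstructed):
        raise ValueError("sequences must have the same length")
    if not original:
        return 0.0
    wrong = sum(o != r for o, r in zip(original, reconstructed))
    return wrong / len(original)


if __name__ == "__main__":
    tokens = ["turn", "on", "the", "kitchen", "lights"]
    noisy, positions = add_noise(tokens, mask_prob=0.5, seed=0)
    # A "model" that just copies its noisy input is wrong exactly at the masked
    # positions; a trained denoiser would aim to drive this error toward zero.
    print(noisy, positions, reconstruction_error(tokens, noisy))
```

In an intra-modal setting the same corrupt-then-reconstruct loop would apply to audio frames as well as text tokens, with a learned encoder-decoder in place of the copy baseline and a loss suited to the modality.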
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Layer Normalization · Absolute Position Encodings · Softmax · Residual Connection
