Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data

Yu Kang; Tianqiao Liu; Hang Li; Yang Hao; Wenbiao Ding

arXiv:2204.04645 · cs.SD · April 12, 2022

TL;DR

This paper introduces a self-supervised pre-training framework for audio-and-text models that combines extremely low-resource parallel data with abundant unimodal data and achieves competitive downstream task performance.

Contribution

It proposes a new pre-training approach with intra-modal and cross-modal denoising auto-encoding and an iterative denoising process, enabling effective multimodal learning with minimal parallel data.

Findings

1. Achieves comparable performance to fully parallel data pre-training on multiple tasks.
2. Demonstrates the effectiveness of low-resource pre-training methods.
3. Shows potential for multilingual and low-resource language applications.

Abstract

Multimodal pre-training for audio-and-text has recently proven effective and has significantly improved performance on many downstream speech understanding tasks. However, these state-of-the-art audio-text pre-training models work well only when provided with a large amount of parallel audio-and-text data, which poses challenges for many languages that are rich in unimodal corpora but scarce in parallel cross-modal corpora. In this paper, we investigate whether it is possible to pre-train an audio-text multimodal model with extremely low-resource parallel data and extra non-parallel unimodal data. Our pre-training framework consists of the following components: (1) Intra-modal Denoising Auto-Encoding (IDAE), which reconstructs input text (audio) representations from a noisy version of themselves. (2) Cross-modal Denoising Auto-Encoding (CDAE), which is pre-trained to…
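
To make the denoising auto-encoding idea in the abstract concrete, here is a minimal sketch of how a (noisy input, clean target) pair might be built for one modality. The abstract does not say which noise is applied, so random token masking is an assumption here. The names `MASK`, `corrupt`, and `reconstruction_pair` are hypothetical and are not taken from the paper or its repository.

```python
import random

MASK = "<mask>"  # hypothetical placeholder symbol, not from the paper


def corrupt(tokens, mask_prob=0.3, seed=0):
    """Replace each token with MASK with probability mask_prob.

    Random masking is an assumed noise type; the abstract only says
    the input is a "noisy version" of the original.
    """
    rng = random.Random(seed)
    noisy, positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            noisy.append(MASK)
            positions.append(i)
        else:
            noisy.append(tok)
    return noisy, positions


def reconstruction_pair(tokens, mask_prob=0.3, seed=0):
    """Build one training example: the model would see `input`
    and be trained to reproduce `target`."""
    noisy, positions = corrupt(tokens, mask_prob, seed)
    return {"input": noisy, "target": list(tokens), "masked": positions}


if __name__ == "__main__":
    example = reconstruction_pair(["the", "cat", "sat", "on", "the", "mat"])
    print(example["input"])
    print(example["target"])
```

In an intra-modal setting the input and target would come from the same modality. A cross-modal variant would presumably pair a noisy input from one modality with a target from the other, but the truncated abstract does not give enough detail to sketch that faithfully.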

Code & Models

Repositories

karlyukang/low-resource-multimodal-pre-training (PyTorch, official)

Videos

Self-Supervised Audio-and-Text Pre-Training with Extremely Low-Resource Parallel Data · underline

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Layer Normalization · Absolute Position Encodings · Softmax · Residual Connection