VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Qiushi Zhu; Long Zhou; Ziqiang Zhang; Shujie Liu; Binxing Jiao; Jie Zhang; Lirong Dai; Daxin Jiang; Jinyu Li; Furu Wei

arXiv:2211.11275·eess.AS·May 22, 2023·1 cites

TL;DR

VATLM is a unified multimodal pre-training framework that integrates visual, audio, and text data to improve speech-related tasks by aligning different modalities into a shared semantic space.

Contribution

This paper introduces VATLM, a novel unified framework for cross-modal speech representation learning that effectively combines visual, audio, and text modalities using a shared backbone and masked prediction.

Findings

01. VATLM outperforms previous state-of-the-art models on AVSR and VSR tasks.

02. VATLM effectively aligns different modalities into a shared semantic space.

03. The unified tokenizer enables seamless integration of multiple modalities.

Abstract

Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision, text. How to design a unified framework to integrate different modal information and leverage different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning was not well explored. In this paper, we propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model). The proposed VATLM employs a unified backbone network to model the modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs. In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of…
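
The architecture the abstract describes (three modality-dependent front-ends feeding one shared backbone, trained with masked prediction) can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the feature dimensions, the linear front-ends, the mean-pooled context layer standing in for the backbone, the learned mask embedding, and the randomly generated target token ids. It is not the authors' implementation, and it does not specify what the masked prediction targets are in VATLM.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, VOCAB = 16, 32  # hypothetical shared width and target vocabulary size

# Hypothetical modality-dependent front-ends: one linear projection per modality.
FEATURE_DIMS = {"visual": 24, "audio": 12, "text": 8}
frontends = {m: rng.normal(0, 0.1, (d, D_MODEL)) for m, d in FEATURE_DIMS.items()}

# Shared (modality-independent) parameters: backbone stand-in, output head,
# and the embedding that replaces masked frames.
W_backbone = rng.normal(0, 0.1, (D_MODEL, D_MODEL))
W_head = rng.normal(0, 0.1, (D_MODEL, VOCAB))
mask_embedding = rng.normal(0, 0.1, D_MODEL)


def masked_prediction_loss(features, modality, targets, mask):
    """Mean cross-entropy over masked positions only.

    features: (T, FEATURE_DIMS[modality]) array of per-frame inputs
    targets:  (T,) integer token ids (assumed to come from some tokenizer)
    mask:     (T,) boolean array, True where the input frame is hidden
    """
    x = features @ frontends[modality]              # project into shared space
    x = np.where(mask[:, None], mask_embedding, x)  # hide the masked frames
    # Stand-in for a Transformer backbone: add mean-pooled context to each
    # position so masked frames must be predicted from their surroundings.
    h = np.tanh((x + x.mean(axis=0)) @ W_backbone)
    logits = h @ W_head
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    return float(-picked[mask].mean())


# Usage: the same backbone and head serve all three modalities.
T = 10
mask = np.zeros(T, dtype=bool)
mask[3:7] = True                                    # hide one contiguous span
targets = rng.integers(0, VOCAB, size=T)            # random placeholder targets
losses = {
    m: masked_prediction_loss(rng.normal(size=(T, d)), m, targets, mask)
    for m, d in FEATURE_DIMS.items()
}
print(losses)
```

Only the front-end weights differ per modality in this sketch; the backbone, head, and target vocabulary are shared, which is one simple way to push all inputs toward a common output space.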

Taxonomy

Topics: Speech and Audio Processing · Subtitles and Audiovisual Media · Hearing Loss and Rehabilitation