VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, Lirong Dai, Daxin Jiang, Jinyu Li, Furu Wei

TL;DR
VATLM is a unified multimodal pre-training framework that integrates visual, audio, and text data to improve speech-related tasks by aligning different modalities into a shared semantic space.
Contribution
This paper introduces VATLM, a novel unified framework for cross-modal speech representation learning that effectively combines visual, audio, and text modalities using a shared backbone and masked prediction.
Findings
VATLM outperforms previous state-of-the-art models on audio-visual speech recognition (AVSR) and visual speech recognition (VSR) tasks.
VATLM effectively aligns different modalities into a shared semantic space.
The unified tokenizer enables seamless integration of multiple modalities.
Abstract
Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision, text. How to design a unified framework to integrate different modal information and leverage different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning was not well explored. In this paper, we propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model). The proposed VATLM employs a unified backbone network to model the modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs. In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of…
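The architecture described in the abstract (three modality-dependent front-ends feeding one shared backbone, trained by masked prediction) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the class name `ToyVATLM`, the dimensions, the single linear layer with mean-pooled context standing in for the backbone, and the integer target ids standing in for unified tokens. It is not the authors' implementation.

```python
import math
import random

DIM = 8    # shared embedding width (hypothetical)
VOCAB = 5  # number of unified target classes (hypothetical)


def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyVATLM:
    """Toy stand-in: per-modality projections plus one shared backbone."""

    def __init__(self, input_dims, seed=0):
        rng = random.Random(seed)
        # One modality-dependent projection per input type.
        self.frontends = {
            name: make_matrix(DIM, dim, rng) for name, dim in input_dims.items()
        }
        # Shared, modality-independent weights (assumption: one linear layer).
        self.backbone = make_matrix(DIM, DIM, rng)
        self.head = make_matrix(VOCAB, DIM, rng)
        self.mask_embedding = [0.0] * DIM

    def logits(self, modality, frames, masked):
        proj = self.frontends[modality]
        # Masked positions are replaced by a mask embedding before the backbone.
        embedded = [
            self.mask_embedding if i in masked else matvec(proj, frame)
            for i, frame in enumerate(frames)
        ]
        # Mean pooling is a crude stand-in for contextual mixing.
        context = [sum(col) / len(embedded) for col in zip(*embedded)]
        out = []
        for emb in embedded:
            mixed = [a + b for a, b in zip(emb, context)]
            hidden = [math.tanh(v) for v in matvec(self.backbone, mixed)]
            out.append(matvec(self.head, hidden))
        return out


def masked_prediction_loss(model, modality, frames, targets, masked):
    """Mean cross-entropy over masked positions only (masked must be non-empty)."""
    all_logits = model.logits(modality, frames, masked)
    total = 0.0
    for i in masked:
        row = all_logits[i]
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[targets[i]]
    return total / len(masked)


if __name__ == "__main__":
    dims = {"visual": 6, "audio": 4, "text": 3}
    model = ToyVATLM(dims)
    rng = random.Random(1)
    for name, dim in dims.items():
        frames = [[rng.random() for _ in range(dim)] for _ in range(5)]
        targets = [rng.randrange(VOCAB) for _ in range(5)]
        loss = masked_prediction_loss(model, name, frames, targets, {1, 3})
        print(name, round(loss, 3))
```

The point of the sketch is that only `frontends` differs per modality, while `backbone`, `head`, and the loss are shared, so every input type is trained against the same target vocabulary.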
Taxonomy
Topics: Speech and Audio Processing · Subtitles and Audiovisual Media · Hearing Loss and Rehabilitation
