VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Sihan Chen; Handong Li; Qunbo Wang; Zijia Zhao; Mingzhen Sun; Xinxin Zhu; Jing Liu

arXiv:2305.18500·cs.CV·October 10, 2023·28 cites

TL;DR

This paper introduces VAST-27M, a large-scale omni-modality video caption dataset, and VAST, a foundation model that integrates vision, audio, subtitles, and text for multi-modal video understanding.

Contribution

The paper presents VAST-27M, a novel large-scale omni-modality dataset, and a foundation model that jointly processes vision, audio, subtitles, and text for comprehensive video understanding.

Findings

01. VAST achieves 22 new state-of-the-art results on cross-modality benchmarks.

02. The dataset effectively supports training multi-modal video understanding models.

03. The VAST model demonstrates strong performance across retrieval, captioning, and QA tasks.

Abstract

Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this paper, we establish connections between multi-modality video tracks (vision, audio, and subtitle) and text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision captioner and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and…
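
The caption-integration step described in the abstract (vision captions, audio captions, and subtitles merged by an LLM under an instructional prompt) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `ClipAnnotations` container, the function names, and the prompt wording are all hypothetical, and `llm` stands in for any text-in, text-out callable rather than a specific model API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClipAnnotations:
    """Single-modality text for one video clip (hypothetical container)."""
    vision_captions: List[str]
    audio_captions: List[str]
    subtitle: str


def build_integration_prompt(clip: ClipAnnotations) -> str:
    """Assemble an instructional prompt; the wording is illustrative only."""
    lines = [
        "Combine the following descriptions of one video clip into a single caption.",
        "Vision captions:",
        *[f"- {c}" for c in clip.vision_captions],
        "Audio captions:",
        *[f"- {c}" for c in clip.audio_captions],
        f"Subtitle: {clip.subtitle}" if clip.subtitle else "Subtitle: (none)",
        "Omni-modality caption:",
    ]
    return "\n".join(lines)


def make_omni_caption(clip: ClipAnnotations, llm: Callable[[str], str]) -> str:
    """Send the prompt to an LLM callable (assumed interface) and tidy the reply."""
    return llm(build_integration_prompt(clip)).strip()
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular model or serving library; a stub such as `lambda prompt: "a caption"` is enough to exercise the prompt construction.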

Code & Models

Videos

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Handwritten Text Recognition Techniques