OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Junke Wang; Dongdong Chen; Zuxuan Wu; Chong Luo; Luowei Zhou; Yucheng Zhao; Yujia Xie; Ce Liu; Yu-Gang Jiang; Lu Yuan

arXiv:2209.07526 · cs.CV · October 21, 2022 · 68 cites

Open Access

TL;DR

OmniVL is a unified transformer-based foundation model that handles both image-language and video-language tasks. It combines decoupled joint pretraining with a unified vision-language contrastive (UniVLC) loss and reports state-of-the-art or competitive results across diverse tasks.

Contribution

The paper proposes one universal architecture for image-language and video-language tasks. Decoupled joint pretraining decomposes vision-language modeling into spatial and temporal dimensions, and a unified contrastive loss draws on image-text, video-text, and image-label data.

Findings

01 Supports a wide range of tasks without task-specific adaptors

02 Achieves state-of-the-art or competitive results on multiple benchmarks

03 Effectively leverages both supervised and noisily supervised data

Abstract

This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and thus can perform joint image-language and video-language pretraining. We demonstrate, for the first time, such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., use image-language to help video-language). To this end, we propose a decoupled joint pretraining of image-language and video-language to effectively decompose the vision-language modeling into spatial and temporal dimensions and obtain performance boost on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification),…
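
To make the idea of a contrastive loss that mixes caption data with label data more concrete, here is a minimal sketch in plain Python. It is not the paper's UniVLC formulation, which the truncated abstract above does not spell out. It assumes, purely for illustration, that every sample carries an integer label id, that texts sharing a label id with a visual sample count as positives, and that samples from caption data receive unique ids. The function name `label_aware_contrastive_loss`, the `temperature` default, and the vision-to-text-only direction are all hypothetical choices.

```python
import math


def label_aware_contrastive_loss(visual, text, labels, temperature=0.07):
    """Vision-to-text contrastive loss with label-defined positives.

    Hypothetical formulation: text j is a positive for visual sample i
    whenever labels[j] == labels[i]. With all-unique labels this reduces
    to the usual one-positive-per-row contrastive objective.
    """
    def normalize(vec):
        # Scale a vector to unit length so dot products are cosine similarities.
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    visual = [normalize(v) for v in visual]
    text = [normalize(t) for t in text]

    total = 0.0
    for i, v in enumerate(visual):
        # Temperature-scaled similarity of visual sample i to every text.
        logits = [sum(a * b for a, b in zip(v, t)) / temperature for t in text]
        # Numerically stable log of the softmax normalizer.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Average the negative log-probability over all positives of sample i.
        positives = [j for j, lab in enumerate(labels) if lab == labels[i]]
        total += -sum(logits[j] - log_z for j in positives) / len(positives)
    return total / len(visual)
```

In this sketch, a batch drawn from a classification dataset would reuse the same label id for all samples of a class, so several texts can act as positives for one image, while a batch of image-text or video-text pairs would simply use distinct ids per pair.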

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques