X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

Yan Zeng; Xinsong Zhang; Hang Li; Jiawei Wang; Jipeng Zhang; Wangchunshu Zhou

arXiv:2211.12402·cs.CV·August 1, 2023·5 cites

Open Access · 2 Repos

TL;DR

X$^2$-VLM is a versatile, unified pre-trained model that learns multi-grained vision-language alignments for both images and videos, achieving state-of-the-art results across multiple tasks with high transferability.

Contribution

The paper introduces X$^2$-VLM, a modular, all-in-one pre-training framework that unifies image-text and video-text learning with multi-grained alignments and localization.

Findings

01. X$^2$-VLM outperforms existing models on image-text and video-text tasks.

02. The modular design enables high transferability across languages and domains.

03. Replacing the text encoder with XLM-R enhances multilingual performance.

Abstract

Vision language pre-training aims to learn alignments between vision and language from a large amount of data. Most existing methods only learn image-text alignments. Some others utilize pre-trained object detectors to leverage vision language alignments at the object level. In this paper, we propose to learn multi-grained vision language alignments by a unified pre-training framework that learns multi-grained aligning and multi-grained localization simultaneously. Based on it, we present X$^2$-VLM, an all-in-one model with a flexible modular architecture, in which we further unify image-text pre-training and video-text pre-training in one model. X$^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions. Experiment results show that X$^2$-VLM performs the best on base and large scale for both image-text and video-text tasks, making a good trade-off…
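
To make "aligning" and "localization" concrete, here is a minimal pure-Python sketch of the two kinds of quantities such objectives are commonly built on. The abstract does not specify the losses, so everything here is an assumption for illustration: `info_nce` is a generic symmetric contrastive loss over a visual-text similarity matrix, `box_iou` is the standard intersection-over-union between a predicted and a reference box, and the function names and the `temperature` default are hypothetical, not taken from the paper or its code.

```python
import math


def info_nce(sim, temperature=0.07):
    """Symmetric contrastive loss over an N x N similarity matrix.

    sim[i][j] is the similarity between visual item i and text j;
    matched pairs are assumed to sit on the diagonal.
    (Generic formulation, not necessarily the paper's objective.)
    """
    n = len(sim)

    def directional_loss(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # negative log-softmax of the match
        return total / n

    columns = [list(col) for col in zip(*sim)]  # text-to-visual direction
    return 0.5 * (directional_loss(sim) + directional_loss(columns))


def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Toy batch: well-aligned pairs give a lower loss than swapped pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(aligned) < info_nce(swapped))

# Overlap between a predicted region and a reference region.
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

In this toy reading, "multi-grained" would mean the rows of `sim` can be whole images, regions, or objects paired with captions or phrases, while `box_iou` scores how well a region predicted from a text description overlaps its reference box.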

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: Balanced Selection · XLM-R