X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, Wangchunshu Zhou

TL;DR
X$^2$-VLM is a versatile, unified pre-trained model that learns multi-grained vision-language alignments for both images and videos, achieving state-of-the-art results across multiple tasks with high transferability.
Contribution
The paper introduces X$^2$-VLM, a modular, all-in-one pre-training framework that unifies image-text and video-text learning with multi-grained alignments and localization.
Findings
X$^2$-VLM performs best at both base and large scale on image-text and video-text tasks.
The modular design enables high transferability across languages and domains.
Replacing the text encoder with XLM-R enhances multilingual performance.
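The modularity claim in the findings above can be illustrated with a toy sketch. This is not the authors' code: the class `VisionLanguageModel`, the method `with_text_encoder`, and all the toy encoders are hypothetical names invented for illustration. It only shows the general idea that, when the vision encoder, text encoder, and fusion module are separate components, the text encoder can be swapped (for example for a multilingual one such as XLM-R) while the other modules are reused unchanged.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass(frozen=True)
class VisionLanguageModel:
    """Hypothetical container of three independent modules (illustration only)."""
    vision_encoder: Callable[[str], Vector]      # image/video id -> features
    text_encoder: Callable[[str], Vector]        # text -> features
    fusion: Callable[[Vector, Vector], float]    # (vision, text) -> match score

    def score(self, visual_input: str, text: str) -> float:
        """Encode both inputs separately, then fuse them into one score."""
        return self.fusion(self.vision_encoder(visual_input),
                           self.text_encoder(text))

    def with_text_encoder(self, new_encoder: Callable[[str], Vector]) -> "VisionLanguageModel":
        """Return a copy that reuses the vision and fusion modules as-is."""
        return VisionLanguageModel(self.vision_encoder, new_encoder, self.fusion)


# Toy stand-ins for real networks (assumptions, not real model behavior).
def toy_vision_encoder(_visual_input: str) -> Vector:
    return [1.0, 0.0]  # pretend every input shows a dog


def english_text_encoder(text: str) -> Vector:
    return [1.0, 0.0] if "dog" in text else [0.0, 1.0]


def multilingual_text_encoder(text: str) -> Vector:
    dog_words = ("dog", "chien", "Hund")
    return [1.0, 0.0] if any(w in text for w in dog_words) else [0.0, 1.0]


def dot_fusion(v: Vector, t: Vector) -> float:
    return sum(a * b for a, b in zip(v, t))


base = VisionLanguageModel(toy_vision_encoder, english_text_encoder, dot_fusion)
multilingual = base.with_text_encoder(multilingual_text_encoder)
```

In this toy setup, `base.score("img", "un chien")` does not match the French caption, while `multilingual.score("img", "un chien")` does, even though the vision and fusion components are the same objects in both models.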
Abstract
Vision language pre-training aims to learn alignments between vision and language from a large amount of data. Most existing methods only learn image-text alignments. Some others utilize pre-trained object detectors to leverage vision language alignments at the object level. In this paper, we propose to learn multi-grained vision language alignments by a unified pre-training framework that learns multi-grained aligning and multi-grained localization simultaneously. Based on it, we present X$^2$-VLM, an all-in-one model with a flexible modular architecture, in which we further unify image-text pre-training and video-text pre-training in one model. X$^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions. Experiment results show that X$^2$-VLM performs the best on base and large scale for both image-text and video-text tasks, making a good trade-off…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
Methods: Balanced Selection · XLM-R
