Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
Yan Zeng, Xinsong Zhang, Hang Li

TL;DR
This paper introduces X-VLM, a multi-grained vision language pre-training method that aligns texts with visual concepts at multiple levels, improving performance on various downstream tasks.
Contribution
The paper presents a novel multi-grained alignment approach for vision language pre-training, addressing limitations of object-centric methods in capturing relations among multiple objects.
Findings
X-VLM outperforms state-of-the-art methods on multiple vision language tasks.
Effective learning of multi-grained alignments enhances downstream task performance.
The method successfully locates visual concepts and aligns them with texts at multiple granularities.
Abstract
Most existing methods in vision language pre-training rely on object-centric features extracted through object detection and make fine-grained alignments between the extracted features and texts. It is challenging for these methods to learn relations among multiple objects. To this end, we propose a new method called X-VLM to perform "multi-grained vision language pre-training." The key to learning multi-grained alignments is to locate visual concepts in the image given the associated texts, and in the meantime align the texts with the visual concepts, where the alignments are in multi-granularity. Experimental results show that X-VLM effectively leverages the learned multi-grained alignments to many downstream vision language tasks and consistently outperforms state-of-the-art methods.
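Illustrative sketch
The abstract describes two coupled objectives: locating a visual concept in the image given its text, and aligning that text with the concept, at more than one granularity. It does not say which loss functions are used, so the Python sketch below is only a toy illustration under stated assumptions, not the authors' implementation. The IoU-based localization term and the InfoNCE-style contrastive term are stand-ins chosen for brevity, and every name here (iou, cosine, contrastive_loss, multi_grained_loss, and the sample fields) is hypothetical.

```python
import math


def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def contrastive_loss(text_vec, concept_vecs, positive_idx, temperature=0.07):
    """Softmax cross-entropy over cosine similarities (InfoNCE-style).

    Assumption: the temperature value is arbitrary, picked for illustration.
    """
    logits = [cosine(text_vec, c) / temperature for c in concept_vecs]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_z - logits[positive_idx]


def multi_grained_loss(samples):
    """Average of (localization + alignment) terms over text/concept pairs.

    Each sample is one pair at some granularity: a caption with the whole
    image, or a phrase with an object or region.
    """
    total = 0.0
    for s in samples:
        locate = 1.0 - iou(s["pred_box"], s["true_box"])  # "locate" term
        align = contrastive_loss(                          # "align" term
            s["text_vec"], s["concept_vecs"], s["positive_idx"]
        )
        total += locate + align
    return total / len(samples)


# Hypothetical toy data: one image-level pair and one object-level pair.
samples = [
    {   # whole-image caption: the "concept" box spans the full image
        "pred_box": (0.0, 0.0, 1.0, 1.0), "true_box": (0.0, 0.0, 1.0, 1.0),
        "text_vec": [1.0, 0.0], "concept_vecs": [[1.0, 0.0], [0.0, 1.0]],
        "positive_idx": 0,
    },
    {   # object phrase: the predicted box only partly overlaps the target
        "pred_box": (0.1, 0.1, 0.5, 0.5), "true_box": (0.2, 0.2, 0.6, 0.6),
        "text_vec": [0.0, 1.0], "concept_vecs": [[1.0, 0.0], [0.0, 1.0]],
        "positive_idx": 1,
    },
]
loss = multi_grained_loss(samples)
```

Treating the whole image as one more "concept" whose box covers the full frame lets a single loop handle both coarse and fine pairs, which is one simple way to read "multi-grained" in this toy setting.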
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
