Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts

Yan Zeng; Xinsong Zhang; Hang Li

arXiv:2111.08276·cs.CL·June 2, 2022·96 cites



TL;DR

This paper introduces X-VLM, a multi-grained vision language pre-training method that aligns texts with visual concepts at multiple levels, improving performance on various downstream tasks.

Contribution

The paper presents a novel multi-grained alignment approach for vision language pre-training, addressing limitations of object-centric methods in capturing relations among multiple objects.

Findings

01. X-VLM outperforms state-of-the-art methods on multiple vision language tasks.

02. Effective learning of multi-grained alignments enhances downstream task performance.

03. The method successfully locates visual concepts and aligns them with texts at multiple granularities.

Abstract

Most existing methods in vision language pre-training rely on object-centric features extracted through object detection and make fine-grained alignments between the extracted features and texts. It is challenging for these methods to learn relations among multiple objects. To this end, we propose a new method called X-VLM to perform "multi-grained vision language pre-training." The key to learning multi-grained alignments is to locate visual concepts in the image given the associated texts and, at the same time, align the texts with the visual concepts, where the alignments are at multiple granularities. Experimental results show that X-VLM effectively applies the learned multi-grained alignments to many downstream vision language tasks and consistently outperforms state-of-the-art methods.
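
The abstract names two coupled objectives (locating a visual concept given its text, and aligning the text with that concept) without giving the loss functions. The following is a minimal sketch of how such a pair of objectives could be scored, assuming the located concept is a bounding box compared to a reference box with an L1 plus IoU penalty, and assuming alignment is scored with a symmetric contrastive loss over a text-by-concept similarity matrix. All function names, the box format, and the temperature value are hypothetical choices for illustration and are not taken from the zengyan-97/x-vlm repository.

```python
import math

def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def localization_loss(pred, target):
    """Assumed form: mean L1 distance on coordinates plus (1 - IoU)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / 4.0
    return l1 + (1.0 - box_iou(pred, target))

def _cross_entropy_diag(sim, temperature):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)

def alignment_loss(sim, temperature=0.07):
    """Assumed form: symmetric contrastive loss over a square matrix.

    sim[i][j] is the similarity between text i and visual concept j;
    matched pairs sit on the diagonal. A concept may be an object, a
    region, or the whole image, so the same loss covers each granularity.
    """
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (_cross_entropy_diag(sim, temperature)
                  + _cross_entropy_diag(transposed, temperature))

def multi_grained_loss(pred_boxes, gold_boxes, sim):
    """Sum of the mean localization loss and the alignment loss."""
    loc = sum(localization_loss(p, g) for p, g in zip(pred_boxes, gold_boxes))
    return loc / len(pred_boxes) + alignment_loss(sim)
```

In this sketch a batch mixes text-concept pairs at different granularities, and `multi_grained_loss` is lowest when each predicted box overlaps its reference box and each text is most similar to its own concept.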


Code & Models

Repositories

zengyan-97/x-vlm (PyTorch, official)


Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques