Fine-Grained Semantically Aligned Vision-Language Pre-Training

Juncheng Li; Xin He; Longhui Wei; Long Qian; Linchao Zhu; Lingxi Xie; Yueting Zhuang; Qi Tian; Siliang Tang

arXiv:2208.02515·cs.CV·September 20, 2022·29 cites

TL;DR

LOUPE is a vision-language pre-training framework that learns fine-grained semantic alignment between visual regions and textual phrases through game-theoretic interactions, without requiring object-level annotations.

Contribution

The paper proposes a game-theoretic approach to fine-grained alignment in vision-language models, together with an uncertainty-aware neural Shapley interaction learning module that makes the interactions efficient to compute.

Findings

01. Achieves state-of-the-art results on multiple vision-language tasks.

02. Performs competitively on object detection and visual grounding without object-level annotations.

03. Introduces a scalable method for learning fine-grained semantics from large-scale raw data.

Abstract

Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts, or advanced cross-modal attention upon image and text features. However, they fail to explicitly learn the fine-grained semantic alignment between visual regions and textual phrases, as only global image-text alignment information is available. In this paper, we introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions. To efficiently compute the game-theoretic interactions, we further propose an uncertainty-aware neural Shapley interaction learning module. Experiments show that LOUPE achieves state-of-the-art performance…
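
To make the term "game-theoretic interactions" concrete, here is a minimal Python sketch of exact Shapley values and the standard pairwise Shapley interaction index on a toy cooperative game. It is not LOUPE's implementation: the abstract does not give the paper's exact interaction definition, so the pairwise index below is a textbook stand-in, and `toy_value`, `WEIGHTS`, and `BONUS` are hypothetical names invented for illustration. Players can be read as image patches or word tokens, and the value function as a score for a subset of them.

```python
from itertools import combinations
from math import factorial


def subsets(players):
    """Yield every subset of `players` as a frozenset."""
    for size in range(len(players) + 1):
        for combo in combinations(players, size):
            yield frozenset(combo)


def shapley_value(value, players, i):
    """Exact Shapley value of player i under the value function `value`."""
    n = len(players)
    others = [p for p in players if p != i]
    total = 0.0
    for s in subsets(others):
        weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
        total += weight * (value(s | {i}) - value(s))
    return total


def shapley_interaction(value, players, i, j):
    """Pairwise Shapley interaction index between players i and j.

    Textbook definition, used here as a stand-in for the paper's own
    interaction measure (an assumption, not confirmed by the abstract).
    """
    n = len(players)
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for s in subsets(others):
        weight = factorial(len(s)) * factorial(n - len(s) - 2) / factorial(n - 1)
        # Extra value created only when i and j are present together.
        delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
        total += weight * delta
    return total


# Hypothetical toy game: each player adds its own weight, and players 0 and 1
# earn a bonus only when both are in the coalition.
PLAYERS = [0, 1, 2]
WEIGHTS = {0: 1.0, 1: 2.0, 2: 0.5}
BONUS = 3.0


def toy_value(coalition):
    score = sum(WEIGHTS[p] for p in coalition)
    if 0 in coalition and 1 in coalition:
        score += BONUS
    return score


if __name__ == "__main__":
    print(shapley_interaction(toy_value, PLAYERS, 0, 1))  # pair that shares the bonus
    print(shapley_interaction(toy_value, PLAYERS, 0, 2))  # pair with no joint effect
```

In this toy game the interaction between players 0 and 1 recovers the bonus, while players 0 and 2 have zero interaction. The enumeration visits every subset, so its cost grows exponentially with the number of players, which is the kind of expense the abstract's efficiency-oriented learning module is meant to avoid.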

Code & Models

Repositories

yyjmjc/loupe (official)

Videos

Fine-Grained Semantically Aligned Vision-Language Pre-Training · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques