Fine-Grained Semantically Aligned Vision-Language Pre-Training
Juncheng Li, Xin He, Longhui Wei, Long Qian, Linchao Zhu, Lingxi Xie, Yueting Zhuang, Qi Tian, Siliang Tang

TL;DR
LOUPE introduces a novel game-theoretic framework for fine-grained semantic alignment in vision-language pre-training, significantly improving task performance without requiring object-level annotations.
Contribution
It proposes a new game-theoretic approach with an uncertainty-aware neural Shapley interaction module for fine-grained alignment in vision-language models.
Findings
Achieves state-of-the-art results on multiple vision-language tasks.
Performs competitively on object detection and visual grounding without object-level annotations.
Introduces a scalable method for learning fine-grained semantics from large-scale raw data.
Abstract
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts, or advanced cross-modal attention upon image and text features. However, they fail to explicitly learn the fine-grained semantic alignment between visual regions and textual phrases, as only global image-text alignment information is available. In this paper, we introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions. To efficiently compute the game-theoretic interactions, we further propose an uncertainty-aware neural Shapley interaction learning module. Experiments show that LOUPE achieves state-of-the-art performance…
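To make the phrase "game-theoretic interactions" concrete, here is a minimal sketch of the standard pairwise Shapley interaction index from cooperative game theory, computed exactly on a toy game. It is not LOUPE's implementation: the abstract describes a learned, uncertainty-aware module precisely because exact enumeration like this grows exponentially with the number of players. The function name `shapley_interaction`, the toy value function `toy_value`, and the player names are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_interaction(players, value, i, j):
    """Exact pairwise Shapley interaction index of players i and j.

    `value` maps a frozenset of players to a real-valued payoff.
    Enumerates every coalition S that excludes both i and j.
    """
    others = [p for p in players if p not in (i, j)]
    n = len(players)
    total = 0.0
    for size in range(len(others) + 1):
        # Weight of coalitions of this size: |S|! (n - |S| - 2)! / (n - 1)!
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Joint contribution of i and j beyond their separate contributions.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


def toy_value(coalition):
    """Hypothetical payoff: region_a and region_b only score when both are present."""
    score = 0.0
    if {"region_a", "region_b"} <= coalition:
        score += 1.0
    if "region_c" in coalition:
        score += 0.5
    return score


players = ["region_a", "region_b", "region_c"]
ab = shapley_interaction(players, toy_value, "region_a", "region_b")
ac = shapley_interaction(players, toy_value, "region_a", "region_c")
print(ab, ac)
```

In this toy game the pair `region_a`/`region_b` has a positive interaction because the two only contribute jointly, while `region_a`/`region_c` has zero interaction because their contributions are independent. That is the intuition behind using such interactions as a signal for which pieces belong together semantically.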
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
