Probing Cross-modal Semantics Alignment Capability from the Textual Perspective
Zheng Ma, Shi Zong, Mianzhi Pan, Jianbing Zhang, Shujian Huang, Xinyu Dai, Jiajun Chen

TL;DR
This paper introduces a novel image captioning-based probing method to analyze how vision-language pre-training models align cross-modal semantics, revealing their focus on object-word alignment over global semantics and their preference for fixed sentence patterns.
Contribution
The paper presents a new probing approach using image captioning to empirically study cross-modal semantics alignment in VLP models, providing detailed insights into their inner workings.
Findings
VLP models mainly align objects with visual words
Models prefer fixed sentence patterns over diverse expressions
Captions with more visual words are perceived as better aligned
Abstract
In recent years, vision and language pre-training (VLP) models have advanced the state of the art in a variety of cross-modal downstream tasks. Aligning cross-modal semantics is claimed to be one of the essential capabilities of VLP models. However, the inner working mechanism of alignment in VLP models remains unclear. In this paper, we propose a new probing method based on image captioning to empirically study, for the first time, the cross-modal semantics alignment of VLP models. Our probing method builds on the fact that, given an image-caption pair, a VLP model produces a score indicating how well the two modalities are aligned; maximizing this score yields sentences that the model considers well aligned. Analyzing these sentences thus reveals in what way different modalities are aligned and how good these alignments are in VLP models. We…
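The score-maximization idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `toy_alignment_score` is a hypothetical stand-in (token overlap with a set of image tags) and not a real VLP model's image-text score, and `best_caption` only reranks a fixed candidate list, which simplifies the sentence generation the abstract describes.

```python
from typing import Callable, Iterable, Set


def toy_alignment_score(image_tags: Set[str], caption: str) -> float:
    """Hypothetical stand-in for a VLP image-text alignment score.

    Returns the Jaccard overlap between the caption's tokens and a set of
    tags describing the image. A real probe would query a VLP model here.
    """
    tokens = set(caption.lower().split())
    if not tokens:
        return 0.0
    return len(tokens & image_tags) / len(tokens | image_tags)


def best_caption(
    image_tags: Set[str],
    candidates: Iterable[str],
    score: Callable[[Set[str], str], float] = toy_alignment_score,
) -> str:
    """Return the candidate caption that maximizes the alignment score."""
    return max(candidates, key=lambda caption: score(image_tags, caption))


# Illustrative inputs (assumed, not taken from the paper).
image_tags = {"dog", "frisbee", "grass"}
candidates = [
    "a dog catches a frisbee on the grass",
    "a playful afternoon in the park",
    "dog frisbee grass",
]

chosen = best_caption(image_tags, candidates)
print(chosen)
```

Inspecting which candidate wins under a given scorer is the kind of analysis the probe relies on: the selected sentence shows what that scorer rewards. With this toy overlap scorer, the bare list of object words wins, but that says something only about the toy scorer, not about any actual VLP model.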
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Subtitles and Audiovisual Media
Methods: UNiversal Image-TExt Representation Learning · Vision-and-Language BERT · Contrastive Language-Image Pre-training · Learning Cross-Modality Encoder Representations from Transformers
