Probing Cross-modal Semantics Alignment Capability from the Textual Perspective

Zheng Ma; Shi Zong; Mianzhi Pan; Jianbing Zhang; Shujian Huang; Xinyu Dai; Jiajun Chen

arXiv:2210.09550 · cs.CL · October 19, 2022

Open Access

TL;DR

This paper introduces a novel image captioning-based probing method to analyze how vision-language pre-training models align cross-modal semantics, revealing their focus on object-word alignment over global semantics and their preference for fixed sentence patterns.

Contribution

The paper presents a new probing approach using image captioning to empirically study cross-modal semantics alignment in VLP models, providing detailed insights into their inner workings.

Findings

01. VLP models mainly align objects with visual words

02. Models prefer fixed sentence patterns over diverse expressions

03. Captions with more visual words are perceived as better aligned

Abstract

In recent years, vision and language pre-training (VLP) models have advanced the state-of-the-art results in a variety of cross-modal downstream tasks. Aligning cross-modal semantics is claimed to be one of the essential capabilities of VLP models. However, the inner working mechanism of alignment in VLP models remains unclear. In this paper, we propose a new probing method that is based on image captioning to first empirically study the cross-modal semantics alignment of VLP models. Our probing method is built upon the fact that given an image-caption pair, the VLP models will give a score, indicating how well two modalities are aligned; maximizing such scores will generate sentences that VLP models believe are of good alignment. Analyzing these sentences thus will reveal in what way different modalities are aligned and how good these alignments are in VLP models. We…
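
The score-maximization idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `toy_alignment_score` is a hypothetical stand-in for a VLP model's image-caption matching score, the image is reduced to a set of object names, and greedy word-by-word search stands in for whatever optimization procedure the paper actually uses, which the text above does not specify.

```python
from typing import Callable, List, Set

Scorer = Callable[[Set[str], List[str]], float]


def toy_alignment_score(image_objects: Set[str], caption: List[str]) -> float:
    """Hypothetical stand-in for a VLP image-caption alignment score.

    Rewards distinct caption tokens that name an object in the image and
    applies a small per-token length penalty. A real probe would query a
    VLP model here instead.
    """
    matched = len(image_objects & set(caption))
    return matched - 0.1 * len(caption)


def greedy_caption(
    image_objects: Set[str],
    vocab: List[str],
    score: Scorer,
    max_len: int = 5,
) -> List[str]:
    """Build a caption word by word, keeping a word only if it raises the score."""
    caption: List[str] = []
    best = score(image_objects, caption)
    for _ in range(max_len):
        # Score every one-word extension of the current caption.
        candidates = [(score(image_objects, caption + [w]), w) for w in vocab]
        top_score, top_word = max(candidates)
        if top_score <= best:
            break  # no extension improves the alignment score
        caption.append(top_word)
        best = top_score
    return caption


if __name__ == "__main__":
    objects = {"dog", "ball"}  # assumed object labels for one image
    vocab = ["a", "dog", "ball", "plays", "with"]
    print(greedy_caption(objects, vocab, toy_alignment_score))
```

The generated captions are then the object of analysis: inspecting which words the scorer rewards shows what it treats as good alignment. Any pattern produced by this toy scorer reflects how the toy was written, not evidence about real VLP models.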

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Subtitles and Audiovisual Media

Methods: UNiversal Image-TExt Representation Learning · Vision-and-Language BERT · Contrastive Language-Image Pre-training · Learning Cross-Modality Encoder Representations from Transformers