Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?

Fei Wang; Liang Ding; Jun Rao; Ye Liu; Li Shen; Changxing Ding

arXiv:2308.12898 · cs.MM · August 28, 2023 · 1 citation

TL;DR

This paper investigates how linguistic knowledge like semantics and syntax can enhance multimodal alignment in vision-language pretraining, introducing a new benchmark to evaluate such linguistic understanding in VLP models.

Contribution

The paper presents SNARE, the first large-scale benchmark for probing linguistic knowledge in VLP models, and provides analysis on how these models understand complex linguistic structures.

Findings

01 VLP models rely mainly on content words, showing insensitivity to complex syntax.

02 Models have limited understanding of negation and sentence-combination.

03 Challenges remain in recognizing actions, spatial relations, and verifying triples.

Abstract

The multimedia community has shown significant interest in perceiving and representing the physical world with multimodal pretrained neural network models, and among them, vision-language pretraining (VLP) is currently the most captivating topic. However, few efforts have explored 1) whether essential linguistic knowledge (e.g., semantics and syntax) can be extracted during VLP, and 2) how such linguistic knowledge impacts or enhances multimodal alignment. In response, we aim to elucidate the impact of comprehensive linguistic knowledge, including semantic expression and syntactic structure, on multimodal alignment. Specifically, we design and release SNARE, the first large-scale multimodal alignment probing benchmark, to detect the vital linguistic components, e.g., lexical, semantic, and syntactic knowledge, containing four tasks:…

Code & Models

Repositories

wangfei-2019/snare
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Subtitles and Audiovisual Media