SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels

Chenliang Li; Ming Yan; Haiyang Xu; Fuli Luo; Wei Wang; Bin Bi; Songfang Huang

arXiv:2103.07829·cs.CL·March 16, 2021·20 cites

Open Access

TL;DR

SemVLP introduces a novel vision-language pre-training approach that aligns both low-level and high-level semantics across modalities using a shared Transformer with flexible attention, improving understanding of cross-modal data.

Contribution

It proposes a joint alignment method for multiple semantic levels in vision-language pre-training, combining single-stream and two-stream strategies within a shared model.

Findings

01 Enhanced cross-modal representation alignment at multiple semantic levels.

02 Improved performance on four vision-language understanding tasks.

03 Effective integration of fine-grained and high-level semantic alignment.

Abstract

Vision-language pre-training (VLP) on large-scale image-text pairs has recently witnessed rapid progress for learning cross-modal representations. Existing pre-training methods either directly concatenate image representation and text representation at a feature level as input to a single-stream Transformer, or use a two-stream cross-modal Transformer to align the image-text representation at a high-level semantic space. In real-world image-text data, we observe that it is easy for some of the image-text pairs to align simple semantics on both modalities, while others may be related after higher-level abstraction. Therefore, in this paper, we propose a new pre-training method SemVLP, which jointly aligns both the low-level and high-level semantics between image and text representations. The model is pre-trained iteratively with two prevalent fashions: single-stream pre-training to align…
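
The two input strategies contrasted in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the single attention layer, the feature sizes, the random inputs, and the function names (`attention`, `single_stream`, `two_stream`) are all assumptions made for illustration. In particular, modeling the two-stream path as separate self-attention followed by text-to-image cross-attention is one common reading of "two-stream", and reusing the same weights in both paths stands in for the shared model mentioned above. The abstract is truncated, so how SemVLP actually switches between the two modes is not shown here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values, w_q, w_k, w_v):
    # Scaled dot-product attention with one head (illustrative only).
    q = queries @ w_q
    k = keys_values @ w_k
    v = keys_values @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
dim = 8                                # hypothetical feature size
text = rng.normal(size=(5, dim))       # 5 hypothetical text token features
image = rng.normal(size=(3, dim))      # 3 hypothetical image region features

# One set of weights reused by both paths (assumption: stands in for sharing).
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

def single_stream(text, image):
    # Concatenate both modalities at the feature level, then attend jointly.
    joint = np.concatenate([text, image], axis=0)
    return attention(joint, joint, w_q, w_k, w_v)

def two_stream(text, image):
    # Encode each modality on its own, then let text attend to image.
    text_h = attention(text, text, w_q, w_k, w_v)
    image_h = attention(image, image, w_q, w_k, w_v)
    return attention(text_h, image_h, w_q, w_k, w_v)

single_out = single_stream(text, image)
two_out = two_stream(text, image)
print(single_out.shape, two_out.shape)
```

In the single-stream path every text token can attend to every image region from the first layer, which matches the abstract's "feature level" alignment. In the two-stream path the modalities only interact after each has been encoded separately, which corresponds to alignment in a higher-level semantic space.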

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Byte Pair Encoding · Attention Is All You Need · Label Smoothing · Dropout · Residual Connection