SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels
Chenliang Li, Ming Yan, Haiyang Xu, Fuli Luo, Wei Wang, Bin Bi, Songfang Huang

TL;DR
SemVLP introduces a novel vision-language pre-training approach that aligns both low-level and high-level semantics across modalities using a shared Transformer with flexible attention, improving understanding of cross-modal data.
Contribution
It proposes a joint alignment method for multiple semantic levels in vision-language pre-training, combining single-stream and two-stream strategies within a shared model.
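To make the idea of one shared model serving both strategies more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation. The function names (`attend`, `shared_block`) and the mode strings are hypothetical. The attention has no learned projections. The routing is also an assumption: self-attention over concatenated tokens in single-stream mode, and separate self-attention followed by text-to-image cross-attention in two-stream mode.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention with no learned weights (illustrative only).

    Each query vector is replaced by a weighted average of `keys_values`,
    weighted by the softmax of its scaled dot products with them.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return outputs


def shared_block(text, image, mode):
    """One hypothetical shared block that can run in either mode.

    Assumption: "single_stream" fuses the modalities early by attending over
    the concatenated sequence, while "two_stream" encodes each modality on
    its own and only then lets text tokens query the image tokens.
    """
    if mode == "single_stream":
        joint = text + image                 # feature-level concatenation
        fused = attend(joint, joint)         # every token sees every token
        return fused[: len(text)], fused[len(text):]
    if mode == "two_stream":
        text_hidden = attend(text, text)     # text-only self-attention
        image_hidden = attend(image, image)  # image-only self-attention
        text_out = attend(text_hidden, image_hidden)  # cross-modal attention
        return text_out, image_hidden
    raise ValueError(f"unknown mode: {mode}")


# Toy inputs: 2 text tokens and 3 image regions, each a 2-d feature vector.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_regions = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]

single_text, single_image = shared_block(text_tokens, image_regions, "single_stream")
two_text, two_image = shared_block(text_tokens, image_regions, "two_stream")
```

In this sketch the image output of the two-stream mode does not depend on the text at all, whereas in single-stream mode every token is mixed with every other token from the first step. That difference is one way to picture low-level versus higher-level alignment.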
Findings
Enhanced cross-modal representation alignment at multiple semantic levels.
Improved performance on four vision-language understanding tasks.
Effective integration of fine-grained and high-level semantic alignment.
Abstract
Vision-language pre-training (VLP) on large-scale image-text pairs has recently witnessed rapid progress for learning cross-modal representations. Existing pre-training methods either directly concatenate image representation and text representation at a feature level as input to a single-stream Transformer, or use a two-stream cross-modal Transformer to align the image-text representation at a high-level semantic space. In real-world image-text data, we observe that it is easy for some of the image-text pairs to align simple semantics on both modalities, while others may be related after higher-level abstraction. Therefore, in this paper, we propose a new pre-training method SemVLP, which jointly aligns both the low-level and high-level semantics between image and text representations. The model is pre-trained iteratively with two prevalent fashions: single-stream pre-training to align…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Byte Pair Encoding · Attention Is All You Need · Label Smoothing · Dropout · Residual Connection
