Single-Stream Multi-Level Alignment for Vision-Language Pretraining

Zaid Khan; Vijay Kumar BG; Xiang Yu; Samuel Schulter; Manmohan Chandraker; Yun Fu

arXiv:2203.14395 · cs.CV · July 29, 2022

TL;DR

This paper introduces a novel single-stream vision-language pretraining method that aligns images and text at multiple levels, improving fine-grained understanding and semantic grounding without dense annotations.

Contribution

The proposed approach employs two new tasks, symmetric cross-modality reconstruction and pseudo-labeled keyword prediction, enabling multi-level alignment in a single-stream architecture.

Findings

01. Achieves competitive performance on image-text retrieval and grounding
02. Demonstrates improved data efficiency over existing models
03. Enhances semantic grounding and fine-grained alignment

Abstract

Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text representations only on a global level. Earlier, supervised, non-contrastive methods were capable of finer-grained alignment, but required dense annotations that were not scalable. We propose a single-stream architecture that aligns images and language at multiple levels: global, fine-grained patch-token, and conceptual/semantic, using two novel tasks: symmetric cross-modality reconstruction (XMM) and pseudo-labeled keyword prediction (PSL). In XMM, we mask input tokens from one modality and use cross-modal information to reconstruct the masked token, thus improving fine-grained alignment between the two modalities. In PSL, we use attention to select keywords in a caption, use a…
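The masking step that XMM relies on can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_one_modality`, the `[MASK]` placeholder, the 0.3 masking ratio, and the string stand-ins for patches and tokens are all assumptions made for illustration. It covers only input preparation (mask one modality, keep the other intact, record what must be reconstructed), not the reconstruction model itself.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder, not taken from the paper


def mask_one_modality(image_patches, text_tokens, modality, ratio=0.3, seed=0):
    """Mask a fraction of one modality and leave the other untouched.

    Returns (image_input, text_input, targets), where targets maps each
    masked position to the original item a model would have to reconstruct.
    The ratio of 0.3 is an illustrative assumption.
    """
    if modality not in ("image", "text"):
        raise ValueError("modality must be 'image' or 'text'")
    rng = random.Random(seed)
    source = image_patches if modality == "image" else text_tokens
    # Mask at least one item, but never more than the sequence holds.
    n_mask = min(len(source), max(1, round(ratio * len(source))))
    positions = sorted(rng.sample(range(len(source)), n_mask))
    targets = {i: source[i] for i in positions}
    masked = [MASK if i in targets else item for i, item in enumerate(source)]
    if modality == "image":
        return masked, list(text_tokens), targets
    return list(image_patches), masked, targets


patches = ["patch_%d" % i for i in range(6)]  # stand-ins for image patches
caption = ["a", "dog", "on", "a", "beach"]

# Symmetric use: mask the text while the image stays intact, then the reverse.
img_in, txt_in, txt_targets = mask_one_modality(patches, caption, "text")
img_in2, txt_in2, img_targets = mask_one_modality(patches, caption, "image")
```

In each call the unmasked modality is passed through unchanged, so any model trained to recover `targets` would have to draw on it, which is the cross-modal dependence the abstract describes.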

Code & Models

Repositories

codezakh/SIMLA (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: ALIGN