Single-Stream Multi-Level Alignment for Vision-Language Pretraining
Zaid Khan, Vijay Kumar BG, Xiang Yu, Samuel Schulter, Manmohan Chandraker, Yun Fu

TL;DR
This paper introduces a novel single-stream vision-language pretraining method that aligns images and text at multiple levels, improving fine-grained understanding and semantic grounding without dense annotations.
Contribution
The proposed approach employs two new tasks, symmetric cross-modality reconstruction and pseudo-labeled keyword prediction, enabling multi-level alignment in a single-stream architecture.
Findings
Achieves competitive performance on image-text retrieval and grounding
Demonstrates improved data efficiency over existing models
Enhances semantic grounding and fine-grained alignment
Abstract
Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text representations only on a global level. Earlier, supervised, non-contrastive methods were capable of finer-grained alignment, but required dense annotations that were not scalable. We propose a single-stream architecture that aligns images and language at multiple levels: global, fine-grained patch-token, and conceptual/semantic, using two novel tasks: symmetric cross-modality reconstruction (XMM) and pseudo-labeled keyword prediction (PSL). In XMM, we mask input tokens from one modality and use cross-modal information to reconstruct the masked tokens, thus improving fine-grained alignment between the two modalities. In PSL, we use attention to select keywords in a caption, use a…
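The masking step described for XMM can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 30% mask ratio, and the use of a zero vector in place of a learned mask embedding are all assumptions made for illustration. It shows only how two inputs could be prepared for a single-stream model, one with text tokens masked and the image intact, and one with image patches masked and the text intact, so that each masked modality would have to be reconstructed from the other.

```python
import numpy as np

def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of rows in `tokens` with a zero vector.

    tokens: array of shape (num_tokens, dim).
    Returns (masked_tokens, is_masked), where is_masked is a boolean array.
    """
    num_tokens = tokens.shape[0]
    num_masked = max(1, int(round(mask_ratio * num_tokens)))
    masked_idx = rng.choice(num_tokens, size=num_masked, replace=False)
    is_masked = np.zeros(num_tokens, dtype=bool)
    is_masked[masked_idx] = True
    masked = tokens.copy()
    # Assumption: a zero vector stands in for a learned mask embedding.
    masked[is_masked] = 0.0
    return masked, is_masked

def symmetric_masked_inputs(image_patches, text_tokens, mask_ratio=0.3, seed=0):
    """Build two single-stream sequences, each with one modality masked."""
    rng = np.random.default_rng(seed)
    masked_text, text_mask = mask_tokens(text_tokens, mask_ratio, rng)
    masked_image, image_mask = mask_tokens(image_patches, mask_ratio, rng)
    # Sequence 1: intact image patches followed by masked text tokens.
    seq_text_masked = np.concatenate([image_patches, masked_text], axis=0)
    # Sequence 2: masked image patches followed by intact text tokens.
    seq_image_masked = np.concatenate([masked_image, text_tokens], axis=0)
    return (seq_text_masked, text_mask), (seq_image_masked, image_mask)
```

A reconstruction loss would then be computed only at the positions flagged by each boolean mask. The model and the loss are omitted here because the summary above does not specify them.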
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: ALIGN
