Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu

TL;DR
This paper introduces SOHO, an end-to-end vision-language pre-training model that processes entire images without bounding boxes, enabling faster inference and improved understanding of visual semantics through a visual dictionary and masked visual modeling.
Contribution
SOHO is the first end-to-end model for vision-language pre-training that learns from whole images without bounding box annotations, improving speed and semantic comprehension.
Findings
Achieves 2.0% higher R@1 on MSCOCO text retrieval
Improves accuracy by 1.5% on NLVR$^2$
Increases SNLI-VE test accuracy by 6.7%
Abstract
We study joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT) which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step-by-step. As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages. In this paper, we propose SOHO to "See Out of tHe bOx" that takes a whole image as input, and learns vision-language representation in an end-to-end manner. SOHO does not require bounding box annotations which enables inference 10 times faster than region-based approaches. In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal…
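As a rough illustration of the visual dictionary idea mentioned in the abstract, the sketch below maps each image feature vector to the index of its nearest entry in a small codebook. The nearest-neighbor rule, the function names `squared_l2` and `quantize`, and the toy data are all assumptions made for illustration; the truncated abstract does not specify how SOHO builds or updates its dictionary.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[Vector], dictionary: List[Vector]) -> List[int]:
    """Return, for each feature, the index of its nearest dictionary entry.

    Hypothetical helper: assumes a plain nearest-neighbor lookup.
    """
    return [
        min(range(len(dictionary)), key=lambda k: squared_l2(f, dictionary[k]))
        for f in features
    ]


# Toy data (assumed): three 2-D "image features" and a two-entry dictionary.
dictionary = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, -0.2], [0.9, 1.2], [0.4, 0.3]]
indices = quantize(features, dictionary)
```

In this toy setup, features that land on the same index share one dictionary entry, which is one way a compact set of visual tokens could stand in for dense image features.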
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · SOHO · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Byte Pair Encoding · Attention Is All You Need · Residual Connection · Layer Normalization · Adam
