Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Zhicheng Huang; Zhaoyang Zeng; Yupan Huang; Bei Liu; Dongmei Fu; Jianlong Fu

arXiv:2104.03135·cs.CV·April 9, 2021·24 cites

TL;DR

This paper introduces SOHO, an end-to-end vision-language pre-training model that processes entire images without bounding boxes, enabling faster inference and improved understanding of visual semantics through a visual dictionary and masked visual modeling.

Contribution

SOHO is the first end-to-end model for vision-language pre-training that learns from whole images without bounding box annotations, improving speed and semantic comprehension.

Findings

01. Achieves 2.0% higher R@1 on MSCOCO text retrieval
02. Improves accuracy by 1.5% on NLVR$^2$
03. Increases SNLI-VE test accuracy by 6.7%
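
For readers unfamiliar with the R@1 metric cited in the first finding, the following is a minimal sketch of how recall at rank k is commonly computed for retrieval. The function name `recall_at_k`, the score matrix, and the ground-truth sets are hypothetical illustrations, not data or code from the paper.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries whose top-k ranked candidates include a correct match.

    similarity[i][j] is a score between query i and candidate j (higher is better).
    ground_truth[i] is the set of candidate indices that are correct for query i.
    """
    hits = 0
    for scores, relevant in zip(similarity, ground_truth):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if any(j in relevant for j in ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy example (made-up scores): 3 image queries, 4 candidate captions.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # top-1 is caption 0, which is correct
    [0.2, 0.4, 0.8, 0.1],  # top-1 is caption 2, but the correct caption is 1
    [0.1, 0.2, 0.3, 0.7],  # top-1 is caption 3, which is correct
]
toy_truth = [{0}, {1}, {3}]
toy_r1 = recall_at_k(toy_scores, toy_truth, k=1)  # 2 of 3 queries hit
```

Under this definition, a 2.0% absolute gain in R@1 means that two more queries per hundred have a correct caption ranked first.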

Abstract

We study joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT), which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step-by-step. As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages. In this paper, we propose SOHO to "See Out of tHe bOx" that takes a whole image as input, and learns vision-language representation in an end-to-end manner. SOHO does not require bounding box annotations, which enables inference 10 times faster than region-based approaches. In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal…
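
The visual dictionary mentioned in the abstract can be pictured as a codebook that maps each grid feature from the CNN to one of a fixed set of embeddings. The abstract is truncated before it gives details, so the sketch below assumes a plain nearest-neighbor lookup under squared L2 distance; that rule, the function names `nearest_entry` and `quantize`, and all numbers are illustrative assumptions rather than the authors' implementation.

```python
def nearest_entry(feature, dictionary):
    """Index of the dictionary embedding closest to `feature`.

    Assumption: closeness is squared L2 distance (not confirmed by the text above).
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(dictionary)), key=lambda i: sq_dist(feature, dictionary[i]))


def quantize(grid_features, dictionary):
    """Replace each grid feature with its nearest dictionary embedding."""
    indices = [nearest_entry(f, dictionary) for f in grid_features]
    return indices, [dictionary[i] for i in indices]


# Toy example: a 3-entry dictionary of 2-D embeddings and 4 grid features.
toy_dictionary = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
toy_features = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.8], [0.6, 0.6]]
toy_indices, toy_quantized = quantize(toy_features, toy_dictionary)
```

In this toy run, features that land near the same embedding share a dictionary index, which gives a compact, discrete view of the image. Such indices could plausibly serve as prediction targets for masked visual modeling, though that link is also an assumption here.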

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · SOHO · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Byte Pair Encoding · Attention Is All You Need · Residual Connection · Layer Normalization · Adam