Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training

Ming Yan; Haiyang Xu; Chenliang Li; Bin Bi; Junfeng Tian; Min Gui; Wei Wang

arXiv:2108.09479 · cs.MM · August 24, 2021 · 6 cites

TL;DR

Grid-VLP introduces a grid-based approach to vision-language pre-training that bypasses object detectors, achieving competitive performance with improved efficiency and end-to-end training capability.

Contribution

The paper presents a grid-based VLP method that removes the object detector from the pipeline, which avoids the expensive region-related steps and makes end-to-end training possible.

Findings

01. Outperforms many region-based VLP methods on key tasks

02. Effective with only in-domain dataset pre-training

03. Supports end-to-end training without object detection constraints

Abstract

Existing approaches to vision-language pre-training (VLP) heavily rely on an object detector based on bounding boxes (regions), where salient objects are first detected from images and then a Transformer-based model is used for cross-modal fusion. Despite their superior performance, these approaches are bounded by the capability of the object detector in terms of both effectiveness and efficiency. Besides, the presence of object detection imposes unnecessary constraints on model designs and makes it difficult to support end-to-end training. In this paper, we revisit grid-based convolutional features for vision-language pre-training, skipping the expensive region-related steps. We propose a simple yet effective grid-based VLP method that works surprisingly well with the grid features. By pre-training only with in-domain datasets, the proposed Grid-VLP method can outperform most…
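The grid-feature idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give architectural details, so the function names (`grid_to_tokens`, `build_fusion_input`), the row-major flattening, and the text-then-image token order are all assumptions made for illustration. The point is only that each spatial cell of a convolutional feature map becomes one visual token, with no region proposals or bounding boxes involved.

```python
from typing import List

Vector = List[float]


def grid_to_tokens(feature_map: List[List[List[float]]]) -> List[Vector]:
    """Flatten a C x H x W feature map into H*W visual tokens of size C.

    Cells are visited in row-major order (an assumption, not from the paper).
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    tokens: List[Vector] = []
    for row in range(height):
        for col in range(width):
            # One token per grid cell: gather that cell's value from every channel.
            tokens.append([feature_map[c][row][col] for c in range(channels)])
    return tokens


def build_fusion_input(text_embeddings: List[Vector],
                       grid_tokens: List[Vector]) -> List[Vector]:
    """Concatenate text and visual tokens into one sequence.

    The combined sequence is what a Transformer would consume for
    cross-modal fusion. Text-first ordering is a hypothetical choice.
    """
    if text_embeddings and grid_tokens:
        if len(text_embeddings[0]) != len(grid_tokens[0]):
            raise ValueError("text and visual tokens must share one embedding size")
    return text_embeddings + grid_tokens


# Toy example: 2 channels over a 2 x 3 grid, plus two text embeddings of size 2.
toy_feature_map = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]],
]
toy_text = [[0.1, 0.2], [0.3, 0.4]]

visual_tokens = grid_to_tokens(toy_feature_map)
fusion_sequence = build_fusion_input(toy_text, visual_tokens)
```

In this toy case the 2 x 3 grid yields six visual tokens, so the fused sequence has eight entries. A real model would typically project the grid features to the Transformer's hidden size before concatenation; that projection is omitted here to keep the sketch dependency-free.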

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization