VinVL: Revisiting Visual Representations in Vision-Language Models

Pengchuan Zhang; Xiujun Li; Xiaowei Hu; Jianwei Yang; Lei Zhang; Lijuan Wang; Yejin Choi; Jianfeng Gao

arXiv:2101.00529 · cs.CV · March 11, 2021 · 60 cites

TL;DR

This paper enhances vision-language models by developing a larger, better-designed object detection model that provides richer visual features, significantly improving performance across multiple VL benchmarks.

Contribution

The paper introduces an improved object detection model pre-trained on larger datasets, which enhances visual feature quality for VL tasks, leading to state-of-the-art results.

Findings

01 Significant performance improvements on seven VL benchmarks.

02 Visual features from the new detector outperform previous features.

03 Enhanced object-centric representations benefit VL model accuracy.

Abstract

This paper presents a detailed study of improving visual representations for vision-language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model (Anderson et al., 2018), the new model is bigger, better designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model untouched, we show that visual features matter significantly in VL models. In our experiments, we feed the visual features generated by the new object detection model into a Transformer-based VL…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques