VinVL: Revisiting Visual Representations in Vision-Language Models
Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang,, Lijuan Wang, Yejin Choi, Jianfeng Gao

TL;DR
This paper enhances vision-language models by developing a larger, better-designed object detection model that provides richer visual features, significantly improving performance across multiple VL benchmarks.
Contribution
The paper introduces an improved object detection model pre-trained on larger datasets, which enhances visual feature quality for VL tasks, leading to state-of-the-art results.
Findings
Significant performance improvements on seven VL benchmarks.
Visual features from the new detector outperform previous features.
Enhanced object-centric representations benefit VL model accuracy.
Abstract
This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used \emph{bottom-up and top-down} model \cite{anderson2018bottom}, the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
