Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering

Soravit Changpinyo; Bo Pang; Piyush Sharma; Radu Soricut

arXiv:1909.02097·cs.CL·September 6, 2019

TL;DR

Decoupling box proposal from featurization and using ultrafine-grained semantic labels improves transfer learning, yielding better image captioning and visual question answering models.

Contribution

We decouple box proposal from featurization in object detection and leverage ultrafine-grained semantic labels, improving transfer learning for vision-language tasks.

Findings

01. Improved performance on image captioning benchmarks.

02. Enhanced results in visual question answering.

03. Effective transfer learning from decoupled detection models.

Abstract

Object detection plays an important role in current solutions to vision and language tasks like image captioning and visual question answering. However, popular models like Faster R-CNN rely on a costly process of annotating ground-truths for both the bounding boxes and their corresponding semantic labels, making it less amenable as a primitive task for transfer learning. In this paper, we examine the effect of decoupling box proposal and featurization for down-stream tasks. The key insight is that this allows us to leverage a large amount of labeled annotations that were previously unavailable for standard object detection benchmarks. Empirically, we demonstrate that this leads to effective transfer learning and improved image captioning and visual question answering models, as measured on publicly available benchmarks.
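The decoupling described in the abstract can be illustrated with a minimal sketch in which a box proposer and a region featurizer are two independent, swappable components. This is a toy illustration under stated assumptions, not the authors' implementation: the names `propose_boxes`, `crop`, `mean_max_featurizer`, and `featurize_regions` are hypothetical, the proposer simply tiles the image, and the featurizer is a stand-in for a model trained separately on semantic labels.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), upper bounds exclusive
Image = List[List[float]]         # toy grayscale image: rows of pixel values


def propose_boxes(image: Image, size: int = 2) -> List[Box]:
    """Hypothetical class-agnostic proposer: tiles the image into boxes.

    It only answers "where are the regions?" and needs no semantic labels.
    """
    height, width = len(image), len(image[0])
    return [
        (x, y, min(x + size, width), min(y + size, height))
        for y in range(0, height, size)
        for x in range(0, width, size)
    ]


def crop(image: Image, box: Box) -> Image:
    """Cut the pixels inside a box out of the image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def mean_max_featurizer(patch: Image) -> List[float]:
    """Stand-in featurizer (assumption): returns [mean, max] of the patch.

    In a decoupled setup this component could be trained on image-level
    labels alone, since it never has to predict box coordinates.
    """
    pixels = [p for row in patch for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def featurize_regions(
    image: Image,
    boxes: List[Box],
    featurizer: Callable[[Image], List[float]],
) -> List[List[float]]:
    """Apply any featurizer to any set of boxes; neither depends on the other."""
    return [featurizer(crop(image, box)) for box in boxes]


image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [2.0, 2.0, 3.0, 3.0],
    [2.0, 2.0, 3.0, 3.0],
]
boxes = propose_boxes(image)
features = featurize_regions(image, boxes, mean_max_featurizer)
```

Because `featurize_regions` accepts the featurizer as an argument, either component can be replaced or retrained without touching the other. A downstream captioning or question answering model would consume `features` in this sketch.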

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Region Proposal Network · Softmax · Convolution · RoIPool · Faster R-CNN