DenseCap: Fully Convolutional Localization Networks for Dense Captioning

Justin Johnson; Andrej Karpathy; Li Fei-Fei

arXiv:1511.07571 · cs.CV · November 25, 2015 · 83 cites

TL;DR

This paper introduces dense captioning, a task that localizes and describes multiple regions in images, and proposes an end-to-end fully convolutional network that improves speed and accuracy on the Visual Genome dataset.

Contribution

The paper presents a novel Fully Convolutional Localization Network (FCLN) for dense captioning, enabling joint localization and description in a single, efficient model trained end-to-end.
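As a rough illustration of what "joint localization and description in a single model" means structurally, the sketch below chains three stages in one pass: image features, region prediction from those features, and one caption per region. It is a minimal sketch, not the authors' implementation. The function `dense_caption`, the `Box` convention, and the toy stand-in components are all hypothetical names introduced here for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed box convention for this sketch: (x, y, width, height).
Box = Tuple[float, float, float, float]


def dense_caption(
    image: object,
    cnn: Callable[[object], object],
    localize: Callable[[object], Sequence[Tuple[Box, object]]],
    describe: Callable[[object], List[str]],
) -> List[Tuple[Box, str]]:
    """Single pass: image -> features -> regions -> one caption per region."""
    features = cnn(image)  # shared features computed once for the whole image
    results = []
    # Regions come from the model's own features, not from an external proposal step.
    for box, region_features in localize(features):
        tokens = describe(region_features)  # language model emits a token sequence
        results.append((box, " ".join(tokens)))
    return results


# Toy stand-ins (hypothetical) so the sketch runs without any model weights.
def toy_cnn(image):
    return {"size": image}


def toy_localize(features):
    return [((0.0, 0.0, 10.0, 10.0), "a"), ((5.0, 5.0, 20.0, 8.0), "b")]


def toy_describe(region_features):
    return ["region", region_features]


captions = dense_caption((32, 32), toy_cnn, toy_localize, toy_describe)
```

In a real system each stand-in would be a trainable network component, and training all three together is what "end-to-end" refers to.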

Findings

01. Achieved faster inference and higher accuracy than previous methods.

02. Successfully localized and described multiple regions in images.

03. Demonstrated effectiveness on the large-scale Visual Genome dataset.

Abstract

We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions consist of a single word, and image captioning when one predicted region covers the full image. To address the localization and description task jointly, we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a Convolutional Network, a novel dense localization layer, and a Recurrent Neural Network language model that generates the label sequences. We evaluate our network on the Visual Genome dataset, which comprises 94,000 images and 4,100,000…
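The two special cases named in the abstract can be made concrete with a small data-structure sketch. A dense captioning output is treated here as a list of (box, word sequence) pairs. It resembles object detection when every description is a single word, and image captioning when there is exactly one region spanning the whole image. The class `RegionCaption`, both helper functions, and the (x, y, width, height) box convention are assumptions made for this illustration and do not come from the paper or its code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RegionCaption:
    """One dense-captioning prediction: a box plus its description."""
    box: Tuple[float, float, float, float]  # assumed (x, y, width, height)
    words: List[str]


def looks_like_detection(preds: List[RegionCaption]) -> bool:
    """True when every region is described by exactly one word (a class label)."""
    return len(preds) > 0 and all(len(p.words) == 1 for p in preds)


def looks_like_image_captioning(
    preds: List[RegionCaption], image_w: float, image_h: float
) -> bool:
    """True when a single predicted region covers the full image."""
    full_image = (0.0, 0.0, float(image_w), float(image_h))
    return len(preds) == 1 and preds[0].box == full_image


detection_like = [
    RegionCaption((10.0, 20.0, 50.0, 40.0), ["cat"]),
    RegionCaption((80.0, 15.0, 30.0, 60.0), ["lamp"]),
]
caption_like = [
    RegionCaption((0.0, 0.0, 640.0, 480.0), ["a", "cat", "next", "to", "a", "lamp"]),
]
```

The general task sits between these extremes: many regions, each with a multi-word description.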

Code & Models

Repositories

jcjohnson/densecap (Torch, official)

Videos

DenseCap: Fully Convolutional Localization Networks for Dense Captioning · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings