DenseCap: Fully Convolutional Localization Networks for Dense Captioning
Justin Johnson, Andrej Karpathy, Li Fei-Fei

TL;DR
This paper introduces dense captioning, a task that localizes and describes multiple regions in images, and proposes an end-to-end fully convolutional network that improves speed and accuracy on the Visual Genome dataset.
Contribution
The paper presents a novel Fully Convolutional Localization Network (FCLN) for dense captioning, enabling joint localization and description in a single, efficient model trained end-to-end.
Findings
Achieved faster inference and higher accuracy than baselines built on prior state-of-the-art approaches.
Successfully localized and described multiple regions in images.
Demonstrated effectiveness on the large-scale Visual Genome dataset.
Abstract
We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions consist of a single word, and image captioning when one predicted region covers the full image. To address the localization and description task jointly, we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a Convolutional Network, a novel dense localization layer, and a Recurrent Neural Network language model that generates the label sequences. We evaluate our network on the Visual Genome dataset, which comprises 94,000 images and 4,100,000…
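The way dense captioning generalizes the two older tasks can be made concrete with a small data-structure sketch. This is an illustration only, not code from the paper: the `RegionCaption` type, the `(x_min, y_min, x_max, y_max)` box convention, the `score` field, the two helper functions, and all example boxes and captions are assumptions made up for this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class RegionCaption:
    """One dense-captioning prediction: a region plus its description."""
    box: Box
    caption: str
    score: float  # hypothetical confidence value; higher means more confident


def is_detection_like(preds: List[RegionCaption]) -> bool:
    """True when every description is a single word (the object-detection special case)."""
    return len(preds) > 0 and all(len(p.caption.split()) == 1 for p in preds)


def is_full_image_caption(preds: List[RegionCaption], width: float, height: float) -> bool:
    """True when there is exactly one region and it covers the whole image
    (the image-captioning special case)."""
    return len(preds) == 1 and preds[0].box == (0.0, 0.0, width, height)


# Made-up example outputs for a 640x480 image.
dense = [
    RegionCaption((10.0, 20.0, 110.0, 220.0), "a man riding a horse", 0.9),
    RegionCaption((150.0, 40.0, 300.0, 200.0), "green trees in the background", 0.7),
]
detection = [RegionCaption((10.0, 20.0, 110.0, 220.0), "horse", 0.9)]
captioning = [RegionCaption((0.0, 0.0, 640.0, 480.0), "a man rides a horse in a field", 0.8)]
```

In this framing, a dense captioning system returns a list like `dense`: several regions, each with a free-form phrase. Restricting captions to one word gives something shaped like `detection`, and restricting the output to a single full-image box gives something shaped like `captioning`.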
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
