Deep Visual-Semantic Alignments for Generating Image Descriptions

Andrej Karpathy; Li Fei-Fei

arXiv:1412.2306 · cs.CV · April 15, 2015 · 147 citations

TL;DR

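This paper introduces a model that aligns images and their sentence descriptions through a multimodal embedding, then uses the inferred alignments to generate natural language descriptions of images and their regions. The alignment model achieves state-of-the-art retrieval results on the Flickr8K, Flickr30K, and MSCOCO datasets.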

Contribution

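The paper presents an alignment model that combines Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. It also describes a Multimodal Recurrent Neural Network that uses the inferred alignments to learn to generate descriptions of image regions.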

Findings

01

State-of-the-art retrieval performance on the Flickr8K, Flickr30K, and MSCOCO datasets

02

Generated descriptions significantly outperform retrieval baselines

03

Effective alignment of visual regions with language descriptions

Abstract

We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both…
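The abstract names a structured objective that aligns regions and sentences in a shared embedding, but it does not spell out its form. The following is a minimal sketch of one way such an objective could look, under stated assumptions: region and word vectors already live in the same embedding space, an image-sentence score sums each word's best-matching region, and training uses a max-margin ranking loss over a batch of matching pairs. The function names (`image_sentence_score`, `ranking_loss`) and the `margin` parameter are hypothetical and chosen for illustration only.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def image_sentence_score(regions: Sequence[Vector], words: Sequence[Vector]) -> float:
    """Score an image against a sentence (assumed form, not confirmed by the abstract).

    Each word vector is matched to the region vector it agrees with most,
    and the per-word best matches are summed.
    """
    return sum(max(dot(v, s) for v in regions) for s in words)


def ranking_loss(scores: Sequence[Sequence[float]], margin: float = 1.0) -> float:
    """Max-margin ranking loss over a square score matrix.

    scores[k][l] is the score of image k against sentence l; matching
    pairs sit on the diagonal. Each mismatched pair is penalized when it
    scores within `margin` of the matching pair.
    """
    n = len(scores)
    loss = 0.0
    for k in range(n):
        for l in range(n):
            if l == k:
                continue
            # Image k should prefer its own sentence over sentence l.
            loss += max(0.0, scores[k][l] - scores[k][k] + margin)
            # Sentence k should prefer its own image over image l.
            loss += max(0.0, scores[l][k] - scores[k][k] + margin)
    return loss
```

In this sketch the loss is zero once every matching image-sentence pair outscores all mismatched pairs by at least the margin, which is the property a retrieval experiment depends on.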

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning