Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Ranjay Krishna; Yuke Zhu; Oliver Groth; Justin Johnson; Kenji Hata; Joshua Kravitz; Stephanie Chen; Yannis Kalantidis; Li-Jia Li; David A. Shamma; Michael S. Bernstein; Fei-Fei Li

arXiv:1602.07332 · cs.CV · February 25, 2016 · 251 cites

TL;DR

The paper introduces Visual Genome, a large, densely annotated image dataset that records the objects in each image, their attributes, and the relationships between them, so that models can be trained for cognitive vision tasks such as image description and question answering.

Contribution

It provides dense per-image annotations of objects, attributes, and relationships, which the authors present as the densest and largest collection of its kind, to support reasoning about image content beyond what earlier perception-oriented datasets allow.

Findings

1. Over 100K images with detailed annotations

2. An average of 21 objects, 18 attributes, and 18 pairwise relationships per image

3. Objects, attributes, and relationships canonicalized to WordNet synsets

Abstract

Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) in order to answer correctly that "the person is riding a horse-drawn carriage". In this paper, we present the Visual Genome dataset to enable the modeling of…
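The carriage example in the abstract can be made concrete with a small sketch. The code below is only an illustration of how relationship triples such as riding(man, carriage) and pulling(horse, carriage) could be stored and queried. The `Relationship` class and the helpers `objects_of`, `subjects_of`, and `describe_vehicle` are hypothetical names introduced here; they are not the Visual Genome data format or any official API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One predicate(subject, object) triple. Hypothetical structure for illustration."""
    predicate: str
    subject: str
    object: str


# The two relationships named in the abstract's example.
scene = [
    Relationship("riding", "man", "carriage"),
    Relationship("pulling", "horse", "carriage"),
]


def objects_of(relationships, predicate, subject):
    """Return every object X such that predicate(subject, X) is in the scene."""
    return [r.object for r in relationships
            if r.predicate == predicate and r.subject == subject]


def subjects_of(relationships, predicate, obj):
    """Return every subject X such that predicate(X, obj) is in the scene."""
    return [r.subject for r in relationships
            if r.predicate == predicate and r.object == obj]


def describe_vehicle(relationships, rider):
    """Answer 'What vehicle is <rider> riding?' by chaining two relationships.

    Assumption for this sketch: a vehicle that something is pulling is
    described as '<puller>-drawn <vehicle>'.
    """
    answers = []
    for vehicle in objects_of(relationships, "riding", rider):
        pullers = subjects_of(relationships, "pulling", vehicle)
        if pullers:
            answers.append(f"{pullers[0]}-drawn {vehicle}")
        else:
            answers.append(vehicle)
    return answers


print(describe_vehicle(scene, "man"))  # ['horse-drawn carriage']
```

The point of the sketch is that the answer needs both triples: recognizing the objects alone yields "carriage", while combining riding and pulling yields "horse-drawn carriage".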

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques