VCD: A Dataset for Visual Commonsense Discovery in Images
Xiangqing Shen, Fanfan Wang, Siwei Wu, Rui Xia

TL;DR
VCD is a large-scale dataset of structured visual commonsense knowledge grounded in images, covering both directly observable (Seen) and inferable (Unseen) aspects of visual scenes.
Contribution
The paper introduces VCD, a dataset organized by a novel three-level taxonomy of visual commonsense, and VCM, a generative model for discovering diverse visual commonsense in images.
Findings
VCD contains over 100,000 images and 14 million object-commonsense pairs.
VCD's taxonomy covers Seen and Unseen commonsense in Property, Action, and Space.
VCM effectively discovers diverse visual commonsense from images.
Abstract
Visual commonsense plays a vital role in understanding and reasoning about the visual world. While commonsense knowledge bases like ConceptNet provide structured collections of general facts, they lack visually grounded representations. Scene graph datasets like Visual Genome, though rich in object-level descriptions, primarily focus on directly observable information and lack systematic categorization of commonsense knowledge. We present Visual Commonsense Dataset (VCD), a large-scale dataset containing over 100,000 images and 14 million object-commonsense pairs that bridges this gap. VCD introduces a novel three-level taxonomy for visual commonsense, integrating both Seen (directly observable) and Unseen (inferrable) commonsense across Property, Action, and Space aspects. Each commonsense is represented as a triple where the head entity is grounded to object bounding boxes in images,…
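The representation described in the abstract (a triple whose head entity is tied to a bounding box, labeled as Seen or Unseen and as Property, Action, or Space) can be sketched as a small data structure. This is only an illustrative sketch, not the dataset's actual schema: the class and field names, the (x, y, width, height) box convention, the relation strings, and the example records are all assumptions made here. The third level of the taxonomy is not spelled out in this summary, so the `relation` field stands in for it.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Visibility(Enum):
    SEEN = "Seen"      # directly observable in the image
    UNSEEN = "Unseen"  # inferable, not directly observable


class Aspect(Enum):
    PROPERTY = "Property"
    ACTION = "Action"
    SPACE = "Space"


@dataclass(frozen=True)
class GroundedTriple:
    """Hypothetical record for one object-commonsense pair."""
    head: str                                     # object name
    head_bbox: Tuple[float, float, float, float]  # assumed (x, y, width, height)
    relation: str                                 # placeholder for the finer-grained type
    tail: str                                     # the commonsense value
    visibility: Visibility
    aspect: Aspect


def group_by_category(
    triples: List[GroundedTriple],
) -> Dict[Tuple[Visibility, Aspect], List[GroundedTriple]]:
    """Bucket triples by their (Seen/Unseen, Property/Action/Space) category."""
    groups = defaultdict(list)
    for triple in triples:
        groups[(triple.visibility, triple.aspect)].append(triple)
    return dict(groups)


# Made-up examples for illustration only; they are not taken from VCD.
cup_box = (10.0, 20.0, 50.0, 60.0)
examples = [
    GroundedTriple("cup", cup_box, "has_color", "white",
                   Visibility.SEEN, Aspect.PROPERTY),
    GroundedTriple("cup", cup_box, "used_for", "drinking",
                   Visibility.UNSEEN, Aspect.ACTION),
]
groups = group_by_category(examples)
```

Keeping the Seen/Unseen and Property/Action/Space labels as separate enums gives the six top-level categories implied by the taxonomy and makes per-category filtering straightforward.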
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Balanced Selection
