Visually Grounded Commonsense Knowledge Acquisition
Yuan Yao, Tianyu Yu, Ao Zhang, Mengdi Li, Ruobing Xie, Cornelius Weber, Zhiyuan Liu, Hai-Tao Zheng, Stefan Wermter, Tat-Seng Chua, Maosong Sun

TL;DR
This paper introduces CLEVER, a novel vision-language approach that automatically extracts grounded commonsense knowledge from images using multi-instance learning and contrastive attention, significantly outperforming previous text-based methods.
Contribution
CLEVER formulates commonsense knowledge extraction as a distantly supervised multi-instance learning task leveraging vision-language models, introducing a contrastive attention mechanism for improved accuracy.
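As a rough illustration of the multi-instance setup described above, the sketch below pools a bag of per-image feature vectors into a single bag-level vector using attention weights. The scoring rule (similarity to a relation query minus similarity to a "no relation" query) is only one plausible reading of "contrastive attention" and is an assumption here, as are the names `contrastive_attention_pool`, `rel_query`, and `na_query`; it is not taken from the paper's implementation.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contrastive_attention_pool(bag, rel_query, na_query):
    """Pool a bag of image feature vectors into one vector.

    Assumed (hypothetical) scoring rule: an image scores highly when it
    matches the relation query better than the 'no relation' query.
    """
    scores = [dot(x, rel_query) - dot(x, na_query) for x in bag]
    weights = softmax(scores)  # one attention weight per image in the bag
    dim = len(bag[0])
    pooled = [sum(w * x[d] for w, x in zip(weights, bag)) for d in range(dim)]
    return pooled, weights


# Toy bag of three 2-d "image features" for one entity pair (made-up numbers).
bag = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
rel_query = [2.0, 0.0]  # hypothetical query for the candidate relation
na_query = [0.0, 2.0]   # hypothetical query for "no relation"
pooled, weights = contrastive_attention_pool(bag, rel_query, na_query)
```

In this toy bag the first image receives the largest weight because it aligns with the relation query and not with the "no relation" query, so the pooled vector is dominated by that image. In a real system the features would come from a vision-language model, and the pooled vector would feed a relation classifier trained only from bag-level labels.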
Findings
Outperforms pre-trained language model-based methods by 3.9 AUC and 6.4 mAUC points.
Achieves a 0.78 Spearman correlation with human judgments.
Provides interpretable grounding of commonsense in images.
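For readers unfamiliar with the Spearman figure reported above, the sketch below shows how a rank correlation between model scores and human ratings can be computed. The data are toy values invented for illustration, not results from the paper, and the helper names are hypothetical.

```python
import math


def average_ranks(values):
    # Assign 1-based ranks; tied values share the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: model confidence vs. human plausibility ratings (made up).
model_scores = [0.9, 0.2, 0.6, 0.4]
human_ratings = [5, 1, 3, 4]
rho = spearman(model_scores, human_ratings)  # roughly 0.8 for this toy data
```

A value near 1 means the model orders candidate facts almost the same way human raters do, regardless of the absolute score scale.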
Abstract
Large-scale commonsense knowledge bases empower a broad range of AI applications, where the automatic extraction of commonsense knowledge (CKE) is a fundamental and challenging problem. CKE from text is known for suffering from the inherent sparsity and reporting bias of commonsense in text. Visual perception, on the other hand, contains rich commonsense knowledge about real-world entities, e.g., (person, can_hold, bottle), which can serve as promising sources for acquiring grounded commonsense knowledge. In this work, we present CLEVER, which formulates CKE as a distantly supervised multi-instance learning problem, where models learn to summarize commonsense relations from a bag of images about an entity pair without any human annotation on image instances. To address the problem, CLEVER leverages vision-language pre-training models for deep understanding of each image in the bag, and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Biomedical Text Mining and Ontologies
