Visually Grounded Commonsense Knowledge Acquisition

Yuan Yao; Tianyu Yu; Ao Zhang; Mengdi Li; Ruobing Xie; Cornelius Weber; Zhiyuan Liu; Hai-Tao Zheng; Stefan Wermter; Tat-Seng Chua; Maosong Sun

arXiv:2211.12054 · cs.CV · March 28, 2023 · 1 citation

TL;DR

This paper introduces CLEVER, a novel vision-language approach that automatically extracts grounded commonsense knowledge from images using multi-instance learning and contrastive attention, significantly outperforming previous text-based methods.

Contribution

CLEVER formulates commonsense knowledge extraction as a distantly supervised multi-instance learning task leveraging vision-language models, introducing a contrastive attention mechanism for improved accuracy.
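
To make the multi-instance setup concrete, here is a minimal sketch of bag-level aggregation, assuming each image in a bag has already been encoded as a feature vector and each candidate relation has a query vector. It uses plain softmax attention pooling, not CLEVER's contrastive attention mechanism, whose details are not given on this page. The function names `attention_pool` and `relation_scores`, the feature vectors, and the relation names are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_pool(
    bag: Sequence[Sequence[float]], query: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Summarize a bag of image features into one vector.

    Each image is weighted by how well it matches the relation query,
    so no per-image label is needed (the distant-supervision setting).
    """
    if not bag:
        raise ValueError("bag must contain at least one image feature")
    weights = softmax([dot(features, query) for features in bag])
    pooled = [
        sum(w * features[d] for w, features in zip(weights, bag))
        for d in range(len(query))
    ]
    return pooled, weights


def relation_scores(
    bag: Sequence[Sequence[float]],
    relation_queries: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Score every candidate relation for one entity-pair bag."""
    scores = {}
    for name, query in relation_queries.items():
        pooled, _ = attention_pool(bag, query)
        scores[name] = dot(pooled, query)
    return scores


# Hypothetical 2-d features for three images of one entity pair.
bag = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
queries = {"can_hold": [1.0, 0.0], "unrelated": [0.0, 1.0]}
print(relation_scores(bag, queries))
```

The attention weights returned by `attention_pool` also indicate which images contributed most to a relation's score, which is one simple way to get image-level grounding from bag-level supervision.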

Findings

01

Outperforms language model-based methods by 3.9 AUC points and 6.4 mAUC points.

02

Achieves a 0.78 Spearman correlation with human judgments.

03

Provides interpretable grounding of commonsense in images.
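
For readers unfamiliar with the metric in the second finding, the sketch below shows how a Spearman rank correlation between model scores and human ratings can be computed: rank both lists (averaging ranks for ties), then take the Pearson correlation of the ranks. The helper names and the toy data in the usage lines are hypothetical and are not the paper's evaluation data or code.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Toy example: hypothetical model scores vs. hypothetical human ratings.
model_scores = [0.10, 0.40, 0.35, 0.80]
human_ratings = [1, 3, 2, 4]
print(spearman(model_scores, human_ratings))
```

Because only the rank order matters, the metric rewards a model for ordering candidate facts the way human judges do, regardless of the absolute scale of its scores.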

Abstract

Large-scale commonsense knowledge bases empower a broad range of AI applications, where the automatic extraction of commonsense knowledge (CKE) is a fundamental and challenging problem. CKE from text is known for suffering from the inherent sparsity and reporting bias of commonsense in text. Visual perception, on the other hand, contains rich commonsense knowledge about real-world entities, e.g., (person, can_hold, bottle), which can serve as promising sources for acquiring grounded commonsense knowledge. In this work, we present CLEVER, which formulates CKE as a distantly supervised multi-instance learning problem, where models learn to summarize commonsense relations from a bag of images about an entity pair without any human annotation on image instances. To address the problem, CLEVER leverages vision-language pre-training models for deep understanding of each image in the bag, and…

Code & Models

Repositories

thunlp/clever
PyTorch · Official

Videos

Visually Grounded Commonsense Knowledge Acquisition · Underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Biomedical Text Mining and Ontologies