Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles

Shuquan Ye, Yujia Xie, Dongdong Chen, Yichong Xu, Lu Yuan, Chenguang Zhu, and Jing Liao

arXiv:2211.16504·cs.CV·November 30, 2022·1 cites

TL;DR

This paper introduces DANCE, a scalable data augmentation method using knowledge graph linearization to enhance commonsense reasoning in vision-language models, validated by a new diagnostic benchmark.

Contribution

We propose DANCE, a novel data augmentation technique leveraging knowledge graphs to improve commonsense in VL models without additional dataset collection.

Findings

1. DANCE significantly improves commonsense reasoning in VL models.
2. DANCE maintains performance on standard retrieval tasks.
3. A new retrieval-based benchmark evaluates commonsense capabilities.

Abstract

This paper analyzes and improves the commonsense ability of recent popular vision-language (VL) models. Despite their great success, we observe that existing VL models still lack commonsense knowledge and reasoning ability (e.g., "Lemons are sour"), which is a vital component on the path towards artificial general intelligence. Through our analysis, we find that one important reason is that existing large-scale VL datasets do not contain much commonsense knowledge, which motivates us to improve the commonsense of VL models from the data perspective. Rather than collecting a new VL training dataset, we propose a more scalable strategy, namely "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE). It can be viewed as a data augmentation technique that injects commonsense knowledge into existing VL datasets on the fly during training. …
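
The idea of injecting knowledge-graph facts into captions on the fly can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the relation templates, the helper names `linearize` and `augment_caption`, the toy knowledge graph, and the choice to phrase facts as a riddle about "this item" (a guess suggested by the paper's title). The official repository should be consulted for the actual method.

```python
import random

# Hypothetical relation-to-text templates (ConceptNet-style relation names).
# This mapping is an illustrative assumption, not taken from the paper.
TEMPLATES = {
    "HasProperty": "is {tail}",
    "UsedFor": "is used for {tail}",
    "IsA": "is a kind of {tail}",
    "AtLocation": "can be found at {tail}",
}


def linearize(triples, subject="this item"):
    """Turn (head, relation, tail) triples about one head into a text fragment.

    The head's name is hidden behind `subject`, giving a riddle-like phrase.
    Triples whose relation has no template are skipped.
    """
    clauses = [
        TEMPLATES[rel].format(tail=tail)
        for _head, rel, tail in triples
        if rel in TEMPLATES
    ]
    if not clauses:
        return ""
    return f"{subject} " + " and ".join(clauses)


def augment_caption(caption, entity, kg, rng, max_facts=2):
    """Append a linearized fact about `entity` to `caption` at training time.

    `kg` maps an entity name to its list of triples. The caption is returned
    unchanged when the entity is absent or no usable facts exist.
    """
    facts = kg.get(entity, [])
    if entity not in caption or not facts:
        return caption
    chosen = rng.sample(facts, k=min(max_facts, len(facts)))
    riddle = linearize(chosen)
    if not riddle:
        return caption
    sentence = riddle[0].upper() + riddle[1:]
    return f"{caption.rstrip('.')}. {sentence}."


# Toy knowledge graph built from the abstract's own example, "Lemons are sour".
toy_kg = {"lemon": [("lemon", "HasProperty", "sour")]}
augmented = augment_caption("A lemon on a table.", "lemon", toy_kg, random.Random(0))
print(augmented)
```

Because the augmentation is a pure function of the caption and the graph, it could be called inside a dataset's item-loading step, so no augmented copy of the dataset needs to be stored. That is one plausible reading of "on the fly" here.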

Code & Models

Repositories

pleaseconnectwifi/dance (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Data Augmentation with kNowledge graph linearization for CommonsensE capability (DANCE)