KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning

Dandan Song; Siyi Ma; Zhanchen Sun; Sicheng Yang; Lejian Liao

arXiv:2012.07000 · cs.AI · December 15, 2020 · 6 cites

Open Access

TL;DR

KVL-BERT enhances visual commonsense reasoning by integrating external knowledge from ConceptNet into a BERT-based model, significantly improving performance on VCR tasks.

Contribution

This paper introduces KVL-BERT, a novel model that incorporates external commonsense knowledge into a visual-linguistic BERT for better reasoning.

Findings

1. KVL-BERT outperforms existing models on VCR benchmarks.
2. Incorporating commonsense knowledge improves reasoning accuracy.
3. The method effectively integrates external knowledge without disrupting original representations.

Abstract

Reasoning is a critical ability for complete visual understanding. To develop machines with cognition-level visual understanding and reasoning abilities, the visual commonsense reasoning (VCR) task has been introduced. In VCR, given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Methods that adopt the powerful BERT model as the backbone for learning joint representations of image content and natural language have shown promising improvements on VCR. However, none of the existing methods have utilized commonsense knowledge in visual commonsense reasoning, which we believe will be greatly helpful in this task. With the support of commonsense knowledge, complex questions can be answered through cognitive reasoning even when the required information is not depicted in the image. Therefore, we incorporate commonsense…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Linear Warmup With Linear Decay · Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Dropout · Weight Decay · Label Smoothing