KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning
Dandan Song, Siyi Ma, Zhanchen Sun, Sicheng Yang, Lejian Liao

TL;DR
KVL-BERT enhances visual commonsense reasoning by integrating external knowledge from ConceptNet into a BERT-based model, significantly improving performance on VCR tasks.
Contribution
This paper introduces KVL-BERT, a novel model that incorporates external commonsense knowledge into a visual-linguistic BERT for better reasoning.
Findings
KVL-BERT outperforms existing models on VCR benchmarks.
Incorporating commonsense knowledge improves reasoning accuracy.
The method integrates external knowledge while preserving the structure and meaning of the original input sentence.
Abstract
Reasoning is a critical ability for complete visual understanding. To develop machines with cognition-level visual understanding and reasoning abilities, the visual commonsense reasoning (VCR) task has been introduced. In VCR, given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Methods that adopt the powerful BERT model as the backbone for learning joint representations of image content and natural language have shown promising improvements on VCR. However, none of the existing methods have utilized commonsense knowledge in visual commonsense reasoning, which we believe will be greatly helpful in this task. With the support of commonsense knowledge, complex questions can be answered through cognitive reasoning even when the required information is not depicted in the image. Therefore, we incorporate commonsense…
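The two-step task described in the abstract (pick an answer, then pick a rationale) is usually scored with three accuracies: answer selection (Q->A), rationale selection (QA->R), and the joint score (Q->AR), which counts an example only when both choices are right. The following is a minimal sketch of that scoring, not the authors' evaluation code. The names `VCRPrediction` and `vcr_accuracies` are hypothetical, and the sketch assumes each prediction is stored as a chosen index alongside the gold index.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class VCRPrediction:
    """One scored example (hypothetical container, not from the paper)."""
    answer_pred: int      # index of the answer the model chose
    answer_gold: int      # index of the correct answer
    rationale_pred: int   # index of the rationale the model chose
    rationale_gold: int   # index of the correct rationale


def vcr_accuracies(preds: Sequence[VCRPrediction]) -> Dict[str, float]:
    """Return Q->A, QA->R, and joint Q->AR accuracy over the examples."""
    if not preds:
        raise ValueError("need at least one prediction")
    n = len(preds)
    answer_hits = sum(p.answer_pred == p.answer_gold for p in preds)
    rationale_hits = sum(p.rationale_pred == p.rationale_gold for p in preds)
    # The joint score counts an example only if BOTH choices are correct.
    joint_hits = sum(
        p.answer_pred == p.answer_gold and p.rationale_pred == p.rationale_gold
        for p in preds
    )
    return {
        "Q->A": answer_hits / n,
        "QA->R": rationale_hits / n,
        "Q->AR": joint_hits / n,
    }


# Toy data (made up for illustration, not results from the paper).
examples = [
    VCRPrediction(0, 0, 1, 1),  # answer and rationale both correct
    VCRPrediction(2, 2, 0, 3),  # answer correct, rationale wrong
    VCRPrediction(1, 3, 2, 2),  # answer wrong, rationale correct
    VCRPrediction(0, 1, 0, 1),  # both wrong
]
scores = vcr_accuracies(examples)
```

On this toy data, half the answers and half the rationales are right, but only one of the four examples has both, which is why the joint score is the strictest of the three.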
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Linear Warmup With Linear Decay · Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Dropout · Weight Decay · Label Smoothing
