VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge
Sahithya Ravi, Aditya Chinchure, Leonid Sigal, Renjie Liao, Vered, Shwartz

TL;DR
VLC-BERT is a novel vision-language model that incorporates contextualized commonsense knowledge from COMET to improve visual question answering, especially for knowledge-intensive questions, outperforming models using static knowledge bases.
Contribution
This work introduces VLC-BERT, a new pre-trained transformer that integrates contextualized commonsense knowledge with visual and textual information for VQA tasks.
Findings
VLC-BERT outperforms static knowledge base models on OK-VQA and A-OKVQA datasets.
Contextualized knowledge benefits certain question types more than others.
Analysis reveals which questions gain from COMET-based knowledge integration.
Abstract
There has been a growing interest in solving Visual Question Answering (VQA) tasks that require the model to reason beyond the content present in the image. In this work, we focus on questions that require commonsense reasoning. In contrast to previous methods which inject knowledge from static knowledge bases, we investigate the incorporation of contextualized knowledge using Commonsense Transformer (COMET), an existing knowledge model trained on human-curated knowledge bases. We propose a method to generate, select, and encode external commonsense knowledge alongside visual and textual cues in a new pre-trained Vision-Language-Commonsense transformer model, VLC-BERT. Through our evaluation on the knowledge-intensive OK-VQA and A-OKVQA datasets, we show that VLC-BERT is capable of outperforming existing models that utilize static knowledge bases. Furthermore, through a detailed…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge· youtube
VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge· youtube
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing · Absolute Position Encodings · Layer Normalization
