VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge

Sahithya Ravi; Aditya Chinchure; Leonid Sigal; Renjie Liao; Vered Shwartz

arXiv:2210.13626·cs.CV·October 26, 2022·1 cites

TL;DR

VLC-BERT is a novel vision-language model that incorporates contextualized commonsense knowledge from COMET to improve visual question answering, especially for knowledge-intensive questions, outperforming models using static knowledge bases.

Contribution

This work introduces VLC-BERT, a new pre-trained transformer that integrates contextualized commonsense knowledge with visual and textual information for VQA tasks.

Findings

01

VLC-BERT outperforms existing models that rely on static knowledge bases on the OK-VQA and A-OKVQA datasets.

02

Contextualized knowledge benefits certain question types more than others.

03

Analysis reveals which questions gain from COMET-based knowledge integration.

Abstract

There has been a growing interest in solving Visual Question Answering (VQA) tasks that require the model to reason beyond the content present in the image. In this work, we focus on questions that require commonsense reasoning. In contrast to previous methods which inject knowledge from static knowledge bases, we investigate the incorporation of contextualized knowledge using Commonsense Transformer (COMET), an existing knowledge model trained on human-curated knowledge bases. We propose a method to generate, select, and encode external commonsense knowledge alongside visual and textual cues in a new pre-trained Vision-Language-Commonsense transformer model, VLC-BERT. Through our evaluation on the knowledge-intensive OK-VQA and A-OKVQA datasets, we show that VLC-BERT is capable of outperforming existing models that utilize static knowledge bases. Furthermore, through a detailed…
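The generate, select, and encode steps named in the abstract can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the candidate sentences stand in for COMET output, the word-overlap ranking is a simple substitute for whatever selection method the paper uses, and the function names and `[SEP]`-joined string are hypothetical rather than taken from the VLC-BERT code.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_inferences(question, candidates, k=2):
    """Keep the k candidate sentences with the highest Jaccard overlap with the question.

    The overlap score is a stand-in chosen for this sketch, not the paper's method.
    """
    q = tokenize(question)

    def score(sentence):
        s = tokenize(sentence)
        union = q | s
        return len(q & s) / len(union) if union else 0.0

    # sorted() is stable, so tied candidates keep their original order.
    return sorted(candidates, key=score, reverse=True)[:k]


def build_text_input(question, selected):
    """Join the question and the selected sentences into one text sequence."""
    return " [SEP] ".join([question] + selected)


# Hypothetical candidate inferences, standing in for generated commonsense knowledge.
candidates = [
    "An umbrella is used for staying dry in the rain",
    "A dog is capable of barking",
    "The person holds the umbrella because it is raining",
]
question = "Why is the person holding an umbrella?"

selected = select_inferences(question, candidates, k=2)
text_input = build_text_input(question, selected)
print(text_input)
```

In a full model the resulting text would be encoded together with visual features; that part is omitted here because the abstract does not describe it.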

Code & Models

Repositories

aditya10/vlc-bert · PyTorch · Official

Videos

VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing · Absolute Position Encodings · Layer Normalization