Multimodal Commonsense Knowledge Distillation for Visual Question Answering
Shuo Yang, Siwen Luo, Soyeon Caren Han

TL;DR
This paper introduces a graph-based knowledge distillation framework that enhances visual question answering models by integrating multimodal commonsense knowledge without extensive fine-tuning.
Contribution
It proposes a novel GCN-based knowledge distillation method that constructs a unified relational graph for multimodal commonsense reasoning in VQA.
Findings
Achieves competitive performance on the ScienceQA dataset.
Offers a flexible framework compatible with various teacher and student models.
Reduces the need for costly fine-tuning of multimodal VQA models.
Abstract
Existing Multimodal Large Language Models (MLLMs) and Visual Language Pretrained Models (VLPMs) have shown remarkable performance on general Visual Question Answering (VQA). However, these models struggle with VQA questions that require external commonsense knowledge, because high-quality prompts are hard to generate and fine-tuning is computationally expensive. In this work, we propose a novel graph-based multimodal commonsense knowledge distillation framework that constructs a unified relational graph over commonsense knowledge, visual objects, and questions through a Graph Convolutional Network (GCN) in a teacher-student setting. The proposed framework works with any type of teacher and student model without further fine-tuning, and achieves competitive performance on the ScienceQA dataset.
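The two ingredients named in the abstract, GCN propagation over a relational graph and teacher-student distillation, can be illustrated with a minimal sketch. This is not the authors' implementation: the three-node toy graph (question, visual object, commonsense fact), the feature and weight values, the symmetric-normalized GCN rule, and the temperature-softened KL loss are all assumptions chosen for illustration, and every function name below is hypothetical.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}: the adjacency with self-loops, symmetrically normalized."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # every degree is >= 1 because of the self-loop
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj_norm, features, weights):
    """One GCN propagation step: ReLU(A_norm @ X @ W)."""
    hidden = matmul(matmul(adj_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened answer distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Assumed toy graph: node 0 = question, node 1 = visual object, node 2 = commonsense fact.
# The question node is linked to both other nodes.
adj = [[0, 1, 1],
       [1, 0, 0],
       [1, 0, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up 2-d node features
weights = [[0.5, -0.5], [0.25, 0.75]]            # made-up layer weights

adj_norm = normalize_adjacency(adj)
node_embeddings = gcn_layer(adj_norm, features, weights)

# Made-up teacher and student logits over three answer options.
loss = distillation_loss(student_logits=[1.0, 0.5, 0.1], teacher_logits=[2.0, 0.3, 0.1])
```

In this sketch the student would be trained to lower `loss` so that its answer distribution approaches the teacher's; how the paper itself builds graph edges and defines its distillation objective is not specified in the abstract.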
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Speech and dialogue systems
Methods: Knowledge Distillation
