Multimodal Commonsense Knowledge Distillation for Visual Question Answering

Shuo Yang; Siwen Luo; Soyeon Caren Han

arXiv:2411.02722·cs.CL·November 6, 2024

Open Access

TL;DR

This paper introduces a graph-based knowledge distillation framework that enhances visual question answering models by integrating multimodal commonsense knowledge without extensive fine-tuning.

Contribution

It proposes a novel GCN-based knowledge distillation method that constructs a unified relational graph for multimodal commonsense reasoning in VQA.

Findings

01

Achieved competitive performance on the ScienceQA dataset.

02

Flexible framework compatible with various teacher and student models.

03

Reduces the need for high-cost fine-tuning in multimodal VQA models.

Abstract

Existing Multimodal Large Language Models (MLLMs) and Visual Language Pretrained Models (VLPMs) have shown remarkable performance on general Visual Question Answering (VQA). However, these models struggle with VQA questions that require external commonsense knowledge, owing to the difficulty of generating high-quality prompts and the high computational cost of fine-tuning. In this work, we propose a novel graph-based multimodal commonsense knowledge distillation framework that constructs a unified relational graph over commonsense knowledge, visual objects, and questions through a Graph Convolutional Network (GCN) in a teacher-student setting. The proposed framework works with any type of teacher and student model without further fine-tuning and achieves competitive performance on the ScienceQA dataset.
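As a rough illustration of the two ingredients named in the abstract (graph convolution over a unified graph and teacher-student distillation), the sketch below runs one standard GCN propagation step over a toy three-node graph and computes a temperature-scaled KL distillation loss. Everything concrete here is an assumption made for illustration: the node roles, the feature values, the weight matrix, the choice of the graph-side logits as the teacher signal, and the loss form are not taken from the paper, which the abstract does not describe at this level of detail.

```python
import math

def normalize_adjacency(adj):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Dense matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(norm_adj, feats, weight, relu=True):
    """One graph convolution: A_norm @ X @ W, optionally followed by ReLU."""
    out = matmul(matmul(norm_adj, feats), weight)
    if relu:
        out = [[max(0.0, v) for v in row] for row in out]
    return out

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Temperature-scaled KL(teacher || student), a common distillation loss."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical toy graph: node 0 = question, node 1 = visual object,
# node 2 = commonsense fact. The question is linked to both other nodes.
adj = [[0, 1, 1],
       [1, 0, 0],
       [1, 0, 0]]
feats = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # made-up 2-d node features
weight = [[1.0, -1.0], [-1.0, 1.0]]            # made-up weights, 2 answer options

norm_adj = normalize_adjacency(adj)
graph_logits = gcn_layer(norm_adj, feats, weight, relu=False)

# Assumption: the question node's graph output acts as the teacher signal,
# and student_logits stands in for any student model's answer scores.
teacher_logits = graph_logits[0]
student_logits = [0.2, 0.1]
loss = distillation_loss(teacher_logits, student_logits)
```

Because the loss depends only on logits, the student side can be any model that scores the same answer options, which is one way to read the abstract's claim that the framework is flexible with respect to teacher and student models.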

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Speech and dialogue systems

Methods: Knowledge Distillation