Reasoning Visual Dialog with Sparse Graph Learning and Knowledge Transfer
Gi-Cheon Kang, Junseok Park, Hwaran Lee, Byoung-Tak Zhang, Jin-Hwa Kim

TL;DR
This paper introduces Sparse Graph Learning (SGL), combined with Knowledge Transfer (KT), to improve reasoning and answer diversity in visual dialog systems, and reports results that outperform prior methods on the VisDial v1.0 dataset.
Contribution
The paper proposes a Sparse Graph Learning framework that infers sparse structure among dialog rounds for reasoning, and a Knowledge Transfer technique that uses a teacher model's answer predictions as pseudo labels to improve answer diversity.
Findings
Outperforms the state of the art on the VisDial v1.0 dataset
Enhances reasoning capabilities in visual dialog models
Increases diversity of generated answers
Abstract
Visual dialog is a task of answering a sequence of questions grounded in an image using the previous dialog history as context. In this paper, we study how to address two fundamental challenges for this task: (1) reasoning over underlying semantic structures among dialog rounds and (2) identifying several appropriate answers to the given question. To address these challenges, we propose a Sparse Graph Learning (SGL) method to formulate visual dialog as a graph structure learning task. SGL infers inherently sparse dialog structures by incorporating binary and score edges and leveraging a new structural loss function. Next, we introduce a Knowledge Transfer (KT) method that extracts the answer predictions from the teacher model and uses them as pseudo labels. We propose KT to remedy the shortcomings of single ground-truth labels, which severely limit the ability of a model to obtain…
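The two ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sparse_adjacency`, `soft_cross_entropy`) are hypothetical, combining binary and score edges by element-wise product is an assumption, and using the teacher's softmax output directly as a soft pseudo label is one common way to realize this kind of knowledge transfer, not necessarily the one used in the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_adjacency(binary_edges, score_edges):
    """Assumed combination rule: a binary edge decides whether two dialog
    rounds are linked at all, a score edge weighs the link that remains."""
    return [
        [b * s for b, s in zip(b_row, s_row)]
        for b_row, s_row in zip(binary_edges, score_edges)
    ]


def soft_cross_entropy(student_logits, target_probs):
    """Cross-entropy of the student's distribution against soft targets."""
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs))


# Toy dialog with three rounds; row t lists links from round t to earlier rounds.
binary = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
scores = [[0.0, 0.0, 0.0], [0.9, 0.0, 0.0], [0.3, 0.7, 0.0]]
adjacency = sparse_adjacency(binary, scores)

# Toy answer candidates: the teacher finds two of the three equally plausible.
teacher_logits = [2.0, 2.0, -5.0]
pseudo = softmax(teacher_logits)  # soft pseudo label taken from the teacher

# A student that matches the teacher incurs a lower loss than one that
# puts nearly all of its mass on a single candidate.
loss_match = soft_cross_entropy([2.0, 2.0, -5.0], pseudo)
loss_peaked = soft_cross_entropy([5.0, 0.0, 0.0], pseudo)
```

In this toy setup the pseudo label rewards a student for spreading probability over both plausible answers, which a single one-hot ground-truth label would not do.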
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Interpretability · Softmax
