SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning
Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, Shih-Fu Chang

TL;DR
This paper introduces SGEITL, a framework that incorporates visual scene graphs into multimodal Transformer models to improve visual commonsense reasoning.
Contribution
It proposes a multihop graph transformer and a scene-graph-aware pre-training method to incorporate scene graph information into visual-text reasoning models.
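One way to picture the multihop idea named above is to restrict which scene-graph nodes may attend to each other according to how many hops separate them. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the toy graph, the undirected treatment of edges, and the helper names `hop_distances` and `hop_mask` are all hypothetical.

```python
from collections import deque


def hop_distances(num_nodes, edges):
    """All-pairs hop counts via BFS; None marks unreachable pairs.

    Assumption: edges are treated as undirected for simplicity.
    """
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [[None] * num_nodes for _ in range(num_nodes)]
    for src in range(num_nodes):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if dist[src][nxt] is None:
                    dist[src][nxt] = dist[src][cur] + 1
                    queue.append(nxt)
    return dist


def hop_mask(dist, max_hops):
    """Boolean mask: True where node j lies within max_hops of node i."""
    return [[d is not None and d <= max_hops for d in row] for row in dist]


# Hypothetical toy scene graph, objects and predicates both as nodes:
# person(0) - holds(1) - cup(2) - on(3) - table(4)
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
dist = hop_distances(5, edges)
mask_1hop = hop_mask(dist, 1)  # each node sees itself and direct neighbors
mask_2hop = hop_mask(dist, 2)  # widens the receptive field by one hop
```

In a Transformer, a mask like this would typically be applied to the attention scores before the softmax, so that different heads or layers could be limited to different hop ranges. How SGEITL actually regularizes attention among hops is described in the paper itself.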
Findings
The framework yields significant performance improvements on VCR and related tasks.
Exploiting scene graph structure improves reasoning accuracy.
Each proposed component contributes to the overall performance gain.
Abstract
Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as a strong reasoning ability. Recently, multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning (VCR), by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs in commonsense reasoning. To exploit the scene graph structure, at the model structure level, we propose a multihop graph transformer for regularizing attention interaction among hops. As for…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Residual Connection · Layer Normalization · Dropout · Label Smoothing · Byte Pair Encoding
