SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Zhecan Wang; Haoxuan You; Liunian Harold Li; Alireza Zareian; Suji Park; Yiqing Liang; Kai-Wei Chang; Shih-Fu Chang

arXiv:2112.08587 · cs.CV · December 17, 2021

TL;DR

This paper introduces SGEITL, a novel framework that enhances visual commonsense reasoning by integrating scene graph structures into multimodal Transformer models, improving understanding and reasoning capabilities.

Contribution

It proposes a multihop graph transformer and a scene-graph-aware pre-training method to incorporate scene graph information into visual-text reasoning models.
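One way to picture hop-aware attention over a scene graph is sketched below. This is an illustrative assumption, not the paper's formulation: the function names `hop_distances` and `hop_masked_attention`, the hard hop cutoff, and the toy graph are all hypothetical. The sketch computes how many hops separate each pair of graph nodes and then restricts a softmax attention row to nodes within a chosen hop limit.

```python
import math
from collections import deque


def hop_distances(adj):
    """All-pairs hop counts via breadth-first search; -1 means unreachable."""
    n = len(adj)
    dist = [[-1] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and dist[src][v] < 0:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def hop_masked_attention(scores, dist, max_hops):
    """Row-wise softmax over scores, keeping only nodes within max_hops."""
    weights = []
    for row_scores, row_dist in zip(scores, dist):
        allowed = [0 <= d <= max_hops for d in row_dist]
        # Each node is 0 hops from itself, so at least one entry is allowed.
        peak = max(s for s, ok in zip(row_scores, allowed) if ok)
        exps = [math.exp(s - peak) if ok else 0.0
                for s, ok in zip(row_scores, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical toy scene graph as a chain: person(0) - cup(1) - table(2).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
dist = hop_distances(adj)

# Uniform raw scores, so the effect of the hop mask is easy to see.
scores = [[0.0] * 3 for _ in range(3)]
one_hop = hop_masked_attention(scores, dist, max_hops=1)
two_hop = hop_masked_attention(scores, dist, max_hops=2)
```

With a one-hop limit, the person node attends only to itself and the cup; raising the limit to two hops lets it reach the table as well. A learned model would likely use softer, per-hop weighting rather than this hard cutoff.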

Findings

1. Significant performance improvements on VCR and related tasks.

2. Effective utilization of scene graph structures enhances reasoning accuracy.

3. Each proposed component contributes to the overall performance boost.

Abstract

Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as a strong reasoning ability. Recently, multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning (VCR), by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs in commonsense reasoning. To exploit the scene graph structure, at the model structure level, we propose a multihop graph transformer for regularizing attention interaction among hops. As for…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Residual Connection · Layer Normalization · Dropout · Label Smoothing · Byte Pair Encoding