TL;DR
This paper introduces RVL-BERT, a multimodal transformer that uses visual and linguistic commonsense knowledge to improve visual relationship detection. It adds modules that capture spatial information, and it decouples object detection from relationship recognition.
Contribution
It proposes a novel multimodal transformer architecture with a spatial module and a mask attention module, which reasons about visual relationships using commonsense knowledge learned through self-supervised pre-training.
Findings
Achieves competitive results on challenging datasets.
Effectively incorporates visual-linguistic commonsense knowledge.
Decouples object detection from relationship recognition.
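To make the decoupling and spatial-information points concrete, here is a minimal sketch of a relationship stage that consumes only bounding boxes, so the boxes could come from any detector or from ground-truth annotations. This is not RVL-BERT's actual spatial module: the names `Box`, `iou`, and `spatial_features` are hypothetical, and the feature set (normalized center offsets, log size ratios, overlap) is an assumption based on common practice, not on the paper.

```python
import math
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_features(subj: Box, obj: Box) -> List[float]:
    """Pairwise spatial descriptor for a (subject, object) pair.

    Assumed feature set, for illustration only:
    center offset normalized by subject size, log width/height ratios, IoU.
    Boxes are assumed to have positive width and height.
    """
    sw, sh = subj.x2 - subj.x1, subj.y2 - subj.y1
    ow, oh = obj.x2 - obj.x1, obj.y2 - obj.y1
    dx = ((obj.x1 + obj.x2) / 2 - (subj.x1 + subj.x2) / 2) / sw
    dy = ((obj.y1 + obj.y2) / 2 - (subj.y1 + subj.y2) / 2) / sh
    return [dx, dy, math.log(ow / sw), math.log(oh / sh), iou(subj, obj)]


# The boxes below stand in for output from any detector or annotation file.
person = Box(0.0, 0.0, 2.0, 2.0)
bike = Box(1.0, 1.0, 3.0, 3.0)
features = spatial_features(person, bike)
```

Because `spatial_features` takes plain boxes, the detector can be swapped without touching the relationship stage. In a full model, a vector like this would be combined with visual and linguistic inputs before predicting the relationship.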
Abstract
Visual relationship detection aims to reason over relationships among salient objects in images, which has drawn increasing attention over the past few years. Inspired by human reasoning mechanisms, it is believed that external visual commonsense knowledge is beneficial for reasoning visual relationships of objects in images, which is however rarely considered in existing methods. In this paper, we propose a novel approach named Relational Visual-Linguistic Bidirectional Encoder Representations from Transformers (RVL-BERT), which performs relational reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training with multimodal representations. RVL-BERT also uses an effective spatial module and a novel mask attention module to explicitly capture spatial information among the objects. Moreover, our model decouples object detection from visual…
