VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection

Yu Cui; Moshiur Farazi

arXiv:2206.09111 · cs.CV · June 22, 2022 · 1 citation

Open Access

TL;DR

VReBERT introduces a simple, BERT-like transformer model for visual relationship detection that jointly processes visual and semantic features, outperforming state-of-the-art models and significantly improving zero-shot predicate prediction.

Contribution

The paper presents VReBERT, a novel transformer-based VRD model that jointly learns visual and semantic features, enhancing predicate prediction accuracy and zero-shot generalization.

Findings

01. Outperforms state-of-the-art VRD models in predicate prediction.

02. Improves zero-shot predicate prediction by +8.49 R@50 and +8.99 R@100.

03. Uses a multi-stage training strategy to jointly process visual and semantic features.
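
For readers unfamiliar with the R@50 and R@100 figures above: Recall@K in VRD is commonly computed by ranking each image's predicted (subject, predicate, object) triplets by confidence, keeping the top K, and measuring the fraction of ground-truth triplets that appear among them. The sketch below illustrates that common definition only; it is not the authors' evaluation code. The function name `recall_at_k`, the triplet encoding, and the choice to pool hits over the whole dataset are assumptions made for illustration. In the zero-shot setting, the ground truth would typically be restricted to triplets not seen during training.

```python
from typing import List, Tuple

# Hypothetical encoding: (subject index, predicate label, object index).
Triplet = Tuple[int, str, int]
Scored = Tuple[float, Triplet]


def recall_at_k(predictions: List[List[Scored]],
                ground_truth: List[List[Triplet]],
                k: int) -> float:
    """Fraction of ground-truth triplets found in each image's top-k predictions.

    Hits and ground-truth counts are pooled over all images (an assumption;
    some evaluations average per-image recall instead).
    """
    hits = 0
    total = 0
    for preds, gts in zip(predictions, ground_truth):
        # Rank this image's predictions by score, highest first, and keep k.
        ranked = sorted(preds, key=lambda p: p[0], reverse=True)[:k]
        top_k = {triplet for _, triplet in ranked}
        unique_gts = set(gts)
        hits += sum(1 for t in unique_gts if t in top_k)
        total += len(unique_gts)
    return hits / total if total else 0.0


# Toy data: one image, three scored predictions, two ground-truth triplets.
toy_predictions = [[
    (0.9, (0, "on", 1)),
    (0.8, (0, "near", 1)),
    (0.1, (1, "under", 0)),
]]
toy_ground_truth = [[(0, "on", 1), (1, "under", 0)]]

r_at_1 = recall_at_k(toy_predictions, toy_ground_truth, k=1)
r_at_3 = recall_at_k(toy_predictions, toy_ground_truth, k=3)
```

With this toy data, only one of the two ground-truth triplets is ranked first, so recall is lower at K=1 than at K=3, where both are recovered. A reported gain such as +8.49 R@50 would then be the difference between two models' values of this metric at K=50.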

Abstract

Visual Relationship Detection (VRD) impels a computer vision model to 'see' beyond an individual object instance and 'understand' how different objects in a scene are related. The traditional approach to VRD is to first detect objects in an image and then separately predict the relationship between the detected object instances. Such a disjoint approach is prone to predicting redundant relationship tags (i.e., predicates) with similar semantic meaning for the same object pair, or incorrect ones that have a similar meaning to the ground truth but are semantically incorrect. To remedy this, we propose to jointly train a VRD model with visual object features and semantic relationship features. To this end, we propose VReBERT, a BERT-like transformer model for Visual Relationship Detection with a multi-stage training strategy to jointly process visual and semantic features. We show that our…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques