VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection
Yu Cui, Moshiur Farazi

TL;DR
VReBERT introduces a simple, BERT-like transformer model for visual relationship detection that jointly processes visual and semantic features, outperforming state-of-the-art models and significantly improving zero-shot predicate prediction.
Contribution
The paper presents VReBERT, a novel transformer-based VRD model that jointly learns visual and semantic features, enhancing predicate prediction accuracy and zero-shot generalization.
Findings
Outperforms state-of-the-art VRD models in predicate prediction.
Improves zero-shot predicate prediction by +8.49 R@50 and +8.99 R@100 (recall among the top 50 and top 100 predictions).
Uses a multi-stage training strategy for joint visual and semantic feature processing.
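The R@50 and R@100 figures above are Recall@K scores: the fraction of ground-truth relationships that appear among a model's K highest-scoring predictions. The following is a minimal sketch of that metric for a single image, not the paper's evaluation code. The function name `recall_at_k`, the triplet layout, and the example data are assumptions made for illustration, and this summary does not state the exact protocol the authors used (for example, how results are aggregated across images).

```python
from typing import Iterable, Set, Tuple

# Hypothetical layout: a relationship is a (subject, predicate, object) triplet.
Triplet = Tuple[str, str, str]


def recall_at_k(
    scored_predictions: Iterable[Tuple[Triplet, float]],
    ground_truth: Set[Triplet],
    k: int,
) -> float:
    """Fraction of ground-truth triplets found among the k highest-scoring predictions."""
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one triplet")
    # Rank predictions from highest to lowest confidence and keep the top k.
    ranked = sorted(scored_predictions, key=lambda item: item[1], reverse=True)
    top_k = {triplet for triplet, _score in ranked[:k]}
    # Count how many annotated relationships were recovered.
    return len(top_k & ground_truth) / len(ground_truth)


# Illustrative data only (not taken from the paper).
predictions = [
    (("person", "rides", "horse"), 0.9),
    (("person", "on", "horse"), 0.8),
    (("horse", "on", "grass"), 0.4),
    (("person", "wears", "hat"), 0.3),
]
annotated = {("person", "rides", "horse"), ("person", "wears", "hat")}

print(recall_at_k(predictions, annotated, k=2))  # 0.5
print(recall_at_k(predictions, annotated, k=4))  # 1.0
```

In the example, the near-synonymous prediction ("person", "on", "horse") takes a top-2 slot without matching any annotation, which illustrates how redundant predicates with similar meaning can lower recall at small K.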
Abstract
Visual Relationship Detection (VRD) requires a computer vision model to 'see' beyond an individual object instance and 'understand' how different objects in a scene are related. The traditional approach to VRD first detects objects in an image and then separately predicts the relationship between the detected object instances. Such a disjoint approach is prone to predicting redundant relationship tags (i.e., predicates) with similar semantic meaning for the same object pair, or tags that are close in meaning to the ground truth but still semantically incorrect. To remedy this, we propose to jointly train a VRD model with visual object features and semantic relationship features. To this end, we propose VReBERT, a BERT-like transformer model for Visual Relationship Detection with a multi-stage training strategy to jointly process visual and semantic features. We show that our…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
