Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
Tim Salzmann, Markus Ryll, Alex Bewley, Matthias Minderer

TL;DR
This paper introduces a simple, decoder-free Transformer architecture for open-vocabulary visual relationship detection. It trains end to end without separate relationship modules or decoders and reports state-of-the-art results at real-time inference speeds.
Contribution
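The authors propose a decoder-free Transformer image encoder that represents objects as tokens and models their relationships implicitly. An attention mechanism then selects the object pairs likely to form a relationship, which removes the separate relationship modules or decoders that prior methods add to object detectors.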
Findings
State-of-the-art relationship detection performance on the Visual Genome and GQA benchmarks.
Inference at real-time speeds.
Effective zero-shot relationship detection.
Abstract
Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and…
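The pair-selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name select_pairs, the projection matrices w_subj and w_obj, the dot-product scoring, the exclusion of self-pairs, and the top-k rule are all assumptions made for illustration. The sketch only shows how object tokens could be projected into subject and object roles, scored pairwise, and reduced to the highest-scoring (subject, object) pairs.

```python
import numpy as np

def select_pairs(tokens, w_subj, w_obj, k):
    """Return the k highest-scoring (subject, object) index pairs.

    tokens: (n, d) array of object tokens from an image encoder.
    w_subj, w_obj: (d, h) projections into subject and object roles
                   (hypothetical names, not taken from the paper).
    k: number of pairs to keep; assumed to be at most n * (n - 1).
    """
    subj = tokens @ w_subj                  # (n, h) subject-role embeddings
    obj = tokens @ w_obj                    # (n, h) object-role embeddings
    scores = (subj @ obj.T).astype(float)   # scores[i, j]: subject i, object j
    # Simplifying assumption of this sketch: a token is never paired with itself.
    np.fill_diagonal(scores, -np.inf)
    # Sort all n*n entries in descending order and keep the first k.
    flat = np.argsort(scores, axis=None)[::-1][:k]
    pairs = np.stack(np.unravel_index(flat, scores.shape), axis=1)  # (k, 2)
    return pairs, scores

# Usage with random stand-in data (6 object tokens of width 8).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
w_subj = rng.normal(size=(8, 4))
w_obj = rng.normal(size=(8, 4))
pairs, scores = select_pairs(tokens, w_subj, w_obj, k=5)
```

Each row of pairs holds a subject index and an object index. A downstream step, not shown here, would classify the relationship for each selected pair instead of for all n² candidates, which is one plausible reason a selection step like this keeps inference cheap.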
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques
