Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection

Tim Salzmann; Markus Ryll; Alex Bewley; Matthias Minderer

arXiv:2403.14270·cs.CV·July 22, 2024·1 cites

TL;DR

This paper introduces a simple, decoder-free Transformer architecture for open-vocabulary visual relationship detection. It trains end-to-end without separate relationship modules or decoders and reports state-of-the-art results on Visual Genome and GQA.

Contribution

The authors propose a decoder-free Transformer model whose image encoder represents objects as tokens and models their relationships implicitly. An attention mechanism then selects the object pairs likely to form a relationship, which removes the need for a separate relationship module.

Findings

01 State-of-the-art relationship detection performance on the Visual Genome and GQA benchmarks.

02 Real-time inference speeds.

03 Effective zero-shot relationship detection.

Abstract

Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and…
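
The pair-selection idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that pair scores come from a scaled dot product between two projections of the object tokens, one for the subject role and one for the object role, and that the highest-scoring pairs are kept. The names `project`, `select_pairs`, `w_subject`, `w_object` and `top_k` are hypothetical, and excluding self-pairs is a choice made only for this illustration.

```python
import heapq


def project(tokens, weight):
    """Multiply each token (a row vector) by a weight matrix: tokens @ weight."""
    n_out = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(n_out)]
        for tok in tokens
    ]


def select_pairs(tokens, w_subject, w_object, top_k):
    """Score ordered (subject, object) token pairs and keep the top_k.

    Assumption of this sketch: the score is a scaled dot product between a
    subject projection and an object projection, as in one attention head.
    Returns a list of (score, subject_index, object_index), best first.
    """
    subjects = project(tokens, w_subject)
    objects = project(tokens, w_object)
    scale = len(subjects[0]) ** 0.5
    scored = []
    for s, sub in enumerate(subjects):
        for o, obj in enumerate(objects):
            if s == o:
                continue  # self-pairs are skipped in this sketch only
            score = sum(a * b for a, b in zip(sub, obj)) / scale
            scored.append((score, s, o))
    # nlargest returns the top_k entries sorted from highest to lowest score.
    return heapq.nlargest(top_k, scored)


if __name__ == "__main__":
    # Three toy object tokens of dimension 2 and hand-picked projections.
    toy_tokens = [[1.0, 0.0], [0.0, 1.0], [3.0, 1.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    w_obj = [[0.0, 2.0], [1.0, 0.0]]
    for score, s, o in select_pairs(toy_tokens, identity, w_obj, top_k=2):
        print(f"subject token {s} -> object token {o}: score {score:.3f}")
```

Scoring all ordered pairs costs time quadratic in the number of object tokens, so keeping only the top-scoring pairs limits how many candidates need a relationship label afterwards. How the paper itself scores, filters and classifies pairs is not specified in the text above.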

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques