Vision Relation Transformer for Unbiased Scene Graph Generation

Gopika Sudhakaran; Devendra Singh Dhami; Kristian Kersting; Stefan Roth

arXiv:2308.09472 · cs.CV · August 21, 2023 · 1 citation

TL;DR

This paper introduces VETO, a novel relation encoder, and MEET, a learning strategy, to improve scene graph generation by reducing information loss and bias, significantly boosting accuracy and model efficiency.

Contribution

The paper proposes VETO and MEET, novel methods that address information loss and bias in scene graph generation, achieving state-of-the-art performance with smaller models.

Findings

01. VETO + MEET improves predictive performance on SGG by up to 47% over the state of the art.

02. The combined approach is roughly 10 times smaller than state-of-the-art models.

03. Experiments on the Visual Genome (VG) and GQA datasets validate its effectiveness.

Abstract

Recent years have seen a growing interest in Scene Graph Generation (SGG), a comprehensive visual scene understanding task that aims to predict entity relationships using a relation encoder-decoder pipeline stacked on top of an object encoder-decoder backbone. Unfortunately, current SGG methods suffer from a loss of information about the entities' local-level cues during the relation encoding process. To mitigate this, we introduce the Vision rElation TransfOrmer (VETO), consisting of a novel local-level entity relation encoder. We further observe that many existing SGG methods claim to be unbiased, but are still biased towards either head or tail classes. To overcome this bias, we introduce a Mutually Exclusive ExperT (MEET) learning strategy that captures important relation features without bias towards head or tail classes. Experimental results on the VG and GQA datasets demonstrate…
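
As a rough illustration of the terms used in the abstract, the sketch below represents a scene graph as a list of (subject, predicate, object) triplets and counts how often each predicate occurs, which is one simple way to see what "head" (frequent) and "tail" (rare) relation classes mean. The `Relation` type, the `predicate_counts` helper, and the toy annotations are hypothetical and chosen for illustration only; they are not taken from the paper, the visinf/veto repository, or the VG and GQA datasets.

```python
from collections import Counter
from typing import NamedTuple


class Relation(NamedTuple):
    """One edge of a scene graph: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical toy annotations, not drawn from VG or GQA.
scene_graph = [
    Relation("man", "on", "horse"),
    Relation("man", "wearing", "hat"),
    Relation("horse", "on", "grass"),
    Relation("hat", "on", "man"),
    Relation("man", "riding", "horse"),
]


def predicate_counts(relations):
    """Count how often each predicate label occurs."""
    return Counter(r.predicate for r in relations)


counts = predicate_counts(scene_graph)

# The most frequent predicate plays the role of a "head" class.
head_class, head_count = counts.most_common(1)[0]

# Predicates seen only once play the role of "tail" classes here.
tail_classes = sorted(p for p, c in counts.items() if c == 1)

print(head_class, head_count)
print(tail_classes)
```

In this toy graph the generic predicate "on" dominates while the more informative "riding" appears once. A model trained on such a skewed label distribution can score well by favouring the frequent predicate, which is the kind of bias the abstract refers to.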

Code & Models

Repositories

visinf/veto (PyTorch, official)

Videos

Vision Relation Transformer for Unbiased Scene Graph Generation · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning