Vision Relation Transformer for Unbiased Scene Graph Generation
Gopika Sudhakaran, Devendra Singh Dhami, Kristian Kersting, Stefan Roth

TL;DR
This paper introduces VETO, a novel relation encoder, and MEET, a learning strategy, to improve scene graph generation by reducing information loss and bias, significantly boosting accuracy and model efficiency.
Contribution
The paper proposes VETO and MEET, novel methods that address information loss and bias in scene graph generation, achieving state-of-the-art performance with smaller models.
Findings
VETO + MEET improves SGG accuracy by up to 47%.
The combined approach shrinks the model by a factor of 10.
Experiments on the VG and GQA datasets validate its effectiveness.
Abstract
Recent years have seen a growing interest in Scene Graph Generation (SGG), a comprehensive visual scene understanding task that aims to predict entity relationships using a relation encoder-decoder pipeline stacked on top of an object encoder-decoder backbone. Unfortunately, current SGG methods suffer from an information loss regarding the entities' local-level cues during the relation encoding process. To mitigate this, we introduce the Vision rElation TransfOrmer (VETO), consisting of a novel local-level entity relation encoder. We further observe that many existing SGG methods claim to be unbiased, but are still biased towards either head or tail classes. To overcome this bias, we introduce a Mutually Exclusive ExperT (MEET) learning strategy that captures important relation features without bias towards head or tail classes. Experimental results on the VG and GQA datasets demonstrate…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
