Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation
Xingning Dong, Tian Gan, Xuemeng Song, Jianlong Wu, Yuan Cheng, Liqiang Nie

TL;DR
This paper introduces a novel hybrid-attention encoder and a group collaborative learning decoder to improve scene graph generation, effectively reducing bias and enhancing predicate prediction accuracy.
Contribution
It proposes a stacked hybrid-attention network for better modality fusion and a group collaborative learning strategy to address class imbalance in scene graph generation.
Findings
Achieves state-of-the-art performance on unbiased metrics on the Visual Genome (VG) and GQA datasets.
Nearly doubles the performance of baseline methods.
Reduces bias in predicate prediction.
Abstract
Scene Graph Generation, which generally follows a regular encoder-decoder pipeline, aims to first encode the visual contents within the given image and then parse them into a compact summary graph. Existing SGG approaches generally not only neglect the insufficient modality fusion between vision and language, but also fail to provide informative predicates due to the biased relationship predictions, leading SGG far from practical. Towards this end, in this paper, we first present a novel Stacked Hybrid-Attention network, which facilitates the intra-modal refinement as well as the inter-modal interaction, to serve as the encoder. We then devise an innovative Group Collaborative Learning strategy to optimize the decoder. Particularly, based upon the observation that the recognition capability of one classifier is limited towards an extremely unbalanced dataset, we first deploy a group of…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
