Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation
Qianji Di, Wenxi Ma, Zhongang Qi, Tianxiang Hou, Ying Shan, Hanzi Wang

TL;DR
This paper introduces TISGG, a novel text-image joint learning model for scene graph generation that effectively predicts unseen triples and addresses dataset bias, achieving state-of-the-art results.
Contribution
The paper proposes a joint feature learning and factual knowledge refinement framework with balanced learning strategies to improve unseen triple prediction in scene graph generation.
Findings
Boosts zero-shot recall by 11.7% on Visual Genome.
Achieves state-of-the-art performance in scene graph generation.
Effectively handles long-tailed distribution and unseen triples.
Abstract
Scene Graph Generation (SGG) aims to structurally and comprehensively represent objects and their connections in images, which can significantly benefit scene understanding and related downstream tasks. Existing SGG models often struggle with the long-tailed problem caused by biased datasets. However, even when these models fit specific datasets better, they may still find it hard to resolve unseen triples, that is, triples not included in the training set. Most methods feed in a whole triple and learn its overall features through statistical machine learning. Such models have difficulty predicting unseen triples because the objects and predicates from the training set are recombined into novel triples in the test set. In this work, we propose a Text-Image-joint Scene Graph Generation (TISGG) model to resolve the unseen triples and improve the generalisation capability of…
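To make the notion of an "unseen triple" concrete, the following is a minimal Python sketch of how a zero-shot evaluation split could be identified. It is not taken from the paper: the function name `unseen_triples`, the `Triple` alias, and the toy data are all hypothetical, and the sketch assumes a triple counts as unseen when each of its parts appears somewhere in training but the full combination does not.

```python
from typing import Iterable, List, Tuple

# Hypothetical alias: a triple is (subject, predicate, object).
Triple = Tuple[str, str, str]


def unseen_triples(train: Iterable[Triple], test: Iterable[Triple]) -> List[Triple]:
    """Return test triples whose combination never occurs in training,
    although their objects and predicate each occur there individually."""
    train_set = set(train)
    # Object categories seen in training, as subject or as object.
    train_entities = {s for s, _, _ in train_set} | {o for _, _, o in train_set}
    # Predicate categories seen in training.
    train_predicates = {p for _, p, _ in train_set}

    result = []
    for triple in test:
        s, p, o = triple
        parts_known = (
            s in train_entities and o in train_entities and p in train_predicates
        )
        if triple not in train_set and parts_known:
            result.append(triple)
    return result


# Toy data, invented for illustration only.
train = [("man", "riding", "horse"), ("dog", "on", "beach"), ("man", "on", "beach")]
test = [("man", "riding", "horse"), ("dog", "riding", "horse"), ("cat", "on", "beach")]

zero_shot = unseen_triples(train, test)
print(zero_shot)
```

Here `("dog", "riding", "horse")` is a novel recombination of known parts, while `("cat", "on", "beach")` is excluded because `cat` never appears in the toy training set. A model that memorises whole triples has no direct evidence for the former, which is the failure mode the abstract describes.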
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: ALIGN
