TL;DR
This paper introduces a Triplet Calibration and Reduction framework for zero-shot Scene Graph Generation. It improves the model's ability to generalize to unseen triplets by regularizing triplet representations and focusing on plausible unseen compositions.
Contribution
The paper proposes a triplet calibration loss, an unseen space reduction loss, and a contextual encoder, which together improve zero-shot scene graph generation.
Findings
Achieves consistent improvements over state-of-the-art methods in zero-shot SGG.
Effectively regularizes triplet representations to discover unseen triplets.
Reduces the unseen space to focus on plausible unseen compositions.
Abstract
Scene Graph Generation (SGG) plays a pivotal role in downstream vision-language tasks. Existing SGG methods typically suffer from poor compositional generalization to unseen triplets. They are generally trained on incompletely annotated scene graphs that contain dominant triplets, and they tend to become biased toward these seen triplets during inference. To address this issue, we propose a Triplet Calibration and Reduction (T-CAR) framework in this paper. In our framework, a triplet calibration loss is first presented to regularize the representations of diverse triplets and, simultaneously, to mine the unseen triplets in incompletely annotated training scene graphs. Moreover, the unseen space of scene graphs is usually several times larger than the seen space, since it contains a huge number of unrealistic compositions. Thus, we propose an unseen space reduction loss to shift the attention of…
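To make the seen/unseen imbalance concrete, here is a minimal sketch that enumerates the full (subject, predicate, object) space for a toy vocabulary and splits it into seen and unseen triplets. The vocabulary, the annotations, and the helper names `triplet_spaces` and `reduce_unseen` are hypothetical. In particular, `reduce_unseen` is a simple co-occurrence filter assumed for illustration only; it is not the paper's unseen space reduction loss, which the abstract describes as a training objective rather than a hard filter.

```python
from itertools import product


def triplet_spaces(objects, predicates, seen_triplets):
    """Split the full (subject, predicate, object) space into seen and unseen sets."""
    full = set(product(objects, predicates, objects))
    seen = set(seen_triplets) & full  # ignore annotations outside the vocabulary
    unseen = full - seen
    return full, seen, unseen


def reduce_unseen(unseen, seen):
    """Hypothetical co-occurrence filter (not the T-CAR loss): keep an unseen triplet
    only if its (subject, predicate) pair and its (predicate, object) pair each
    appear in at least one seen triplet."""
    subj_pred = {(s, p) for s, p, _ in seen}
    pred_obj = {(p, o) for _, p, o in seen}
    return {t for t in unseen if (t[0], t[1]) in subj_pred and (t[1], t[2]) in pred_obj}


# Toy vocabulary and annotations (hypothetical).
objects = ["man", "woman", "horse", "hat", "shirt"]
predicates = ["riding", "wearing"]
annotated = [("man", "riding", "horse"), ("woman", "wearing", "hat"), ("man", "wearing", "shirt")]

full, seen, unseen = triplet_spaces(objects, predicates, annotated)
plausible = reduce_unseen(unseen, seen)
print(len(full), len(seen), len(unseen))  # 50 3 47
print(sorted(plausible))  # [('man', 'wearing', 'hat'), ('woman', 'wearing', 'shirt')]
```

With 5 object classes and 2 predicates, the full space has 5 × 2 × 5 = 50 triplets: 3 are seen and 47 are unseen, and most of the unseen ones (such as "hat riding man") are unrealistic. The toy filter keeps only 2 of them, which mirrors, in a much cruder way, the motivation for reducing the unseen space to plausible compositions.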
