TL;DR
This paper introduces an energy-based learning framework for scene graph generation that captures structural information, improving performance and data efficiency, especially in low-data scenarios.
Contribution
The paper proposes a novel energy-based framework that incorporates scene graph structure into learning, enhancing existing models' performance and data efficiency.
Findings
Up to 21% performance improvement on Visual Genome
Up to 27% performance improvement on GQA
Effective in zero- and few-shot learning scenarios
Abstract
Traditional scene graph generation methods are trained using cross-entropy losses that treat objects and relationships as independent entities. Such a formulation, however, ignores the structure in the output space, in an inherently structured prediction problem. In this work, we introduce a novel energy-based learning framework for generating scene graphs. The proposed formulation allows for efficiently incorporating the structure of scene graphs in the output space. This additional constraint in the learning framework acts as an inductive bias and allows models to learn efficiently from a small number of labels. We use the proposed energy-based framework to train existing state-of-the-art models and obtain a significant performance improvement, of up to 21% and 27%, on the Visual Genome and GQA benchmark datasets, respectively. Furthermore, we showcase the learning efficiency of the…
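The contrast the abstract draws can be sketched in a few lines of plain Python. This is a toy illustration only, not the paper's model or loss: the function names, the hand-written `COST` table, and the contrastive form "energy of the ground-truth graph minus energy of the predicted graph" are assumptions chosen for illustration. The factorized loss sums independent per-object and per-relationship terms, while the energy term scores each (subject, predicate, object) triple jointly.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of one label under a categorical distribution."""
    return -math.log(probs[target])

def factorized_loss(node_probs, node_targets, edge_probs, edge_targets):
    """Sum of independent cross-entropy terms: no term sees any other prediction."""
    node_term = sum(cross_entropy(p, t) for p, t in zip(node_probs, node_targets))
    edge_term = sum(cross_entropy(p, t) for p, t in zip(edge_probs, edge_targets))
    return node_term + edge_term

def toy_energy(labels, edges, cost, default=1.0):
    """Hypothetical graph energy: total cost of all (subject, predicate, object)
    triples. Lower energy means a more plausible joint configuration."""
    return sum(cost.get((labels[s], pred, labels[o]), default) for s, pred, o in edges)

def energy_loss(gt, pred, cost):
    """Assumed contrastive form: E(ground truth) - E(prediction)."""
    return toy_energy(gt[0], gt[1], cost) - toy_energy(pred[0], pred[1], cost)

# Hand-written stand-in for a learned energy function (illustrative values).
COST = {("man", "riding", "horse"): 0.25, ("man", "wearing", "horse"): 2.0}
GT = (["man", "horse"], [(0, "riding", 1)])      # (node labels, edges)
PRED = (["man", "horse"], [(0, "wearing", 1)])   # same nodes, implausible predicate

if __name__ == "__main__":
    print(energy_loss(GT, PRED, COST))
```

In an actual energy-based setup the energy would be a learned network over the image and graph rather than a lookup table; the table here only shows how a joint score can penalize an implausible triple that per-label losses evaluate in isolation.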
Taxonomy
Methods: Region Proposal Network · Convolution · Softmax · RoIAlign · Mask R-CNN
