Less is More: Toward Zero-Shot Local Scene Graph Generation via Foundation Models
Shu Zhao, Huijuan Xu

TL;DR
This paper introduces ELEGANT, a zero-shot framework for local scene graph generation built on foundation models. It abstracts structural information from a subset of objects and their relationships to support reasoning in downstream tasks.
Contribution
It proposes a novel zero-shot local scene graph generation task and a framework leveraging foundation models for perception and reasoning without labeled supervision.
Findings
Outperforms baselines in open-ended evaluation under the ECLIPSE metric.
Achieves up to a 24.58% improvement over prior methods in the closed-set setting.
Demonstrates strong reasoning capabilities of foundation models in structural understanding.
Abstract
Humans inherently recognize objects via selective visual perception, transform specific regions from the visual field into structured symbolic knowledge, and reason about the relationships among regions based on the allocation of limited attention resources in line with their goals. While this is intuitive for humans, contemporary perception systems falter in extracting structural information due to the intricate cognitive abilities and commonsense knowledge required. To fill this gap, we present a new task called Local Scene Graph Generation. Distinct from the conventional scene graph generation task, which encompasses generating all objects and relationships in an image, our proposed task aims to abstract pertinent structural information with partial objects and their relationships for boosting downstream tasks that demand advanced comprehension and reasoning capabilities.…
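To make the contrast between a full scene graph and a local one concrete, here is a minimal sketch of the data involved. It is not ELEGANT's method, which works from the image with foundation models. It only illustrates the kind of output the task asks for, assuming a scene graph is a list of (subject, predicate, object) triplets. The `Triplet` class, the `local_scene_graph` helper, and the example labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One relationship in a scene graph (hypothetical representation)."""
    subject: str
    predicate: str
    obj: str


def local_scene_graph(full_graph, focus_objects):
    """Keep only the triplets whose subject and object are both in focus.

    Illustration only: a real system would produce the local graph
    directly from the image rather than filter a complete graph.
    """
    focus = set(focus_objects)
    return [t for t in full_graph if t.subject in focus and t.obj in focus]


# A made-up full scene graph covering every object in an image.
full_graph = [
    Triplet("person", "riding", "horse"),
    Triplet("horse", "standing on", "grass"),
    Triplet("tree", "behind", "fence"),
    Triplet("person", "wearing", "hat"),
]

# The local graph keeps only relationships among the attended objects.
local = local_scene_graph(full_graph, ["person", "horse", "hat"])
for t in local:
    print(t.subject, t.predicate, t.obj)
```

In this toy example the local graph retains the two relationships among the person, the horse, and the hat, and drops the background ones.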
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
