Less is More: Toward Zero-Shot Local Scene Graph Generation via Foundation Models

Shu Zhao; Huijuan Xu

arXiv:2310.01356·cs.CV·October 3, 2023

TL;DR

This paper introduces ELEGANT, a zero-shot framework for local scene graph generation using foundation models, which abstracts structural information with partial objects to enhance reasoning in downstream tasks.

Contribution

It proposes a novel zero-shot local scene graph generation task and a framework leveraging foundation models for perception and reasoning without labeled supervision.

Findings

01

Outperforms baselines in open-ended evaluation under the ECLIPSE metric.

02

Achieves up to a 24.58% improvement over prior methods in the closed-set setting.

03

Demonstrates strong reasoning capabilities of foundation models in structural understanding.

Abstract

Humans inherently recognize objects via selective visual perception, transform specific regions of the visual field into structured symbolic knowledge, and reason about the relationships among those regions, allocating limited attention resources in line with their goals. While this is intuitive for humans, contemporary perception systems falter in extracting structural information because of the intricate cognitive abilities and commonsense knowledge required. To fill this gap, we present a new task called Local Scene Graph Generation. Distinct from the conventional scene graph generation task, which generates all objects and relationships in an image, our proposed task aims to abstract pertinent structural information with partial objects and their relationships for boosting downstream tasks that demand advanced comprehension and reasoning capabilities.…
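
To make the distinction between a full and a local scene graph concrete, the following is a minimal illustrative sketch, not the paper's method. It assumes a scene graph can be stored as a list of (subject, predicate, object) triplets and that the partial objects of interest are supplied by hand; the `Relation` class, the `local_scene_graph` helper, and the example triplets are all hypothetical. It does not model how the partial objects are chosen or how relationships are predicted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One edge of a scene graph: subject --predicate--> obj (hypothetical type)."""
    subject: str
    predicate: str
    obj: str


def local_scene_graph(relations, focus_objects):
    """Keep only the relations whose subject and object are both in focus.

    This is plain subgraph filtering, used here only to illustrate
    "partial objects and their relationships".
    """
    focus = set(focus_objects)
    return [r for r in relations if r.subject in focus and r.obj in focus]


# A made-up "full" scene graph covering every object in an image.
full_graph = [
    Relation("man", "riding", "horse"),
    Relation("man", "wearing", "hat"),
    Relation("horse", "on", "grass"),
    Relation("tree", "behind", "fence"),
]

# A local scene graph restricted to a hand-picked subset of objects.
local_graph = local_scene_graph(full_graph, ["man", "horse", "hat"])

for r in local_graph:
    print(r.subject, r.predicate, r.obj)
```

In this toy example the local graph keeps the two relations among the man, the horse, and the hat, and drops everything involving the grass, tree, and fence.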

Code & Models

Repositories

bowen-upenn/Multi-Agent-VQA
pytorch

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques