LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations

Mingjie Xu; Mengyang Wu; Yuzhi Zhao; Jason Chun Lok Li; Weifeng Ou

arXiv:2412.06322·cs.CV·December 10, 2024

TL;DR

LLaVA-SpaceSGG is a multimodal large language model designed for open-vocabulary scene graph generation, effectively modeling spatial relations and improving scene understanding in complex vision tasks.

Contribution

The paper introduces a new dataset, SpaceSGG, and a two-stage training paradigm for LLaVA-SpaceSGG, enhancing open-vocabulary SGG with better spatial relation modeling.

Findings

1. Outperforms existing open-vocabulary SGG methods

2. Boosts recall by 8.6%

3. Increases mean recall by 28.4%
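
To make the difference between the two reported metrics concrete, here is a minimal sketch of recall versus mean recall over relation triplets. It is a simplified illustration, not the paper's evaluation code: the function name `recall_and_mean_recall` and the toy triplets are assumptions, and it uses exact triplet matching, whereas real SGG benchmarks typically rank predictions (top-K) and match bounding boxes as well.

```python
from collections import defaultdict

def recall_and_mean_recall(ground_truth, predicted):
    """Return (recall, mean_recall) for (subject, predicate, object) triplets.

    Hypothetical helper for illustration only: exact matching, no top-K cutoff.
    """
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one triplet")

    predicted_set = set(predicted)
    hits = defaultdict(int)    # correctly recovered triplets per predicate
    totals = defaultdict(int)  # ground-truth triplets per predicate

    for triplet in ground_truth:
        predicate = triplet[1]
        totals[predicate] += 1
        if triplet in predicted_set:
            hits[predicate] += 1

    # Recall: fraction of all ground-truth triplets that were recovered.
    recall = sum(hits.values()) / len(ground_truth)
    # Mean recall: per-predicate recall, averaged so rare predicates count equally.
    mean_recall = sum(hits[p] / totals[p] for p in totals) / len(totals)
    return recall, mean_recall

# Toy data (made up): the frequent predicate "on" is recovered, the rare "behind" is missed.
ground_truth = [
    ("man", "on", "horse"),
    ("cup", "on", "table"),
    ("dog", "on", "sofa"),
    ("lamp", "behind", "sofa"),
]
predicted = [
    ("man", "on", "horse"),
    ("cup", "on", "table"),
    ("dog", "on", "sofa"),
]
recall, mean_recall = recall_and_mean_recall(ground_truth, predicted)
```

In this toy case recall is 0.75 while mean recall is 0.5, because missing the single "behind" triplet zeroes out that predicate's recall. This is why mean recall is the more sensitive of the two to rare relation types.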

Abstract

Scene Graph Generation (SGG) converts visual scenes into structured graph representations, providing deeper scene understanding for complex vision tasks. However, existing SGG models often overlook essential spatial relationships and struggle with generalization in open-vocabulary contexts. To address these limitations, we propose LLaVA-SpaceSGG, a multimodal large language model (MLLM) designed for open-vocabulary SGG with enhanced spatial relation modeling. To train it, we collect the SGG instruction-tuning dataset, named SpaceSGG. This dataset is constructed by combining publicly available datasets and synthesizing data using open-source models within our data construction pipeline. It combines object locations, object relations, and depth information, resulting in three data formats: spatial SGG description, question-answering, and conversation. To enhance the transfer of MLLMs'…
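
As a rough illustration of the "structured graph representation" the abstract describes, the sketch below stores objects with a location and a depth value and links them with relation triplets. Everything here is an assumption made for illustration: the class names `SceneObject` and `SceneGraph`, the field layout, and the depth-comparison rule are hypothetical and do not reflect the actual SpaceSGG data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass(frozen=True)
class SceneObject:
    name: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed pixel coords
    depth: float                            # assumed convention: smaller means closer to camera

@dataclass
class SceneGraph:
    objects: List[SceneObject] = field(default_factory=list)
    # Each relation is (subject index, predicate, object index).
    relations: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_object(self, obj: SceneObject) -> int:
        """Add a node and return its index."""
        self.objects.append(obj)
        return len(self.objects) - 1

    def add_relation(self, subj: int, predicate: str, obj: int) -> None:
        """Add a directed, labeled edge between two existing nodes."""
        self.relations.append((subj, predicate, obj))

    def triplets(self) -> List[Tuple[str, str, str]]:
        """Return relations as readable (subject, predicate, object) names."""
        return [(self.objects[s].name, p, self.objects[o].name)
                for s, p, o in self.relations]

def depth_relation(a: SceneObject, b: SceneObject) -> str:
    """Hypothetical rule deriving a spatial predicate from depth alone."""
    if a.depth < b.depth:
        return "in front of"
    if a.depth > b.depth:
        return "behind"
    return "level with"

graph = SceneGraph()
person = graph.add_object(SceneObject("person", (40, 30, 120, 220), depth=2.0))
tree = graph.add_object(SceneObject("tree", (100, 10, 200, 230), depth=6.5))
graph.add_relation(person, depth_relation(graph.objects[person], graph.objects[tree]), tree)
```

Calling `graph.triplets()` on this toy graph yields `[("person", "in front of", "tree")]`. A free-form predicate string, rather than a fixed enum, is used here to mirror the open-vocabulary setting.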

Code & Models

Repositories

endlinc/llava-spacesgg (official)

Models

🤗 wumengyangok/LLaVA-SpaceSGG (model, 24 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques