SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation

Hang Zhang; Zhuoling Li; Jun Liu

arXiv:2412.11026 · cs.CV · May 8, 2025

Open Access

TL;DR

SceneLLM is a framework that uses large language models for dynamic scene graph generation: it transforms video frames into linguistic signals (scene tokens), encodes spatial information into those tokens, and fine-tunes the model for improved scene understanding.

Contribution

The paper presents SceneLLM, an approach to dynamic scene graph generation that combines an LLM with a Video-to-Language (V2L) mapping module, a Spatial Information Aggregation (SIA) scheme for spatial encoding, and implicit language signals.

Findings

01 Achieves state-of-the-art results on the Action Genome benchmark.

02 Effectively encodes spatio-temporal information into linguistic signals.

03 Demonstrates improved scene understanding and graph accuracy.

Abstract

Dynamic scenes contain intricate spatio-temporal information, crucial for mobile robots, UAVs, and autonomous driving systems to make informed decisions. Parsing these scenes into semantic triplets <Subject-Predicate-Object> for accurate Scene Graph Generation (SGG) is highly challenging due to the fluctuating spatio-temporal complexity. Inspired by the reasoning capabilities of Large Language Models (LLMs), we propose SceneLLM, a novel framework that leverages LLMs as powerful scene analyzers for dynamic SGG. Our framework introduces a Video-to-Language (V2L) mapping module that transforms video frames into linguistic signals (scene tokens), making the input more comprehensible for LLMs. To better encode spatial information, we devise a Spatial Information Aggregation (SIA) scheme, inspired by the structure of Chinese characters, which encodes spatial data into tokens. Using Optimal…
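
To make the target output concrete, the following is a minimal sketch of how the <Subject-Predicate-Object> triplets of a dynamic scene graph could be stored per video frame. It is an illustration only, not the authors' implementation: the names `Triplet` and `build_dynamic_scene_graph`, as well as the example labels, are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One <Subject-Predicate-Object> relation (hypothetical container)."""
    subject: str
    predicate: str
    obj: str


def build_dynamic_scene_graph(frame_triplets):
    """Group (frame_index, Triplet) pairs into a per-frame set of relations."""
    graph = defaultdict(set)
    for frame_index, triplet in frame_triplets:
        graph[frame_index].add(triplet)
    return dict(graph)


# Illustrative labels only; they are assumptions, not results from the paper.
observations = [
    (0, Triplet("person", "holding", "cup")),
    (0, Triplet("person", "sitting_on", "sofa")),
    (1, Triplet("person", "drinking_from", "cup")),
]
scene_graph = build_dynamic_scene_graph(observations)
```

Keying the graph by frame index keeps the temporal dimension explicit, so the relations of consecutive frames can be compared directly.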

Taxonomy

Topics: Topic Modeling · Semantic Web and Ontologies · Multimodal Machine Learning Applications