SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation
Hang Zhang, Zhuoling Li, Jun Liu

TL;DR
SceneLLM is a framework that uses large language models for dynamic scene graph generation: it transforms video frames into linguistic signals (scene tokens), encodes spatial information into those tokens, and fine-tunes the LLM for improved scene understanding.
Contribution
The paper presents SceneLLM, an approach to dynamic scene graph generation that combines an LLM with a Video-to-Language (V2L) mapping module, a Spatial Information Aggregation (SIA) scheme for encoding spatial information, and implicit language signals.
Findings
Achieves state-of-the-art results on the Action Genome benchmark.
Effectively encodes spatio-temporal information into linguistic signals.
Demonstrates improved scene understanding and graph accuracy.
Abstract
Dynamic scenes contain intricate spatio-temporal information, crucial for mobile robots, UAVs, and autonomous driving systems to make informed decisions. Parsing these scenes into semantic triplets <Subject-Predicate-Object> for accurate Scene Graph Generation (SGG) is highly challenging due to the fluctuating spatio-temporal complexity. Inspired by the reasoning capabilities of Large Language Models (LLMs), we propose SceneLLM, a novel framework that leverages LLMs as powerful scene analyzers for dynamic SGG. Our framework introduces a Video-to-Language (V2L) mapping module that transforms video frames into linguistic signals (scene tokens), making the input more comprehensible for LLMs. To better encode spatial information, we devise a Spatial Information Aggregation (SIA) scheme, inspired by the structure of Chinese characters, which encodes spatial data into tokens. Using Optimal…
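To make the output format concrete, the following is a minimal sketch of how a dynamic scene graph could be stored as per-frame <Subject-Predicate-Object> triplets. It is an illustration only, not the SceneLLM implementation: the names Triplet, build_dynamic_scene_graph, and predicate_changes, and the example relation labels, are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One <Subject-Predicate-Object> relation observed in a single frame."""
    subject: str
    predicate: str
    obj: str


def build_dynamic_scene_graph(frame_relations):
    """Group (frame_index, subject, predicate, object) tuples by frame.

    Returns a dict mapping each frame index to the triplets seen in it.
    """
    graph = defaultdict(list)
    for frame_index, subject, predicate, obj in frame_relations:
        graph[frame_index].append(Triplet(subject, predicate, obj))
    return dict(graph)


def predicate_changes(graph, subject, obj):
    """Return (frame_index, predicates) each time the predicate set
    between a given subject and object differs from the previous frame."""
    changes = []
    previous = None
    for frame_index in sorted(graph):
        current = frozenset(
            t.predicate
            for t in graph[frame_index]
            if t.subject == subject and t.obj == obj
        )
        if current != previous:
            changes.append((frame_index, sorted(current)))
            previous = current
    return changes


# Hypothetical example: three frames of a person interacting with a cup.
relations = [
    (0, "person", "holding", "cup"),
    (0, "person", "looking_at", "cup"),
    (1, "person", "holding", "cup"),
    (1, "person", "looking_at", "cup"),
    (2, "person", "drinking_from", "cup"),
]
graph = build_dynamic_scene_graph(relations)
changes = predicate_changes(graph, "person", "cup")
```

Here changes records only the frames where the relation between the pair shifts, which is the temporal aspect that distinguishes dynamic SGG from per-image scene graphs.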
Taxonomy
Topics: Topic Modeling · Semantic Web and Ontologies · Multimodal Machine Learning Applications
