TL;DR
SceneGraphVLM is a compact method for generating scene graphs from images and videos with small vision-language models, aiming for a good balance between graph quality and inference speed.
Contribution
The paper proposes a two-stage training approach (supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards) and a token-efficient TOON graph serialization format for scene graph generation.
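To illustrate why a tabular serialization can be shorter than JSON, here is a minimal sketch. The header-plus-rows layout, the function names, and the example graph are assumptions made for illustration; the summary does not specify the exact layout the paper uses. Character count stands in as a rough proxy for token count.

```python
import json

def to_json(objects, relations):
    """Serialize a scene graph as verbose JSON (field names repeated per item)."""
    return json.dumps({
        "objects": [{"id": i, "label": label} for i, label in enumerate(objects)],
        "relations": [
            {"subject": s, "predicate": p, "object": o} for s, p, o in relations
        ],
    })

def to_compact(objects, relations):
    """Serialize the same graph with one header per table and one row per item.

    Hypothetical layout: field names appear once in the header, not per row.
    """
    lines = [f"objects[{len(objects)}]{{id,label}}:"]
    lines += [f"  {i},{label}" for i, label in enumerate(objects)]
    lines.append(f"relations[{len(relations)}]{{subject,predicate,object}}:")
    lines += [f"  {s},{p},{o}" for s, p, o in relations]
    return "\n".join(lines)

# Illustrative graph: relations refer to objects by index.
example_objects = ["man", "horse", "grass"]
example_relations = [(0, "riding", 1), (1, "on", 2)]

json_text = to_json(example_objects, example_relations)
compact_text = to_compact(example_objects, example_relations)
```

Because field names are written once per table instead of once per item, the saving grows with the number of objects and relations in the graph.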
Findings
Achieves a strong quality-speed trade-off with approximately one-second latency.
Improves precision-oriented scene graph metrics while maintaining reasonable recall.
Supports conditioning on previous frames for lightweight video scene graph generation.
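Conditioning on the previous frame can be sketched as a simple loop that feeds each generated graph into the next prompt. The prompt wording, `build_prompt`, and the `generate` callable (a stand-in for a model call) are all hypothetical; this is a sketch of the idea, not the paper's implementation.

```python
from typing import Callable, Iterable, List, Optional

def build_prompt(previous_graph: Optional[str]) -> str:
    """Build the text prompt, optionally appending the previous frame's graph."""
    base = "Generate the scene graph for this frame."
    if previous_graph is None:
        return base
    return f"{base}\nPrevious frame graph:\n{previous_graph}"

def generate_video_graphs(
    frames: Iterable[object],
    generate: Callable[[object, str], str],
) -> List[str]:
    """Run a per-frame generator, passing along only the last generated graph.

    `generate(frame, prompt)` is an assumed interface for the model call.
    No tracking state is kept beyond the previous graph's text.
    """
    graphs: List[str] = []
    previous: Optional[str] = None
    for frame in frames:
        graph = generate(frame, build_prompt(previous))
        graphs.append(graph)
        previous = graph
    return graphs
```

The only state carried between frames is the serialized graph text, which keeps the context short-term and avoids a separate tracking component.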
Abstract
Scene graph generation provides a compact structured representation for visual perception, but accurate and fast graph prediction from images and videos remains challenging. Recent VLM-based methods can generate scene graphs end-to-end as structured text, yet often produce long outputs with irrelevant objects and relations. We present SceneGraphVLM, a compact method for image and video scene graph generation with small visual language models. SceneGraphVLM serializes graphs in a token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations. For videos, the model can optionally condition each frame on the previously generated graph, providing lightweight short-term context without tracking or…
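A reward of the kind the abstract describes could look like the following minimal sketch. The F-beta combination, the penalty weights, and exact-match comparison of triplets are assumptions chosen for illustration; the abstract does not give the actual reward formula.

```python
from typing import Iterable, NamedTuple

class Triplet(NamedTuple):
    subject: str
    predicate: str
    obj: str

def hallucination_aware_reward(
    predicted: Iterable[Triplet],
    reference: Iterable[Triplet],
    beta: float = 1.0,             # assumed: relative weight of recall vs. precision
    object_penalty: float = 0.5,   # assumed weight for unsupported objects
    relation_penalty: float = 0.25,  # assumed weight for unsupported relations
) -> float:
    """Score a predicted graph: F-beta over triplets minus hallucination penalties."""
    pred, ref = set(predicted), set(reference)
    if not pred:
        return 0.0

    matched = pred & ref
    precision = len(matched) / len(pred)
    recall = len(matched) / len(ref) if ref else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

    # Objects and relations that appear in the prediction but not in the reference.
    ref_objects = {t.subject for t in ref} | {t.obj for t in ref}
    pred_objects = {t.subject for t in pred} | {t.obj for t in pred}
    object_rate = len(pred_objects - ref_objects) / len(pred_objects)
    relation_rate = len(pred - ref) / len(pred)

    return f_score - object_penalty * object_rate - relation_penalty * relation_rate

# Illustrative usage with a made-up reference graph.
reference = [Triplet("man", "riding", "horse"), Triplet("horse", "on", "grass")]
exact = hallucination_aware_reward(reference, reference)
hallucinated = hallucination_aware_reward(
    [Triplet("man", "riding", "horse"), Triplet("dog", "near", "man")], reference
)
```

Under these assumptions, a prediction that swaps a correct relation for one involving an unsupported object loses reward twice: once through lower precision and recall, and again through the explicit penalties.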
