SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

Vladislav Makarov; Mark Gizetdinov; Dmitry Yudin

arXiv:2605.13667·cs.CV·May 14, 2026

TL;DR

SceneGraphVLM introduces a compact, efficient method for generating scene graphs from images and videos using small vision-language models, balancing quality and speed.

Contribution

It proposes a two-stage training approach (supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards) and a token-efficient graph serialization format for scene graph generation.
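To make the idea of a hallucination-aware reward concrete, here is a minimal sketch in Python. It is not the paper's reward: the function name `hallucination_aware_reward`, the F-beta combination of triple precision and recall, and the `obj_penalty` weight are all assumptions chosen for illustration. It only shows how one might trade off relation coverage against precision while penalizing objects that have no support in the reference graph.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def hallucination_aware_reward(
    pred: Set[Triple],
    gold: Set[Triple],
    beta: float = 1.0,         # hypothetical: weight of recall relative to precision
    obj_penalty: float = 0.5,  # hypothetical: weight of the unsupported-object penalty
) -> float:
    """Illustrative reward: F-beta over triples minus a penalty for unsupported objects."""
    if not pred:
        return 0.0

    matched = pred & gold
    precision = len(matched) / len(pred)
    recall = len(matched) / len(gold) if gold else 0.0

    # F-beta balances relation precision and relation coverage (recall).
    denom = beta**2 * precision + recall
    f_beta = (1 + beta**2) * precision * recall / denom if denom > 0 else 0.0

    # Objects mentioned in the prediction but absent from the reference graph.
    gold_objects = {name for s, _, o in gold for name in (s, o)}
    pred_objects = {name for s, _, o in pred for name in (s, o)}
    unsupported = pred_objects - gold_objects
    penalty = obj_penalty * len(unsupported) / len(pred_objects)

    return f_beta - penalty


gold = {("person", "holding", "cup"), ("cup", "on", "table")}
pred = {("person", "holding", "cup"), ("dog", "on", "table")}
reward = hallucination_aware_reward(pred, gold)
```

In this toy case one of two predicted triples matches and one of four predicted objects ("dog") is unsupported, so the reward falls below the plain F-score. A real reward would also need a matching rule for near-synonymous labels, which this sketch omits.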

Findings

01

Achieves a strong quality-speed trade-off with approximately one-second latency.

02

Improves precision-oriented scene graph metrics while maintaining reasonable recall.

03

Supports conditioning on previous frames for lightweight video scene graph generation.
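Conditioning each frame on the previously generated graph can be pictured as a simple loop. The sketch below is an assumption-laden illustration, not the released code: `generate` stands in for whatever model call produces a serialized graph, and the prompt wording and the helper names `build_prompt` and `video_scene_graphs` are hypothetical.

```python
from typing import Any, Callable, List, Optional, Sequence


def build_prompt(prev_graph: Optional[str]) -> str:
    """Build the text prompt, optionally appending the previous frame's graph."""
    base = "Generate a scene graph for this frame."  # hypothetical wording
    if prev_graph is None:
        return base
    return f"{base}\nPrevious frame graph:\n{prev_graph}"


def video_scene_graphs(
    frames: Sequence[Any],
    generate: Callable[[Any, str], str],  # hypothetical model call: (frame, prompt) -> graph text
) -> List[str]:
    """Run the model frame by frame, feeding each output graph into the next prompt."""
    graphs: List[str] = []
    prev: Optional[str] = None
    for frame in frames:
        graph = generate(frame, build_prompt(prev))
        graphs.append(graph)
        prev = graph  # short-term context only: just the last graph is carried over
    return graphs
```

Because only the last graph is carried forward, the context cost per frame stays bounded by the size of one serialized graph, which is the sense in which this kind of conditioning is lightweight.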

Abstract

Scene graph generation provides a compact structured representation for visual perception, but accurate and fast graph prediction from images and videos remains challenging. Recent VLM-based methods can generate scene graphs end-to-end as structured text, yet often produce long outputs with irrelevant objects and relations. We present SceneGraphVLM, a compact method for image and video scene graph generation with small vision-language models. SceneGraphVLM serializes graphs in a token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations. For videos, the model can optionally condition each frame on the previously generated graph, providing lightweight short-term context without tracking or…
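As a rough illustration of why a tabular serialization can be more compact than JSON, the sketch below writes relation triples in a TOON-style layout: a header that declares the row count and field names once, followed by one comma-separated row per triple. The key name `relations`, the field names, and the helper `to_toon_like` are assumptions for this example; the exact schema used by SceneGraphVLM is not given in the abstract. The comparison is by character count, which is only a proxy for token count.

```python
import json
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def to_toon_like(triples: List[Triple]) -> str:
    """Serialize triples as a TOON-style table: field names appear once, in the header."""
    header = f"relations[{len(triples)}]{{subject,predicate,object}}:"
    rows = [f"  {s},{p},{o}" for s, p, o in triples]
    return "\n".join([header] + rows)


triples = [("person", "holding", "cup"), ("cup", "on", "table")]

toon_text = to_toon_like(triples)

# Equivalent JSON repeats every key for every triple.
json_text = json.dumps(
    {"relations": [{"subject": s, "predicate": p, "object": o} for s, p, o in triples]}
)

print(toon_text)
print(len(toon_text), "characters vs", len(json_text), "characters as JSON")
```

The saving grows with the number of triples, since the per-row overhead in the tabular form is just the separators. Values containing commas or newlines would need quoting, which this sketch does not handle.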

Code & Models

Repositories

markus0440/SceneGraphVLM.git (GitHub)