Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching

Xin Hu; Ke Qin; Wen Yin; Yuan-Fang Li; Ming Li; Tao He

arXiv:2604.18623·cs.CV·April 22, 2026

TL;DR

FlowSG introduces a progressive, generative approach to scene graph generation that models the task as continuous-time transport, improving over traditional one-shot classification methods.

Contribution

It recasts scene graph generation as a flow-based, progressive process combining discrete tokens and continuous geometry, enabling more accurate and flexible graph synthesis.

Findings

01. Achieves consistent improvements in predicate and graph-level metrics on the VG and PSG datasets.

02. Demonstrates the effectiveness of flow-matching losses and discrete tokens in scene graph generation.

03. Outperforms state-of-the-art methods by about 3 points on average.

Abstract

Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting boxes and subject-predicate-object triples. Yet most pipelines treat SGG as a one-shot, deterministic classification problem rather than a genuinely progressive, generative task. We propose FlowSG, which recasts SGG as continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a scene graph (e.g., continuous visual features) into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors for categorical tokens (object…
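
To make the "continuous-time transport" idea in the abstract concrete, here is a minimal sketch of generic flow matching applied to a single bounding box. It is not the FlowSG implementation. The straight-line path, the function names, the box layout, and the oracle velocity field (standing in for the graph Transformer) are all assumptions made for illustration, and the discrete-token updates described in the abstract are not covered.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (noise) to x1 (data) at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field along the linear path."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predicted_v, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted_v, target)) / len(target)

def euler_transport(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical normalized box as (left, top, right, bottom).
TARGET_BOX = [0.2, 0.3, 0.6, 0.8]

def oracle_velocity(x, t):
    """Toy stand-in for a learned network: points straight at the known target."""
    return [(b - xi) / (1.0 - t) for xi, b in zip(x, TARGET_BOX)]

random.seed(0)
noisy_box = [random.gauss(0.0, 1.0) for _ in range(4)]  # start from noise
final_box = euler_transport(noisy_box, oracle_velocity, steps=10)
```

In a trained model, `oracle_velocity` would be replaced by a network conditioned on image features, fitted by minimizing something like `flow_matching_loss` at randomly sampled times `t`. The oracle here only shows that integrating the velocity field carries a noisy box to the target.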
