SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation
Jiaye Feng, Qixiang Yin, Yuankun Liu, Tong Mo, Weiping Li

TL;DR
SGG-R$^{\rm 3}$ introduces an end-to-end structured reasoning framework for unbiased scene graph generation, combining supervised fine-tuning and reinforcement learning to improve relation coverage and mitigate bias.
Contribution
The paper presents a novel multi-stage framework integrating chain-of-thought guided fine-tuning and group sequence policy optimization for unbiased scene graph generation.
Findings
Outperforms existing methods on two benchmarks.
Effectively mitigates long-tail relation bias.
Enhances relation coverage through semantic clustering.
Abstract
Scene Graph Generation (SGG) structures visual scenes as graphs of objects and their relations. While Multimodal Large Language Models (MLLMs) have advanced end-to-end SGG, current methods are hindered by both a lack of task-specific structured reasoning and the challenges of sparse, long-tailed relation distributions, resulting in incomplete scene graphs characterized by low recall and biased predictions. To address these issues, we introduce SGG-R$^{\rm 3}$, a structured reasoning framework that integrates task-specific chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and reinforcement learning (RL) with group sequence policy optimization (GSPO), designed to engage in three sequential stages to achieve end-to-end unbiased scene graph generation. During the SFT phase, we propose a relation augmentation strategy by leveraging an MLLM and refined via embedding similarity…
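The abstract is cut off before it explains how the embedding-similarity refinement works, so the following is only a minimal sketch of one plausible reading, not the authors' method. It represents a scene graph as subject-predicate-object triplets and keeps an MLLM-proposed relation only if its predicate embedding is close enough to a predicate in the dataset vocabulary. The `Triplet` class, the `filter_candidates` function, the toy two-dimensional embeddings, and the 0.8 threshold are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Triplet:
    """One scene-graph edge: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_candidates(candidates, embed, vocabulary, threshold=0.8):
    """Map each candidate predicate to its nearest vocabulary predicate.

    A candidate is kept (with the predicate rewritten to the vocabulary
    entry) only if the similarity reaches `threshold`. The threshold is
    an assumed value, not one taken from the paper.
    """
    kept = []
    for t in candidates:
        best = max(vocabulary, key=lambda p: cosine(embed[t.predicate], embed[p]))
        if cosine(embed[t.predicate], embed[best]) >= threshold:
            kept.append(Triplet(t.subject, best, t.obj))
    return kept


# Toy embeddings (hypothetical); a real system would use a text encoder.
embed = {
    "riding": [1.0, 0.0],
    "sitting on": [0.0, 1.0],
    "rides": [0.9, 0.1],
    "near to": [0.5, 0.5],
}
vocabulary = ["riding", "sitting on"]
candidates = [
    Triplet("person", "rides", "horse"),   # close to "riding", kept
    Triplet("person", "near to", "fence"),  # ambiguous, dropped
]
kept = filter_candidates(candidates, embed, vocabulary)
print(kept)
```

In this toy setup the free-form predicate "rides" is normalized to the vocabulary entry "riding", while "near to" falls below the threshold and is discarded. Raising or lowering the threshold trades relation coverage against label noise.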
