Synthetic Visual Genome

Jae Sung Park; Zixian Ma; Linjie Li; Chenhao Zheng; Cheng-Yu Hsieh; Ximing Lu; Khyathi Chandu; Quan Kong; Norimasa Kobori; Ali Farhadi; Yejin Choi; Ranjay Krishna

arXiv:2506.07643·cs.CV·June 10, 2025

TL;DR

This paper introduces ROBIN, a multimodal language model instruction-tuned on densely annotated relationships from a synthetic scene graph dataset, achieving state-of-the-art results in visual relationship reasoning and comprehension tasks.

Contribution

The paper presents ROBIN, an MLM trained on SVG, a synthetic scene graph dataset, together with SG-EDIT, a self-distillation framework for refining predicted scene graphs, to improve dense scene graph generation and visual reasoning.

Findings

01

ROBIN-3B outperforms larger models on relationship understanding benchmarks.

02

ROBIN scores 88.9 on referring expression comprehension, surpassing the previous best.

03

Training on refined scene graph data is key to high performance in visual reasoning.

Abstract

Reasoning over visual relationships (spatial, functional, interactional, social, etc.) is considered to be a fundamental component of human cognition. Yet, despite the major advances in visual comprehension in multimodal language models (MLMs), precise reasoning over relationships and their generations remains a challenge. We introduce ROBIN: an MLM instruction-tuned with densely annotated relationships capable of constructing high-quality dense scene graphs at scale. To train ROBIN, we curate SVG, a synthetic scene graph dataset by completing the missing relations of selected objects in existing scene graphs using a teacher MLM and a carefully designed filtering process to ensure high quality. To generate more accurate and rich scene graphs at scale for any image, we introduce SG-EDIT: a self-distillation framework where GPT-4o further refines ROBIN's predicted scene graphs by removing…
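To make the idea of a scene graph with typed relations concrete, here is a minimal sketch in Python. It is not the SVG schema or ROBIN's output format: the class names, the `missing_pairs` helper, and the use of the four relation types from the abstract as a closed label set are all assumptions made for illustration. The helper only lists object pairs that have no relation yet, which is the kind of gap a teacher model could be asked to fill for selected objects.

```python
from dataclasses import dataclass, field

# Relation types named in the abstract. Treating them as a closed label set
# is an assumption for this sketch, not the dataset's actual taxonomy.
CATEGORIES = {"spatial", "functional", "interactional", "social"}


@dataclass(frozen=True)
class Relation:
    subject: int    # index into SceneGraph.objects
    predicate: str  # e.g. "riding"
    obj: int        # index into SceneGraph.objects
    category: str   # one of CATEGORIES


@dataclass
class SceneGraph:
    objects: list[str] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)

    def add_relation(self, subject: int, predicate: str, obj: int, category: str) -> None:
        """Add a directed, typed edge between two existing objects."""
        if category not in CATEGORIES:
            raise ValueError(f"unknown relation category: {category}")
        for idx in (subject, obj):
            if not 0 <= idx < len(self.objects):
                raise IndexError(f"object index out of range: {idx}")
        self.relations.append(Relation(subject, predicate, obj, category))

    def triples(self) -> list[tuple[str, str, str]]:
        """Return relations as readable (subject, predicate, object) triples."""
        return [
            (self.objects[r.subject], r.predicate, self.objects[r.obj])
            for r in self.relations
        ]

    def missing_pairs(self, selected: list[int]) -> list[tuple[int, int]]:
        """Hypothetical helper: ordered pairs (s, o), with s in `selected`,
        that have no relation yet."""
        existing = {(r.subject, r.obj) for r in self.relations}
        return [
            (s, o)
            for s in selected
            for o in range(len(self.objects))
            if o != s and (s, o) not in existing
        ]


graph = SceneGraph(objects=["person", "bicycle", "helmet"])
graph.add_relation(0, "riding", 1, "interactional")
print(graph.triples())
print(graph.missing_pairs([0]))
```

Storing relations as index pairs rather than object names keeps two instances of the same class (for example, two people) distinct, which matters once a graph is densely annotated.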

Code & Models

Models

jamepark3922/robin-qwen2.5-3b (model, 1.3k downloads)

Datasets

jamepark3922/svg (dataset, 119 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Topic Modeling