SATURN: Autoregressive Image Generation Guided by Scene Graphs

Thanh-Nhan Vo; Trong-Thuan Nguyen; Tam V. Nguyen; Minh-Triet Tran

arXiv:2508.14502 · cs.CV · August 21, 2025

TL;DR

SATURN is a lightweight autoregressive model that improves scene-graph-guided image generation by translating scene graphs into token sequences, significantly enhancing fidelity and object-relation accuracy without complex pipelines.

Contribution

Introduces SATURN, a method that converts scene graphs into salience-ordered token sequences for autoregressive image generation, outperforming prior graph-guided approaches such as SG2IM and SGDiff on FID and Inception Score.

Findings

01. Reduces FID from 56.45 to 21.62 on Visual Genome

02. Increases Inception Score from 16.03 to 24.78

03. Outperforms prior methods like SG2IM and SGDiff

Abstract

State-of-the-art text-to-image models excel at photorealistic rendering but often struggle to capture the layout and object relationships implied by complex prompts. Scene graphs provide a natural structural prior, yet previous graph-guided approaches have typically relied on heavy GAN or diffusion pipelines, which lag behind modern autoregressive architectures in both speed and fidelity. We introduce SATURN (Structured Arrangement of Triplets for Unified Rendering Networks), a lightweight extension to VAR-CLIP that translates a scene graph into a salience-ordered token sequence, enabling a frozen CLIP-VQ-VAE backbone to interpret graph structure while fine-tuning only the VAR transformer. On the Visual Genome dataset, SATURN reduces FID from 56.45 to 21.62 and increases the Inception Score from 16.03 to 24.78, outperforming prior methods such as SG2IM and SGDiff without requiring…
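To make the idea of a salience-ordered token sequence concrete, here is a minimal Python sketch of how a scene graph might be flattened into a single text prompt. It is an illustration only, not the authors' implementation: the abstract does not say how salience is computed, so the `salience` field, the `Triplet` class, and the `linearize` helper are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One (subject, predicate, object) edge of a scene graph."""
    subject: str
    predicate: str
    obj: str
    # Hypothetical score; the abstract does not define salience.
    # One plausible choice would be the relative area of the subject's box.
    salience: float


def linearize(triplets: list[Triplet], sep: str = ", ") -> str:
    """Order triplets from most to least salient and join them into one prompt."""
    # sorted() is stable, so equally salient triplets keep their input order.
    ordered = sorted(triplets, key=lambda t: t.salience, reverse=True)
    return sep.join(f"{t.subject} {t.predicate} {t.obj}" for t in ordered)


graph = [
    Triplet("cup", "on", "table", salience=0.05),
    Triplet("man", "riding", "horse", salience=0.60),
    Triplet("horse", "standing on", "grass", salience=0.35),
]
prompt = linearize(graph)
print(prompt)  # man riding horse, horse standing on grass, cup on table
```

In a pipeline like the one the abstract describes, a string of this kind would presumably be fed to the frozen text encoder, so the ordering is the only place where graph structure enters the model's conditioning.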

Taxonomy

Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques