SATURN: Autoregressive Image Generation Guided by Scene Graphs
Thanh-Nhan Vo, Trong-Thuan Nguyen, Tam V. Nguyen, Minh-Triet Tran

TL;DR
SATURN is a lightweight autoregressive model that improves scene graph-guided image generation by translating scene graphs into token sequences, significantly enhancing fidelity and object relation accuracy without complex pipelines.
Contribution
Introduces SATURN, a novel method that converts scene graphs into token sequences for autoregressive image generation, outperforming prior graph-guided approaches in speed and quality.
Findings
Reduces FID from 56.45 to 21.62 on Visual Genome
Increases Inception Score from 16.03 to 24.78
Outperforms prior methods like SG2IM and SGDiff
Abstract
State-of-the-art text-to-image models excel at photorealistic rendering but often struggle to capture the layout and object relationships implied by complex prompts. Scene graphs provide a natural structural prior, yet previous graph-guided approaches have typically relied on heavy GAN or diffusion pipelines, which lag behind modern autoregressive architectures in both speed and fidelity. We introduce SATURN (Structured Arrangement of Triplets for Unified Rendering Networks), a lightweight extension to VAR-CLIP that translates a scene graph into a salience-ordered token sequence, enabling a frozen CLIP-VQ-VAE backbone to interpret graph structure while fine-tuning only the VAR transformer. On the Visual Genome dataset, SATURN reduces FID from 56.45 to 21.62 and increases the Inception Score from 16.03 to 24.78, outperforming prior methods such as SG2IM and SGDiff without requiring…
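
As a rough illustration of what "translating a scene graph into a salience-ordered token sequence" could look like, the sketch below serializes (subject, predicate, object) triplets into a single text prompt, most salient first. This summary does not specify how SATURN measures salience or how it formats the sequence, so the `Triplet` class, the `salience` score, the `graph_to_prompt` helper, and the comma separator are all hypothetical choices made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Triplet:
    """One scene-graph edge: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str
    # Hypothetical salience score (assumption): higher means more prominent.
    # It could, for example, be the fraction of the image the subject covers.
    salience: float


def graph_to_prompt(triplets: Iterable[Triplet], sep: str = ", ") -> str:
    """Serialize triplets into one string, most salient triplet first.

    sorted() is stable, so triplets with equal salience keep their input order.
    """
    ordered = sorted(triplets, key=lambda t: -t.salience)
    return sep.join(f"{t.subject} {t.predicate} {t.obj}" for t in ordered)


# Toy scene graph with made-up salience values.
graph = [
    Triplet("cup", "on", "table", 0.10),
    Triplet("man", "sitting on", "chair", 0.45),
    Triplet("chair", "next to", "table", 0.30),
]

prompt = graph_to_prompt(graph)
print(prompt)  # man sitting on chair, chair next to table, cup on table
```

In a VAR-CLIP-style setup, a string like this would presumably be passed to the frozen text encoder as the conditioning prompt; that step is not shown here because its details are not given in this summary.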
Taxonomy
Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
