Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation

Fernando Gabriela Garcia; Spencer Burns; Ryan Shaw; Hunter Young

arXiv:2507.04151·cs.CV·July 8, 2025

TL;DR

This paper presents Hi-SSLVLM, a self-supervised hierarchical model that improves text-to-image synthesis by enabling detailed control over complex prompts without extensive labeled data.

Contribution

The paper introduces a two-stage self-supervised learning framework that combines hierarchical captioning with internal compositional planning to improve control over image generation.

Findings

01 Outperforms leading baselines on multiple benchmarks
02 Achieves higher fidelity and compositional accuracy
03 Validated by human evaluations

Abstract

This paper introduces Hierarchical Self-Supervised LVLM (Hi-SSLVLM), a novel generative model designed to significantly advance text-to-image synthesis, particularly for complex and compositionally challenging prompts. Traditional methods often grapple with the high cost of meticulously curated paired image-text datasets and struggle with precise control over fine-grained visual attributes and intricate spatial relationships. Our Hi-SSLVLM addresses these limitations through a unique two-stage self-supervised learning strategy. The first stage, Multi-Granularity Visual-Language Grounding, enables the Large Vision-Language Model (LVLM) backbone to autonomously generate and align hierarchical captions (global and local) to images, cultivating a deep internal semantic understanding without reliance on extensive human annotation. The second stage, Self-Refinement and Guided Image…
