Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation
Fernando Gabriela Garcia, Spencer Burns, Ryan Shaw, Hunter Young

TL;DR
This paper presents Hi-SSLVLM, a self-supervised hierarchical model that improves text-to-image synthesis by enabling detailed control over complex prompts without extensive labeled data.
Contribution
Introduces a two-stage self-supervised learning framework that combines hierarchical captioning with internal compositional planning to improve control over image generation.
Findings
Outperforms leading baselines on multiple benchmarks
Achieves higher fidelity and compositional accuracy
Validated by human evaluations
Abstract
This paper introduces Hierarchical Self-Supervised LVLM (Hi-SSLVLM), a novel generative model designed to significantly advance text-to-image synthesis, particularly for complex and compositionally challenging prompts. Traditional methods often grapple with the high cost of meticulously curated paired image-text datasets and struggle with precise control over fine-grained visual attributes and intricate spatial relationships. Our Hi-SSLVLM addresses these limitations through a unique two-stage self-supervised learning strategy. The first stage, Multi-Granularity Visual-Language Grounding, enables the Large Vision-Language Model (LVLM) backbone to autonomously generate and align hierarchical captions (global and local) to images, cultivating a deep internal semantic understanding without reliance on extensive human annotation. The second stage, Self-Refinement and Guided Image…
