LVLM-Composer's Explicit Planning for Image Generation
Spencer Ramsey, Jeffrey Lee, Amina Grant

TL;DR
LVLM-Composer is a 10-billion-parameter large vision-language model that uses hierarchical semantic planning and fine-grained alignment to improve compositional accuracy in text-to-image generation, especially for complex scenes with multiple objects and attributes.
Contribution
The paper introduces a hierarchical semantic planning module and a multi-stage training paradigm to strengthen compositional reasoning in large vision-language models.
Findings
Outperforms state-of-the-art baselines on the LongBench-T2I benchmark
Achieves higher object accuracy, composition fidelity, and pose accuracy than those baselines
Human evaluations favor the perceptual quality of generated images
Abstract
The burgeoning field of generative artificial intelligence has fundamentally reshaped our approach to content creation, with Large Vision-Language Models (LVLMs) standing at its forefront. While current LVLMs have demonstrated impressive capabilities in text-to-image generation, they often falter when confronted with complex textual descriptions demanding precise compositional understanding and visual planning. This limitation particularly impacts the accurate rendering of multiple objects, their attributes, spatial relationships, and specific poses within intricate scenes, as evidenced by benchmarks like LongBench-T2I. To address these challenges, we introduce LVLM-Composer, a novel 10-billion parameter scale LVLM specifically engineered for enhanced compositional image synthesis. Our method incorporates a Hierarchical Semantic Planning Module for structured prompt decomposition and a…
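As an illustration of what "structured prompt decomposition" could look like, here is a minimal sketch of a hierarchical scene plan covering the elements the abstract lists: objects, attributes, spatial relationships, and poses. It is not the authors' module. The names `ObjectPlan`, `Relation`, `ScenePlan`, `flatten_plan`, and `unresolved_relations` are hypothetical, and the plan is written by hand here, whereas the paper presumably derives it from the prompt with the LVLM itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ObjectPlan:
    """One object in the scene (hypothetical structure, not from the paper)."""
    name: str
    attributes: List[str] = field(default_factory=list)
    pose: Optional[str] = None


@dataclass
class Relation:
    """A spatial relationship between two named objects."""
    subject: str
    predicate: str
    target: str


@dataclass
class ScenePlan:
    """Top level of the hierarchy: objects first, then relations between them."""
    objects: List[ObjectPlan] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)


def flatten_plan(plan: ScenePlan) -> List[str]:
    """Turn the plan into ordered text lines that a generator could condition on."""
    lines = []
    for obj in plan.objects:
        description = " ".join(obj.attributes + [obj.name])
        if obj.pose:
            description += f", {obj.pose}"
        lines.append(f"object: {description}")
    for rel in plan.relations:
        lines.append(f"relation: {rel.subject} {rel.predicate} {rel.target}")
    return lines


def unresolved_relations(plan: ScenePlan) -> List[Relation]:
    """Return relations that mention an object missing from the plan."""
    known = {obj.name for obj in plan.objects}
    return [
        rel for rel in plan.relations
        if rel.subject not in known or rel.target not in known
    ]


# Hand-written plan for the prompt "a small black cat sitting on a wooden table".
example_plan = ScenePlan(
    objects=[
        ObjectPlan(name="cat", attributes=["small", "black"], pose="sitting"),
        ObjectPlan(name="table", attributes=["wooden"]),
    ],
    relations=[Relation(subject="cat", predicate="on top of", target="table")],
)
```

Keeping objects and relations as separate levels makes it possible to check a plan for consistency (for example, a relation that names an object the plan never declares) before any image is generated.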
