LVLM-Composer's Explicit Planning for Image Generation

Spencer Ramsey; Jeffrey Lee; Amina Grant

arXiv:2507.04152·cs.CV·July 8, 2025

TL;DR

LVLM-Composer is a large vision-language model designed with hierarchical planning and fine-grained alignment to improve compositional accuracy in text-to-image generation, especially for complex scenes with multiple objects and attributes.

Contribution

The paper introduces a hierarchical semantic planning module and a multi-stage training paradigm to improve compositional reasoning in large vision-language models.

Findings

1. Outperforms state-of-the-art baselines on the LongBench-T2I benchmark.

2. Achieves higher object accuracy, composition fidelity, and pose accuracy.

3. Human evaluations favor the perceptual quality of the generated images.

Abstract

The burgeoning field of generative artificial intelligence has fundamentally reshaped our approach to content creation, with Large Vision-Language Models (LVLMs) standing at its forefront. While current LVLMs have demonstrated impressive capabilities in text-to-image generation, they often falter when confronted with complex textual descriptions demanding precise compositional understanding and visual planning. This limitation particularly impacts the accurate rendering of multiple objects, their attributes, spatial relationships, and specific poses within intricate scenes, as evidenced by benchmarks like LongBench-T2I. To address these challenges, we introduce LVLM-Composer, a novel 10-billion-parameter LVLM specifically engineered for enhanced compositional image synthesis. Our method incorporates a Hierarchical Semantic Planning Module for structured prompt decomposition and a…
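
To make "structured prompt decomposition" concrete, here is a minimal Python sketch of what a decomposed scene plan could look like as a data structure. This is not the paper's module. The names `ObjectSpec`, `Relation`, `ScenePlan`, `unknown_references`, and `render_plan` are hypothetical and chosen only for illustration, and the plan is written by hand rather than produced by a model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ObjectSpec:
    """One object in the scene (hypothetical structure, not from the paper)."""
    name: str
    attributes: List[str] = field(default_factory=list)
    pose: Optional[str] = None


@dataclass
class Relation:
    """A spatial relation between two named objects."""
    subject: str
    predicate: str
    target: str


@dataclass
class ScenePlan:
    """Top level: global scene description, then objects, then relations."""
    scene: str
    objects: List[ObjectSpec] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)


def unknown_references(plan: ScenePlan) -> List[str]:
    """Return names used in relations that are not declared as objects."""
    declared = {obj.name for obj in plan.objects}
    used = {n for rel in plan.relations for n in (rel.subject, rel.target)}
    return sorted(used - declared)


def render_plan(plan: ScenePlan) -> List[str]:
    """Flatten the plan into indented text lines, coarse to fine."""
    lines = [f"scene: {plan.scene}"]
    for obj in plan.objects:
        desc = " ".join(obj.attributes + [obj.name])
        if obj.pose:
            desc += f" ({obj.pose})"
        lines.append(f"  object: {desc}")
    for rel in plan.relations:
        lines.append(f"  relation: {rel.subject} {rel.predicate} {rel.target}")
    return lines


# Hand-written example plan for the prompt
# "a small red dog sitting to the left of a wooden bench in a park at dusk".
example_plan = ScenePlan(
    scene="a park at dusk",
    objects=[
        ObjectSpec("dog", ["small", "red"], pose="sitting"),
        ObjectSpec("bench", ["wooden"]),
    ],
    relations=[Relation("dog", "left of", "bench")],
)

if __name__ == "__main__":
    print("\n".join(render_plan(example_plan)))
```

Separating objects, attributes, poses, and relations in this way mirrors the error categories the abstract lists (multiple objects, their attributes, spatial relationships, and specific poses), so each can be checked independently, for example with `unknown_references` before any image is generated.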
