Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions

Eyal Gutflaish; Eliran Kachlon; Hezi Zisman; Tal Hacham; Nimrod Sarid; Alexander Visheratin; Saar Huberman; Gal Davidi; Guy Bukchin; Kfir Goldberg; Ron Mokady

arXiv:2511.06876 · cs.CV · November 11, 2025

TL;DR

This paper introduces a new open-source text-to-image model trained on structured, detailed captions to improve controllability and expressiveness, supported by a novel evaluation protocol and a fusion mechanism for processing long captions.

Contribution

The paper presents FIBO, the first open-source text-to-image model trained on long structured captions; DimFusion, a fusion mechanism for processing long captions efficiently; and TaBR, an evaluation protocol for assessing controllability.

Findings

01. FIBO achieves state-of-the-art prompt alignment.

02. Structured captions improve controllability.

03. DimFusion efficiently processes long captions.

Abstract

Text-to-image models have rapidly evolved from casual creative tools to professional-grade systems, achieving unprecedented levels of image quality and realism. Yet, most models are trained to map short prompts into detailed images, creating a gap between sparse textual input and rich visual outputs. This mismatch reduces controllability, as models often fill in missing details arbitrarily, biasing toward average user preferences and limiting precision for professional use. We address this limitation by training the first open-source text-to-image model on long structured captions, where every training sample is annotated with the same set of fine-grained attributes. This design maximizes expressive coverage and enables disentangled control over visual factors. To process long captions efficiently, we propose DimFusion, a fusion mechanism that integrates intermediate tokens from a…
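As a rough illustration of what "the same set of fine-grained attributes" for every sample could look like, the sketch below models a structured caption as a fixed-schema record and edits a single attribute while leaving the rest untouched. The class `StructuredCaption`, its field names, and the helpers `to_prompt` and `changed_fields` are hypothetical and chosen only for this example; the paper's actual schema and serialization format are not given on this page.

```python
import json
from dataclasses import dataclass, asdict, replace


@dataclass(frozen=True)
class StructuredCaption:
    # Hypothetical attribute set: every sample carries the same fields,
    # which is the property the abstract describes. The real schema may differ.
    subject: str
    background: str
    lighting: str
    camera_angle: str
    color_palette: str


def to_prompt(caption: StructuredCaption) -> str:
    """Serialize the caption to JSON with a fixed key order."""
    return json.dumps(asdict(caption), indent=2)


def changed_fields(a: StructuredCaption, b: StructuredCaption) -> list:
    """Return the names of the attributes that differ between two captions."""
    da, db = asdict(a), asdict(b)
    return [key for key in da if da[key] != db[key]]


base = StructuredCaption(
    subject="a ceramic teapot on a wooden table",
    background="plain studio backdrop",
    lighting="soft diffuse daylight",
    camera_angle="eye level",
    color_palette="muted earth tones",
)

# Change one visual factor; all other attributes are carried over unchanged.
edited = replace(base, lighting="hard side light with deep shadows")

print(to_prompt(edited))
print(changed_fields(base, edited))
```

Because the schema is fixed, an edit touches exactly one named field, which is one way to picture the disentangled control the abstract refers to. Whether the model itself responds to such an edit in a disentangled way is an empirical claim of the paper, not something this sketch demonstrates.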

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques