StruVis: Enhancing Reasoning-based Text-to-Image Generation via Thinking with Structured Vision

Yuanhuiyi Lyu; Kaiyu Lei; Ziqiao Weng; Xu Zheng; Lutao Jiang; Teng Li; Yangfu Li; Ziyuan Huang; Linfeng Zhang; Xuming Hu

arXiv:2603.06032 · cs.CV · March 9, 2026

TL;DR

StruVis introduces a text-based structured visual reasoning framework that improves reasoning-based text-to-image generation without intermediate image generation, enhancing performance across multiple benchmarks.

Contribution

StruVis is a novel, generator-agnostic framework that employs text-based visual representations to enhance reasoning in T2I generation, avoiding costly image generation steps.

Findings

1. Achieves 4.61% improvement on T2I-ReasonBench
2. Achieves 4% improvement on WISE
3. Enhances reasoning-based T2I performance significantly

Abstract

Reasoning-based text-to-image (T2I) generation requires models to interpret complex prompts accurately. Existing reasoning frameworks can be broadly categorized into two types: (1) Text-Only Reasoning, which is computationally efficient but lacks access to visual context, often resulting in the omission of critical spatial and visual elements; and (2) Text-Image Interleaved Reasoning, which leverages a T2I generator to provide visual references during the reasoning process. While this approach enhances visual grounding, it incurs substantial computational costs and constrains the reasoning capacity of MLLMs to the representational limitations of the generator. To this end, we propose StruVis, a novel framework that enhances T2I generation through Thinking with Structured Vision. Instead of relying on intermediate image generation, StruVis employs text-based structured visual…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling