ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning

Feng Han; Yang Jiao; Shaoxiang Chen; Junhao Xu; Jingjing Chen; Yu-Gang Jiang

arXiv:2506.03596 · cs.CV · November 25, 2025

TL;DR

ControlThinker introduces a visual reasoning-based framework that enriches text prompts with latent semantics from control images, improving semantic consistency and visual quality in controllable image generation.

Contribution

It presents a novel comprehend-then-generate paradigm leveraging visual reasoning and a metric-based reward to bridge the semantic gap in controllable image synthesis.

Findings

01 Enhanced semantic consistency across benchmarks

02 Improved visual quality of generated images

03 Effective mitigation of semantic gap issues

Abstract

The field of controllable image generation has seen significant advancements, with various architectures improving the layout consistency of generated images with control signals. However, contemporary methods still struggle to bridge the semantic gap between input text prompts with sparse semantics and the target images, and often over-rely on low-level control signals to infer regional details. To address this challenge, we propose ControlThinker, a novel framework that employs a "comprehend-then-generate" paradigm. First, by incentivizing the visual reasoning capability of an MLLM, latent semantics from control images are mined to enrich text prompts. This enriched semantic understanding then aids image generation without the need for additional complex modifications. To further tackle the uncertainty arising from the ambiguity of control images, we encourage broader…
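The two-stage "comprehend-then-generate" flow described in the abstract can be pictured with a minimal Python sketch. Everything named here is a hypothetical stand-in, not the authors' code: `comprehend_then_generate`, `toy_reasoner`, and `toy_generator` are illustrative names, the "reasoner" is a plain function standing in for an MLLM, and the "generator" is a stub standing in for a controllable image generator. The sketch only shows the data flow: the prompt is enriched from the control image first, and the enriched prompt is then passed, together with the same control image, to an unmodified generator.

```python
from typing import Any, Callable, Tuple

# Hypothetical interfaces, assumed for illustration only.
Reasoner = Callable[[str, Any], str]   # (prompt, control image) -> enriched prompt
Generator = Callable[[str, Any], Any]  # (prompt, control image) -> generated image


def comprehend_then_generate(
    prompt: str,
    control_image: Any,
    reasoner: Reasoner,
    generator: Generator,
) -> Tuple[str, Any]:
    """Enrich the prompt from the control image, then generate with it."""
    # Stage 1 (comprehend): mine semantics from the control image.
    enriched_prompt = reasoner(prompt, control_image)
    # Stage 2 (generate): the generator itself is left unchanged; it just
    # receives a richer prompt alongside the original control signal.
    image = generator(enriched_prompt, control_image)
    return enriched_prompt, image


def toy_reasoner(prompt: str, control_image: Any) -> str:
    # Stand-in for an MLLM: appends a note derived from the control input.
    return f"{prompt}, with details inferred from a {control_image} control image"


def toy_generator(prompt: str, control_image: Any) -> dict:
    # Stand-in for an image generator: records what it was conditioned on.
    return {"prompt": prompt, "control": control_image}


if __name__ == "__main__":
    enriched, image = comprehend_then_generate(
        "a cat", "canny-edge", toy_reasoner, toy_generator
    )
    print(enriched)
    print(image)
```

Keeping the reasoner and generator as separate callables mirrors the abstract's claim that the enriched prompt helps generation without further changes to the generator; how the actual system trains or prompts its MLLM is not specified in the text above.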

Code & Models

Repositories

maplebb/controlthinker
Official

Models

🤗 maplebb/ControlThinker

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection