Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
Yaofang Liu, Kangning Cui, Meng Chu, Zhaoqing Li, Suiyun Zhang, Jean-Michel Morel, Xiaodong Cun, Haoxuan Che, Rui Liu, and Raymond H. Chan

TL;DR
This paper introduces V2V-Zero, a visual-to-visual generation framework that conditions models on visual specifications instead of text, enabling more precise and flexible image creation without fine-tuning.
Contribution
The paper presents a training-free method for adding visual conditioning to existing generators that are conditioned on a vision-language model (VLM), so that a visual page can take the place of a text prompt.
Findings
On GenEval, V2V-Zero scores close to optimized text-to-image models.
The Simple-V2V Bench reveals strengths in attribute binding but challenges in content reliability and structural control.
The interface transfers effectively to video extension, demonstrating versatility.
Abstract
Humans often specify and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text, a bottleneck that compresses signals like spatial structure, exact appearance, and glyph shape. We propose visual-to-visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt. The page is not an edit target, but a visual document that specifies the desired output. We introduce V2V-Zero, a training-free framework that exposes this interface in existing vision-language model (VLM) conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages, exploiting the fact that the frozen VLM already maps both text and images into the…
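To make the conditioning swap described in the abstract concrete, here is a minimal toy sketch. It is not the authors' implementation: `ToyFrozenVLM`, `toy_generator`, the hidden width, and the patch size are all hypothetical stand-ins chosen for illustration. It only shows the general idea that a frozen encoder can emit final-layer hidden states of the same width for text tokens and for image patches, so a generator that consumes a conditioning sequence can accept either one without retraining.

```python
import numpy as np

HIDDEN = 16  # hypothetical hidden width shared by both modalities


class ToyFrozenVLM:
    """Hypothetical stand-in for a frozen VLM; weights are random and never updated."""

    def __init__(self, vocab_size=100, patch_dim=12, seed=0):
        rng = np.random.default_rng(seed)
        self.token_embed = rng.normal(size=(vocab_size, HIDDEN))
        self.patch_proj = rng.normal(size=(patch_dim, HIDDEN))
        self.layer = rng.normal(size=(HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)

    def _final_hidden(self, x):
        # One shared "final layer" applied to both modalities.
        return np.tanh(x @ self.layer)

    def encode_text(self, token_ids):
        # Text path: token ids -> embeddings -> final hidden states.
        return self._final_hidden(self.token_embed[np.asarray(token_ids)])

    def encode_page(self, image, patch=2):
        # Visual path: split the page into non-overlapping patches,
        # project each patch, then apply the same final layer.
        h, w, c = image.shape
        patches = (
            image.reshape(h // patch, patch, w // patch, patch, c)
            .transpose(0, 2, 1, 3, 4)
            .reshape(-1, patch * patch * c)
        )
        return self._final_hidden(patches @ self.patch_proj)


def toy_generator(cond):
    """Hypothetical generator: depends only on a (seq_len, HIDDEN) conditioning sequence."""
    return cond.mean(axis=0)


vlm = ToyFrozenVLM()
text_cond = vlm.encode_text([3, 14, 15])                # text-prompt conditioning
page = np.random.default_rng(1).random((8, 8, 3))       # toy "visual specification page"
page_cond = vlm.encode_page(page)                       # visual-page conditioning

# The generator is unchanged; only the source of the conditioning differs.
out_from_text = toy_generator(text_cond)
out_from_page = toy_generator(page_cond)
```

In this toy, the two conditioning sequences differ in length but share the hidden width, which is the only property the stand-in generator relies on. How the actual framework extracts and feeds hidden states is not specified in the text above.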
