Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

Yaofang Liu, Kangning Cui, Meng Chu, Zhaoqing Li, Suiyun Zhang, Jean-Michel Morel, Xiaodong Cun, Haoxuan Che, Rui Liu, and Raymond H. Chan

arXiv:2605.12271·cs.CV·May 13, 2026

TL;DR

This paper introduces V2V-Zero, a visual-to-visual generation framework that conditions models on visual specifications instead of text, enabling more precise and flexible image creation without fine-tuning.

Contribution

It presents a training-free method that adds visual conditioning to existing generators conditioned on a vision-language model (VLM), so the same frozen models can be driven by a visual specification page rather than a text prompt.

Findings

1. V2V-Zero achieves high performance on GenEval, close to optimized text-to-image models.

2. The Simple-V2V Bench reveals strengths in attribute binding but challenges in content reliability and structural control.

3. The interface transfers effectively to video extension, demonstrating versatility.

Abstract

Humans often specify and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text, a bottleneck that compresses signals like spatial structure, exact appearance, and glyph shape. We propose visual-to-visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt. The page is not an edit target, but a visual document that specifies the desired output. We introduce V2V-Zero, a training-free framework that exposes this interface in existing vision-language model (VLM) conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages, exploiting the fact that the frozen VLM already maps both text and images into the…
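The conditioning swap described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `ToyFrozenVLM`, `encode_text`, `encode_page`, and `toy_generator` are hypothetical stand-ins invented here, and the "hidden states" are fixed projections of simple scalar features. The only point it tries to show is that a generator consuming a sequence of hidden states does not need to know whether they came from a text prompt or from a visual page, provided one frozen encoder maps both into the same space.

```python
import random

HIDDEN_DIM = 8  # assumed toy width; real VLM hidden sizes are far larger


class ToyFrozenVLM:
    """Hypothetical stand-in for a frozen VLM with one shared hidden space."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Fixed ("frozen") projection shared by both modalities.
        self._proj = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_DIM)]

    def _hidden(self, feature):
        # Fake "final-layer hidden state": a fixed projection of one scalar.
        return [feature * w for w in self._proj]

    def encode_text(self, prompt):
        # One hidden state per whitespace-separated token.
        return [self._hidden(len(tok) / 10.0) for tok in prompt.split()]

    def encode_page(self, page, patch=2):
        # page: 2D list of grayscale values in [0, 1].
        # One hidden state per patch, using the patch's mean intensity.
        rows, cols = len(page), len(page[0])
        states = []
        for r in range(0, rows, patch):
            for c in range(0, cols, patch):
                block = [
                    page[i][j]
                    for i in range(r, min(r + patch, rows))
                    for j in range(c, min(c + patch, cols))
                ]
                states.append(self._hidden(sum(block) / len(block)))
        return states


def toy_generator(cond_states):
    """Hypothetical generator: sees only hidden states, not their modality."""
    n = len(cond_states)
    # Mean-pool the conditioning sequence as a placeholder for generation.
    return [sum(s[d] for s in cond_states) / n for d in range(HIDDEN_DIM)]


vlm = ToyFrozenVLM()

# Usual path: text prompt -> hidden states -> generator.
text_cond = vlm.encode_text("a red cube on a blue sphere")

# V2V-style path: visual page -> hidden states -> same generator, no retraining.
page = [
    [0.0, 0.2, 0.9, 1.0],
    [0.1, 0.3, 0.8, 0.7],
    [0.5, 0.5, 0.4, 0.4],
    [0.6, 0.6, 0.2, 0.1],
]
page_cond = vlm.encode_page(page)

out_from_text = toy_generator(text_cond)
out_from_page = toy_generator(page_cond)
```

Both calls to `toy_generator` go through the same untouched function, which mirrors the training-free framing: only the source of the conditioning sequence changes, not the generator or the encoder weights.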
