FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching

Junchao Yi; Rui Zhao; Jiahao Tang; Weixian Lei; Linjie Li; Qisheng Su; Zhengyuan Yang; Lijuan Wang; Xiaofeng Zhu; Alex Jinpeng Wang

arXiv:2604.06757 · cs.CV · May 15, 2026



TL;DR

FlowInOne introduces a unified, vision-centric framework for multimodal generation by reformulating all modalities as visual flows, simplifying the pipeline and achieving state-of-the-art results.

Contribution

FlowInOne unifies multimodal generation in a single visual flow model, eliminating cross-modal alignment bottlenecks and bringing text-to-image generation, layout-guided editing, and visual instruction following under one paradigm.

Findings

01. Achieves state-of-the-art performance across multiple generation tasks.

02. Outperforms open-source and commercial models in unified multimodal generation.

03. Introduces the VisPrompt-5M dataset and VP-Bench for evaluation.

Abstract

Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of…
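
The image-in, image-out idea described in the abstract can be illustrated with a small sketch. It assumes a straight-line interpolation path between the visual prompt and the target image, which is a common choice for flow matching but is not stated in this excerpt. The function names, the array shapes, and the `oracle` velocity field are all hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point at time t in [0, 1] on the straight path from source x0 to target x1."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t (assumed path, not from the paper)."""
    return x1 - x0

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between the predicted and the target velocity at time t."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    return float(np.mean((pred - target_velocity(x0, x1)) ** 2))

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler updates."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
visual_prompt = rng.random((8, 8, 3))  # hypothetical stand-in for a rendered visual prompt
target_image = rng.random((8, 8, 3))   # hypothetical stand-in for the desired output image

# Oracle velocity field for this single pair; a trained network would take its place.
oracle = lambda x, t: target_image - visual_prompt

loss = flow_matching_loss(oracle, visual_prompt, target_image, t=0.3)
output = euler_sample(oracle, visual_prompt, num_steps=10)
```

In this sketch the flow starts from the input image rather than from Gaussian noise, so no noise schedule appears anywhere, which is one way to read the abstract's claim. Whether FlowInOne operates in pixel space or in a latent space is not specified in this excerpt; the arrays above stand in for either.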


Code & Models

Models

CSU-JPG/FlowInOne
model · 4 dl · ♡ 15
