Learning Visual Generative Priors without Text

Shuailei Ma; Kecheng Zheng; Ying Wei; Wei Wu; Fan Lu; Yifei Zhang; Chen-Wei Xie; Biao Gong; Jiapeng Zhu; Yujun Shen

arXiv:2412.07767 · cs.CV · March 25, 2025

TL;DR

This paper introduces Lumos, a self-supervised vision-based framework for image-to-image generation that learns visual priors without relying on expensive text-image pairs, outperforming some text-to-image models on certain tasks.

Contribution

Lumos demonstrates a scalable, pure vision-based training method for I2I models that serve as strong visual priors, reducing dependence on text-image data and outperforming T2I models in some tasks.

Findings

01. Lumos can learn I2I models from in-the-wild images in a self-supervised manner.

02. I2I models serve as better visual priors than T2I models for certain tasks.

03. I2I priors outperform T2I priors on text-irrelevant tasks like image-to-3D and image-to-video.

Abstract

Although text-to-image (T2I) models have recently thrived as visual generative priors, their reliance on high-quality text-image pairs makes scaling up expensive. We argue that grasping the cross-modality alignment is not a necessity for a sound visual generative prior, whose focus should be on texture modeling. Such a philosophy inspires us to study image-to-image (I2I) generation, where models can learn from in-the-wild images in a self-supervised manner. We first develop a pure vision-based training framework, Lumos, and confirm the feasibility and the scalability of learning I2I models. We then find that, as an upstream task of T2I, our I2I model serves as a more foundational visual prior and achieves on-par or better performance than existing T2I models using only 1/10 text-image pairs for fine-tuning. We further demonstrate the superiority of I2I priors over T2I priors on some…

Code & Models

Models

🤗 Xiaomabufei/lumos (model, ♡ 8)

Taxonomy

Topics: Multimodal Machine Learning Applications · AI-based Problem Solving and Planning · Educational Assessment and Pedagogy

Methods: Focus