Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Peng Sun; Jun Xie; Tao Lin

arXiv:2603.16139·cs.CV·March 18, 2026

TL;DR

This paper introduces IOMM, a two-stage, image-only pre-training framework for UMM visual generation that enhances efficiency and performance by reducing reliance on paired data and utilizing unlabeled images.

Contribution

Proposes IOMM, a two-stage training method that pre-trains the visual generative components exclusively on unlabeled images, improving training efficiency and achieving state-of-the-art results.

Findings

01 IOMM-B trained with 1050 GPU hours surpasses baselines.

02 Pre-training on unlabeled images reduces data dependency.

03 Achieves top performance on GenEval and WISE benchmarks.

Abstract

Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality.…
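The two-stage data schedule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `sample_batch`, the `pair_fraction` parameter and its value, and the placeholder file names are all assumptions made for illustration. It only shows how stage one draws batches from unlabeled images alone, while stage two mixes in a small share of text-image pairs.

```python
import random


def sample_batch(stage, images, pairs, batch_size, pair_fraction=0.25, rng=None):
    """Draw one batch as a list of (image, caption_or_None) tuples.

    stage 1: image-only data, so every caption is None.
    stage 2: a mixture; about `pair_fraction` of the batch are text-image pairs.
    `pair_fraction` is a hypothetical knob, not a value taken from the paper.
    """
    rng = rng or random.Random()
    if stage == 1:
        return [(rng.choice(images), None) for _ in range(batch_size)]
    if stage == 2:
        n_pairs = round(batch_size * pair_fraction)
        batch = [rng.choice(pairs) for _ in range(n_pairs)]
        batch += [(rng.choice(images), None) for _ in range(batch_size - n_pairs)]
        rng.shuffle(batch)  # interleave paired and unpaired samples
        return batch
    raise ValueError(f"unknown stage: {stage}")


# Placeholder data: many unlabeled images, a small curated set of pairs.
images = [f"img_{i}.png" for i in range(100)]
pairs = [(f"pair_{i}.png", f"caption {i}") for i in range(10)]

rng = random.Random(0)
stage1_batch = sample_batch(1, images, pairs, batch_size=8, rng=rng)
stage2_batch = sample_batch(2, images, pairs, batch_size=8, rng=rng)
```

In a real pipeline the tuples would feed a generative training loss; the point here is only that paired data never appears in stage one and stays a minority in stage two.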

Code & Models

Models

Hugging Face: StormyX/IOMM

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Historical Architecture and Urbanism