Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
Peng Sun, Jun Xie, Tao Lin

TL;DR
This paper introduces IOMM, a two-stage, image-only pre-training framework for UMM visual generation that enhances efficiency and performance by reducing reliance on paired data and utilizing unlabeled images.
Contribution
Proposes IOMM, a novel two-stage training method that pre-trains visual generative components exclusively on unlabeled images, improving efficiency and achieving state-of-the-art results.
Findings
IOMM-B trained with 1050 GPU hours surpasses baselines.
Pre-training on unlabeled images reduces data dependency.
Achieves top performance on GenEval and WISE benchmarks.
Abstract
Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generative components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose IOMM, a data-efficient two-stage training framework. The first stage pre-trains the visual generative component using abundant unlabeled image-only data, thereby removing the dependency on paired data. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality.…
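As a rough illustration of the two-stage data schedule described in the abstract, the sketch below draws training batches from unlabeled images only in stage 1 and from a mixture of unlabeled images and text-image pairs in stage 2. It is not the authors' implementation: the function name `sample_batch`, the `pair_fraction` parameter, and its default value are assumptions made for this example, since the abstract does not state the mixing ratio.

```python
import random


def sample_batch(stage, unlabeled_images, paired_data, batch_size,
                 pair_fraction=0.25, rng=None):
    """Return a list of (image, caption_or_None) training examples.

    stage 1: every example is an unlabeled image (caption is None).
    stage 2: each example is a text-image pair with probability
             pair_fraction (a hypothetical value), else an unlabeled image.
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    rng = rng or random.Random()
    batch = []
    for _ in range(batch_size):
        use_pair = stage == 2 and rng.random() < pair_fraction
        if use_pair:
            image, caption = rng.choice(paired_data)
            batch.append((image, caption))
        else:
            batch.append((rng.choice(unlabeled_images), None))
    return batch


# Toy stand-ins for real data: strings instead of image tensors.
images = ["img_a", "img_b", "img_c"]
pairs = [("img_x", "a red cube"), ("img_y", "two cats on a sofa")]

stage1_batch = sample_batch(1, images, pairs, batch_size=4, rng=random.Random(0))
stage2_batch = sample_batch(2, images, pairs, batch_size=4, rng=random.Random(0))
```

In stage 1 the paired data is never touched, which mirrors the claim that pre-training needs no text-image pairs; in stage 2 the share of paired examples is controlled by the assumed `pair_fraction`.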
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Historical Architecture and Urbanism
