ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models

Ruishu Zhu; Zhihao Huang; Jiacheng Sun; Ping Luo; Hongyuan Zhang; Xuelong Li

arXiv:2512.14099 · cs.CV · March 16, 2026

Open Access

TL;DR

This paper introduces ViewMask-1-to-3, a discrete diffusion approach to multi-view image generation that represents language and vision in a shared token space. It outperforms the baseline on the GSO and 3D-FUTURE benchmarks without specialized architectures or 3D geometric priors.

Contribution

It formulates multi-view synthesis as a discrete sequence modeling problem using masked token prediction, enabling progressive multi-view generation with improved cross-view consistency.
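To make the sequence-modeling formulation concrete, here is a minimal sketch of how a masked-token training example might be assembled. It is not the authors' code: the `MASK` id, the `build_training_example` helper, the choice to mask only target-view tokens, and the per-token Bernoulli masking are all assumptions made for illustration. Token ids stand in for the visual tokens a tokenizer such as MAGVIT-v2 would produce.

```python
import random

MASK = -1  # hypothetical mask-token id (assumption, not from the paper)


def build_training_example(input_view, target_views, mask_ratio, seed=0):
    """Concatenate view tokens into one sequence and randomly mask target tokens.

    input_view:   list of token ids for the conditioning view (left unmasked here)
    target_views: list of token-id lists, one per view to be generated
    mask_ratio:   probability of masking each target token
    Returns (sequence, labels); labels hold the original token at masked
    positions and None elsewhere, so a loss would apply only to masked slots.
    """
    rng = random.Random(seed)
    sequence = list(input_view)
    labels = [None] * len(input_view)
    for view in target_views:
        for tok in view:
            if rng.random() < mask_ratio:
                sequence.append(MASK)
                labels.append(tok)
            else:
                sequence.append(tok)
                labels.append(None)
    return sequence, labels


if __name__ == "__main__":
    seq, labels = build_training_example([5, 6], [[1, 2], [3, 4]], mask_ratio=0.5)
    print(seq)
    print(labels)
```

Because all views sit in one sequence, a self-attention model predicting a masked token can attend to the unmasked tokens of every other view, which is the mechanism the paper credits for cross-view consistency.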

Findings

01 Outperforms baseline on GSO and 3D-FUTURE benchmarks

02 Ranks first on average across standard image metrics

03 Improves IoU by 10.6% on 3D-FUTURE
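For readers unfamiliar with the IoU metric cited above, the following is a minimal sketch of intersection-over-union on binary masks. The page does not say how the paper computes IoU, so the assumption that it compares foreground masks of generated and ground-truth views, and the `mask_iou` helper itself, are illustrative only.

```python
def mask_iou(pred, truth):
    """Intersection-over-union of two binary masks given as nested lists."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks are treated as a perfect match (a convention chosen here).
    return inter / union if union else 1.0


if __name__ == "__main__":
    pred = [[1, 1, 0],
            [0, 1, 0]]
    truth = [[1, 0, 0],
             [0, 1, 1]]
    print(mask_iou(pred, truth))  # 2 overlapping pixels / 4 covered pixels = 0.5
```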

Abstract

Motivated by discrete diffusion's success in language-vision modeling, we explore its potential for multi-view generation, a task dominated by continuous approaches. We introduce ViewMask-1-to-3, formulating multi-view synthesis as a discrete sequence modeling problem where each viewpoint is represented as visual tokens from MAGVIT-v2. Through masked token prediction, our approach enables progressive multi-view generation via iterative token unmasking, unifying language and vision in a shared token space. Importantly, simple random masking combined with self-attention naturally encourages cross-view consistency without specialized architectures or 3D geometric priors. Our method outperforms the baseline on the GSO and 3D-FUTURE benchmarks, ranking first on average across standard image metrics and improving IoU by 10.6% on 3D-FUTURE. This validates discrete diffusion as a promising…
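The iterative token unmasking described in the abstract can be sketched as follows. This is a sketch under stated assumptions, not the paper's sampler: the cosine schedule and the confidence-ranked reveal order are borrowed from common masked-generation practice, `toy_predictor` is a random stand-in for the trained model, and `MASK`, `iterative_unmask`, and all parameter names are hypothetical.

```python
import math
import random

MASK = -1  # hypothetical mask-token id (assumption)


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a trained model: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_unmask(cond_tokens, num_target, vocab_size=16, steps=4, seed=0):
    """Start from a fully masked target view and reveal tokens over several steps."""
    rng = random.Random(seed)
    target = [MASK] * num_target
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(target) if t == MASK]
        if not masked:
            break
        # Assumed cosine schedule: number of target tokens left masked after this step.
        keep_masked = int(num_target * math.cos(math.pi / 2 * step / steps))
        n_reveal = max(1, len(masked) - keep_masked)
        # Predict over the full sequence, then keep only the target-view positions.
        preds = toy_predictor(cond_tokens + target, vocab_size, rng)[len(cond_tokens):]
        # Reveal the masked positions where the predictor is most confident.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_reveal]:
            target[i] = preds[i][0]
    return target


if __name__ == "__main__":
    print(iterative_unmask(cond_tokens=[1, 2, 3], num_target=8))
```

On the final step the schedule reaches zero, so every remaining masked token is filled. In a real system the predictor would be the trained model attending jointly over the text, input-view, and target-view tokens.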

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Computer Graphics and Visualization Techniques