PnP-U3D: Plug-and-Play 3D Framework Bridging Autoregression and Diffusion for Unified Understanding and Generation
Yongwei Chen, Tianyi Wei, Yushi Lan, Zhaoyang Lyu, Shangchen Zhou, Xudong Xu, Xingang Pan

TL;DR
This paper introduces PnP-U3D, a unified 3D framework that combines autoregression for understanding and diffusion for generation, enabling effective cross-modal interaction and achieving state-of-the-art results.
Contribution
The paper presents the first unified framework for 3D understanding and generation that integrates autoregression with diffusion, leveraging pretrained models to reduce training cost.
Findings
Achieves state-of-the-art performance on 3D benchmarks
Supports 3D editing tasks
Demonstrates strong cross-modal information exchange
Abstract
The rapid progress of large multimodal models has inspired efforts toward unified frameworks that couple understanding and generation. While such paradigms have shown remarkable success in 2D, extending them to 3D remains largely underexplored. Existing attempts to unify 3D tasks under a single autoregressive (AR) paradigm lead to significant performance degradation due to forced signal quantization and prohibitive training cost. Our key insight is that the essential challenge lies not in enforcing a unified autoregressive paradigm, but in enabling effective information interaction between generation and understanding while minimally compromising their inherent capabilities and leveraging pretrained models to reduce training cost. Guided by this perspective, we present the first unified framework for 3D understanding and generation that combines autoregression with diffusion.…
Taxonomy
Topics: 3D Shape Modeling and Analysis · Face recognition and analysis · Generative Adversarial Networks and Image Synthesis
