Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
Rong Wang, Ruyi Zha, Ziang Cheng, Jiayu Yang, Pulak Purkait, Hongdong Li

TL;DR
This paper introduces a new method for generating realistic and consistent orbital videos from a single image by leveraging 3D shape priors from a foundational generative model, improving long-range view synthesis.
Contribution
It proposes integrating 3D shape priors via a multi-scale 3D adapter into video generation, enhancing shape realism and view consistency over prior pixel-wise attention methods.
Findings
Outperforms state-of-the-art methods in visual quality and shape realism.
Achieves superior multi-view consistency and generalization to complex trajectories.
Effectively models complete object shapes without explicit mesh extraction.
Abstract
We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such a mechanism does not impose sufficient constraints for long-range extrapolation, e.g., rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised…
