JOG3R: Towards 3D-Consistent Video Generators
Chun-Hao Paul Huang, Niloy Mitra, Hyeonho Jeong, Jae Shin Yoon, Duygu Ceylan

TL;DR
This paper introduces JOG3R, a unified video generator that achieves 3D-consistency and realistic video frame generation by jointly training for video synthesis and camera pose estimation.
Contribution
It proposes the first unified model that combines 3D-aware camera pose estimation with high-quality video generation, improving 3D consistency in generated videos.
Findings
The unified model produces competitive camera pose estimates.
The generated videos are more 3D-consistent and realistic.
Joint training enhances both pose estimation and video quality.
Abstract
Emergent capabilities of image generators have led to many impactful zero- or few-shot applications. Inspired by this success, we investigate whether video generators similarly exhibit 3D-awareness. Using structure-from-motion as a 3D-aware task, we test if intermediate features of a video generator (OpenSora in our case) can support camera pose estimation. Surprisingly, at first, we only find a weak correlation between the two tasks. Deeper investigation reveals that although the video generator produces plausible video frames, the frames themselves are not truly 3D-consistent. Instead, we propose to jointly train for the two tasks, using photometric generation and 3D-aware errors. Specifically, we find that SoTA video generation and camera pose estimation (i.e., DUSt3R [79]) networks share common structures, and propose an architecture that unifies the two. The proposed unified…
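The joint-training idea in the abstract can be illustrated with a minimal sketch of a combined objective. This is not the paper's implementation: the abstract does not specify the loss forms, so mean squared error stands in for both the photometric generation error and the 3D-aware pose error, and the names `mse`, `joint_loss`, and `pose_weight` are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(gen_pred, gen_target, pose_pred, pose_target, pose_weight=1.0):
    """Weighted sum of a generation term and a pose term.

    Assumption: both terms are plain MSE and are balanced by a single
    scalar weight. The actual losses and weighting used by JOG3R are
    not given in this summary.
    """
    gen_term = mse(gen_pred, gen_target)      # stand-in for the photometric generation error
    pose_term = mse(pose_pred, pose_target)   # stand-in for the 3D-aware (camera pose) error
    return gen_term + pose_weight * pose_term


# Toy usage: flattened pixel values and a single pose parameter.
loss = joint_loss([1.0, 2.0], [1.0, 4.0], [0.0], [1.0], pose_weight=0.5)
```

Under these assumptions, a single scalar loss lets one optimizer step update shared weights for both tasks, which is the general mechanism by which joint training can couple video generation with pose estimation.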
Taxonomy
Topics: Advanced Vision and Imaging · Image and Video Stabilization · Optical measurement and interference techniques
Methods: Attentive Walk-Aggregating Graph Neural Network
