JOG3R: Towards 3D-Consistent Video Generators

Chun-Hao Paul Huang; Niloy Mitra; Hyeonho Jeong; Jae Shin Yoon; Duygu Ceylan

arXiv:2501.01409 · cs.CV · March 28, 2025

TL;DR

This paper introduces JOG3R, a unified video generator that achieves 3D-consistency and realistic video frame generation by jointly training for video synthesis and camera pose estimation.

Contribution

It proposes the first unified model that combines 3D-aware camera pose estimation with high-quality video generation, improving 3D consistency in generated videos.

Findings

01. The unified model produces competitive camera pose estimates.

02. The generated videos are more 3D-consistent and realistic.

03. Joint training enhances both pose estimation and video quality.

Abstract

Emergent capabilities of image generators have led to many impactful zero- or few-shot applications. Inspired by this success, we investigate whether video generators similarly exhibit 3D-awareness. Using structure-from-motion as a 3D-aware task, we test if intermediate features of a video generator (OpenSora in our case) can support camera pose estimation. Surprisingly, at first, we only find a weak correlation between the two tasks. Deeper investigation reveals that although the video generator produces plausible video frames, the frames themselves are not truly 3D-consistent. Instead, we propose to jointly train for the two tasks, using photometric generation and 3D-aware errors. Specifically, we find that SoTA video generation and camera pose estimation (i.e., DUSt3R [79]) networks share common structures, and propose an architecture that unifies the two. The proposed unified…
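To make the idea of training jointly on a generation error and a 3D-aware error concrete, here is a minimal sketch of a weighted two-term objective. It is an illustration only, not the paper's implementation: the abstract does not give the loss formulas, so the squared-error generation term, the point-distance 3D term, the `weight_3d` parameter, and all function names are assumptions.

```python
import math

def generation_loss(pred, target):
    """Assumed generation term: mean squared error over flat lists of values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def pointmap_loss(pred_points, ref_points):
    """Assumed 3D-aware term: mean Euclidean distance between matching 3D points."""
    dists = [math.dist(p, r) for p, r in zip(pred_points, ref_points)]
    return sum(dists) / len(dists)

def joint_loss(pred, target, pred_points, ref_points, weight_3d=1.0):
    """Weighted sum of the two terms; weight_3d is a hypothetical balancing factor."""
    return generation_loss(pred, target) + weight_3d * pointmap_loss(pred_points, ref_points)

# Toy values: four generated values and two 3D points.
loss = joint_loss(
    pred=[0.0, 0.0, 0.0, 0.0],
    target=[1.0, 1.0, 1.0, 1.0],
    pred_points=[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    ref_points=[(3.0, 4.0, 0.0), (0.0, 0.0, 0.0)],
    weight_3d=0.5,
)
print(loss)
```

In a sketch like this, a single scalar drives both tasks, so gradients from the 3D term would also shape whatever features the generator shares with the pose branch.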

Taxonomy

Topics: Advanced Vision and Imaging · Image and Video Stabilization · Optical measurement and interference techniques

Methods: Attentive Walk-Aggregating Graph Neural Network