Video Autoencoder: self-supervised disentanglement of static 3D structure and motion

Zihang Lai; Sifei Liu; Alexei A. Efros; Xiaolong Wang

arXiv:2110.02951·cs.CV·October 7, 2021

Open Access

TL;DR

This paper introduces a self-supervised video autoencoder that disentangles 3D scene structure and camera motion from videos, enabling tasks like novel view synthesis and pose estimation without ground truth annotations.

Contribution

The paper presents a self-supervised method for disentangling 3D structure and camera motion in videos: a deep autoencoder trained with only a pixel reconstruction loss.

Findings

01. Effective disentanglement of 3D structure and camera pose.

02. Successful application to view synthesis and pose estimation.

03. Good generalization to out-of-domain videos.

Abstract

A video autoencoder is proposed for learning disentangled representations of 3D structure and camera pose from videos in a self-supervised manner. Relying on temporal continuity in videos, our work assumes that the 3D scene structure in nearby video frames remains static. Given a sequence of video frames as input, the video autoencoder extracts a disentangled representation of the scene including: (i) a temporally-consistent deep voxel feature to represent the 3D structure and (ii) a 3D trajectory of camera pose for each frame. These two representations will then be re-entangled for rendering the input video frames. This video autoencoder can be trained directly using a pixel reconstruction loss, without any ground truth 3D or camera pose annotations. The disentangled representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video…
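
The re-entangling step and the reconstruction loss described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the encoders are omitted, the voxel grid holds raw random values rather than learned deep features, the pose is a hypothetical integer (dy, dx) shift standing in for a full 3D camera transform, and the names `render` and `reconstruction_loss` are assumptions made for illustration. It only shows that one static structure combined with per-frame poses reproduces every frame, and that the pixel error rises when the poses are wrong.

```python
import numpy as np

def render(voxels, pose):
    """Toy re-entangling step: move the static voxel grid by a pose, then project.

    `pose` is a hypothetical integer (dy, dx) shift, a stand-in for a real
    3D camera transform. The projection is a plain average along depth.
    """
    dy, dx = pose
    moved = np.roll(voxels, shift=(dy, dx), axis=(1, 2))
    return moved.mean(axis=0)

def reconstruction_loss(voxels, poses, frames):
    """Mean absolute pixel error between rendered frames and observed frames."""
    rendered = np.stack([render(voxels, p) for p in poses])
    return float(np.abs(rendered - frames).mean())

rng = np.random.default_rng(0)
voxels = rng.random((4, 8, 8))          # one static structure shared by all frames
true_poses = [(0, 0), (0, 1), (1, 2)]   # one pose per frame (the "trajectory")
frames = np.stack([render(voxels, p) for p in true_poses])

# The correct structure and trajectory reproduce the frames exactly;
# a trajectory that ignores camera motion does not.
loss_true = reconstruction_loss(voxels, true_poses, frames)
loss_wrong = reconstruction_loss(voxels, [(0, 0)] * 3, frames)
print(loss_true, loss_wrong > loss_true)
```

In a trained model, both the voxel feature and the poses would be predicted by networks from the frames, and this pixel error would be the signal used to update them; no 3D or pose labels enter the loss.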

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image Processing Techniques · Generative Adversarial Networks and Image Synthesis