Unsupervised Learning of Disentangled Representations from Video

Remi Denton; Vighnesh Birodkar

arXiv:1705.10915 · cs.LG · March 15, 2024 · 228 cites

Open Access · 2 Repos

TL;DR

This paper introduces DrNET, an unsupervised model that learns disentangled video representations by separating stationary and dynamic parts, enabling future frame prediction and coherent video generation.

Contribution

The paper proposes a novel adversarial approach to learn disentangled representations from video, leveraging temporal coherence for improved future frame prediction.
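The two ingredients named above, temporal coherence and an adversarial loss, can be sketched as loss terms. This is a minimal NumPy illustration, not the authors' implementation: the function names, the mean-squared form of the coherence term, and the "drive the discriminator to 0.5" form of the encoder term are all assumptions made for the example. The sketch assumes a hypothetical discriminator that outputs the probability that two dynamic (pose) codes come from the same clip.

```python
import numpy as np

def similarity_loss(content_t, content_tk):
    """Temporal-coherence term (assumed form): the stationary codes of two
    frames from the same clip should be nearly identical."""
    return float(np.mean((content_t - content_tk) ** 2))

def discriminator_loss(p_same, p_diff, eps=1e-7):
    """Binary cross-entropy for a hypothetical discriminator that sees pairs of
    dynamic codes: target 1 for same-clip pairs, 0 for cross-clip pairs."""
    p_same = np.clip(p_same, eps, 1 - eps)
    p_diff = np.clip(p_diff, eps, 1 - eps)
    return float(-np.mean(np.log(p_same)) - np.mean(np.log(1 - p_diff)))

def encoder_confusion_loss(p_same, eps=1e-7):
    """Encoder-side adversarial term (assumed form): smallest when the
    discriminator outputs 0.5, i.e. when the dynamic codes carry no
    information about which clip they came from."""
    p = np.clip(p_same, eps, 1 - eps)
    return float(-0.5 * np.mean(np.log(p)) - 0.5 * np.mean(np.log(1 - p)))
```

Under these assumptions the encoder term reaches its minimum of log 2 when the discriminator is maximally uncertain, which is what pushes clip-identifying (stationary) information out of the dynamic code.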

Findings

01 Successfully disentangles stationary and dynamic components in videos

02 Enables accurate long-term future frame prediction

03 Demonstrates effectiveness on synthetic and real videos

Abstract

We present a new model, DrNET, that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary part and a temporally varying component. The disentangled representation can be used for a range of tasks. For example, applying a standard LSTM to the time-varying components enables prediction of future frames. We evaluate our approach on a range of synthetic and real videos, demonstrating the ability to coherently generate hundreds of steps into the future.
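The prediction step described in the abstract can be illustrated with a small rollout: the stationary code is held fixed while a standard LSTM predicts the time-varying code step by step. The sketch below uses NumPy with random weights; the gate ordering, the weight shapes, the `W_out` readout, and feeding the stationary code into the LSTM alongside the dynamic code are assumptions for illustration, not details taken from the paper. Decoding each predicted code back into a frame is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One standard LSTM step. W has shape (4*H, D+H); gates ordered i, f, o, g."""
    H = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c_new = f * c + i * g          # update the cell state
    h_new = o * np.tanh(c_new)     # expose the gated hidden state
    return h_new, c_new

def rollout_pose(content, pose0, W, b, W_out, steps):
    """Predict `steps` future dynamic codes while the stationary code stays fixed."""
    H = W.shape[0] // 4
    h, c = np.zeros(H), np.zeros(H)
    pose, poses = pose0, []
    for _ in range(steps):
        h, c = lstm_step(np.concatenate([content, pose]), h, c, W, b)
        pose = np.tanh(W_out @ h)  # hypothetical readout back to the dynamic-code space
        poses.append(pose)         # the prediction is fed back in on the next step
    return np.stack(poses)

# Usage with illustrative sizes: 8-d stationary code, 4-d dynamic code, 16 hidden units.
rng = np.random.default_rng(0)
content, pose0 = rng.normal(size=8), rng.normal(size=4)
W = rng.normal(size=(64, 8 + 4 + 16)) * 0.1
b = np.zeros(64)
W_out = rng.normal(size=(4, 16)) * 0.1
future = rollout_pose(content, pose0, W, b, W_out, steps=100)
```

Because only the low-dimensional dynamic code is predicted and fed back, the stationary code cannot drift over the rollout, which is one plausible reading of why generation stays coherent over many steps.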

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Adversarial Robustness in Machine Learning

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory