A Disentangled Recognition and Nonlinear Dynamics Model for Unsupervised Learning
Marco Fraccaro, Simon Kamronn, Ulrich Paquet, Ole Winther

TL;DR
This paper introduces the Kalman variational auto-encoder, which learns disentangled latent representations of objects and their dynamics in videos, so that the evolution of a scene can be imagined and missing data imputed in latent space without generating high-dimensional frames at each time step.
Contribution
An unsupervised model for sequential data that separates an object's representation, produced by a recognition model, from a latent state describing its dynamics, and is trained end-to-end on videos.
Findings
Outperforms competing methods in generative tasks
Outperforms competing methods in missing data imputation
Evaluated on videos of a variety of simulated physical systems
Abstract
This paper takes a step towards temporal reasoning in a dynamically changing video, not in the pixel space that constitutes its frames, but in a latent space that describes the non-linear dynamics of the objects in its world. We introduce the Kalman variational auto-encoder, a framework for unsupervised learning of sequential data that disentangles two latent representations: an object's representation, coming from a recognition model, and a latent state describing its dynamics. As a result, the evolution of the world can be imagined and missing data imputed, both without the need to generate high dimensional frames at each time step. The model is trained end-to-end on videos of a variety of simulated physical systems, and outperforms competing methods in generative and missing data imputation tasks.
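The idea of imputing missing data in latent space, rather than in pixel space, can be illustrated with a plain Kalman filter. The following is a minimal sketch and not the authors' implementation: it assumes a fixed linear-Gaussian state space model with hand-picked matrices `A`, `C`, `Q`, `R`, and it treats the low-dimensional encodings `a` as given, whereas in the paper they come from a learned recognition model and the dynamics are learned as well. All names here (`kalman_filter`, `mask`, `imputed`) are hypothetical.

```python
import numpy as np

def kalman_filter(a, mask, A, C, Q, R, mu0, Sigma0):
    """Filter encodings a[t]; where mask[t] is False the update is skipped,
    so the state is carried forward by the dynamics alone."""
    T, n = a.shape[0], mu0.shape[0]
    mus = np.zeros((T, n))
    Sigmas = np.zeros((T, n, n))
    mu, Sigma = mu0.copy(), Sigma0.copy()
    for t in range(T):
        if t > 0:
            # Predict: propagate mean and covariance through the dynamics.
            mu = A @ mu
            Sigma = A @ Sigma @ A.T + Q
        if mask[t]:
            # Update: correct the prediction with the observed encoding.
            S = C @ Sigma @ C.T + R
            K = Sigma @ C.T @ np.linalg.inv(S)
            mu = mu + K @ (a[t] - C @ mu)
            Sigma = (np.eye(n) - K @ C) @ Sigma
        mus[t], Sigmas[t] = mu, Sigma
    return mus, Sigmas

# Toy data (an assumption for illustration): a 1-D encoding of an object
# moving at constant velocity, with a gap of five unobserved time steps.
rng = np.random.default_rng(0)
T = 20
A = np.array([[1.0, 1.0], [0.0, 1.0]])   # state = [position, velocity]
C = np.array([[1.0, 0.0]])               # the encoding observes position only
Q = 1e-4 * np.eye(2)                     # process noise covariance
R = np.array([[0.05 ** 2]])              # observation noise covariance
truth = 0.5 * np.arange(T)
a = (truth + 0.05 * rng.standard_normal(T))[:, None]
mask = np.ones(T, dtype=bool)
mask[10:15] = False                      # steps 10..14 are missing

mus, Sigmas = kalman_filter(a, mask, A, C, Q, R, np.zeros(2), np.eye(2))
imputed = (mus @ C.T)[:, 0]              # encodings implied by the filtered state
```

Inside the gap the filter only runs the predict step, so `imputed[10:15]` is an extrapolation from the estimated velocity and the position variance in `Sigmas` grows until observations resume. No frame is generated at any point; a decoder would be needed only if the imputed frames themselves were wanted.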
Taxonomy
Topics: Advanced Image Processing Techniques · Gaussian Processes and Bayesian Inference · Model Reduction and Neural Networks
