ViPro-2: Unsupervised State Estimation via Integrated Dynamics for Guiding Video Prediction
Patrick Takenaka, Johannes Maucher, Marco F. Huber

TL;DR
ViPro-2 estimates symbolic states from video observations without supervision by integrating known dynamics into the model, removing ViPro's reliance on a ground truth initial state. It also extends the Orbits dataset with a 3D variant to move closer to real-world conditions.
Contribution
The paper presents improvements to ViPro that enable unsupervised state inference from observations without a full ground truth initial state, and extends the Orbits dataset with a 3D variant for more realistic evaluation.
Findings
Successful unsupervised state inference from observations.
Enhanced model performance on extended 3D dataset.
Overcame the shortcut learning found in ViPro, where the predicted symbolic state was not actually connected to the observed environment.
Abstract
Predicting future video frames is a challenging task with many downstream applications. Previous work has shown that procedural knowledge enables deep models for complex dynamical settings; however, that model, ViPro, assumed a given ground truth initial symbolic state. We show that this approach led the model to learn a shortcut that does not actually connect the observed environment with the predicted symbolic state, leaving it unable to estimate states from an observation when previous states are noisy. In this work, we add several improvements to ViPro that enable the model to correctly infer states from observations without being given a full ground truth state at the beginning. We show that this is possible in an unsupervised manner, and extend the original Orbits dataset with a 3D variant to close the gap to real-world scenarios.
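The failure mode described in the abstract can be illustrated with a toy example. The sketch below is not the ViPro-2 architecture: it assumes a 1D constant-velocity system, treats the observation as a direct position measurement, and uses a hand-picked correction gain, all of which are hypothetical choices made for illustration. It only shows why a model that propagates the previous state through known dynamics cannot recover from a noisy earlier state, whereas one that ties the state to each observation can.

```python
def step(state, dt=0.1):
    """Known (procedural) dynamics: constant-velocity motion. Hypothetical toy system."""
    pos, vel = state
    return (pos + vel * dt, vel)


def rollout(initial_state, observations, gain=0.0, dt=0.1):
    """Integrate the dynamics and optionally pull the position toward each observation.

    gain=0.0 ignores the observations entirely (the state depends only on the
    previous state); a gain in (0, 1] blends each observation into the state.
    Returns the absolute position error after every step.
    """
    state = initial_state
    errors = []
    for obs in observations:
        pos, vel = step(state, dt)          # predict with the known dynamics
        pos = pos + gain * (obs - pos)      # correct using the observation
        state = (pos, vel)
        errors.append(abs(obs - pos))
    return errors


# Ground truth trajectory: start at position 0 with velocity 1.
true_state = (0.0, 1.0)
observations = []
for _ in range(20):
    true_state = step(true_state)
    observations.append(true_state[0])

# Start both rollouts from a noisy initial state (position offset of 0.5).
noisy_init = (0.5, 1.0)
open_loop = rollout(noisy_init, observations, gain=0.0)   # never looks at observations
corrected = rollout(noisy_init, observations, gain=0.5)   # grounded in observations

print(f"final error without correction: {open_loop[-1]:.4f}")
print(f"final error with correction:    {corrected[-1]:.2e}")
```

Without correction the initial offset persists for the whole rollout, while with correction the error shrinks at every step. In the paper's setting the mapping from video frames to symbolic state is learned without supervision rather than given, which this toy example does not attempt to model.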
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Human Pose and Action Recognition · Advanced Vision and Imaging
