Learning Disentangled Representation in Object-Centric Models for Visual Dynamics Prediction via Transformers
Sanket Gandhi, Atul, Samanyu Mahajan, Vishal Sharma, Rushil Gupta, Arnab Kumar Mondal, Parag Singla

TL;DR
This paper introduces a novel unsupervised object-centric model using transformers that learns disentangled representations of objects in videos, improving dynamics prediction accuracy and generalization to unseen attribute combinations.
Contribution
It presents the first general framework for learning disentangled object representations in videos without attribute assumptions, enhancing predictive accuracy and out-of-distribution performance.
Findings
Discovered semantically meaningful object blocks
Improved dynamics prediction accuracy over SOTA models
Achieved better OOD generalization in attribute combinations
Abstract
Recent work has shown that object-centric representations can greatly improve the accuracy of learning dynamics while also bringing interpretability. In this work, we take this idea one step further and ask the following question: "Can learning disentangled representations further improve the accuracy of visual dynamics prediction in object-centric models?" While there have been some attempts to learn such disentangled representations for static images \citep{nsb}, to the best of our knowledge, ours is the first work that attempts this in a general setting for video, without making any specific assumptions about the kind of attributes an object might have. The key building block of our architecture is the notion of a block, where several blocks together constitute an object. Each block is represented as a linear combination of a given number of learnable…
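The block idea in the abstract can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the abstract is truncated before it says what the learnable quantities are, so the code simply treats them as generic vectors. The function name `build_object`, the array sizes, the use of a separate set of vectors per block, and the softmax normalisation of the mixing weights are all hypothetical choices made for the example.

```python
import numpy as np

def build_object(weights, basis):
    """Compose one object representation from its blocks.

    weights: (num_blocks, num_vectors) mixing coefficients, one row per block.
    basis:   (num_blocks, num_vectors, block_dim) vectors for each block.
             In a real model these would be learned; here they are fixed arrays.
    Returns (blocks, obj), where blocks has shape (num_blocks, block_dim) and
    obj is the concatenation of all blocks into a single vector.
    """
    # Each block is a linear combination of that block's vectors.
    blocks = np.einsum("bk,bkd->bd", weights, basis)
    # Several blocks together constitute the object.
    return blocks, blocks.reshape(-1)

rng = np.random.default_rng(0)
num_blocks, num_vectors, block_dim = 3, 4, 5  # hypothetical sizes
basis = rng.normal(size=(num_blocks, num_vectors, block_dim))
logits = rng.normal(size=(num_blocks, num_vectors))
# Assumption: weights are normalised with a softmax; the abstract does not say so.
weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

blocks, obj = build_object(weights, basis)
print(blocks.shape, obj.shape)
```

With these sizes the object is a 15-dimensional vector made of three 5-dimensional blocks. Changing one row of `weights` changes only the corresponding block, which is the sense in which such a representation could keep attributes separate.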
Taxonomy
Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
