Learning Disentangled Representation in Object-Centric Models for Visual Dynamics Prediction via Transformers

Sanket Gandhi; Atul; Samanyu Mahajan; Vishal Sharma; Rushil Gupta; Arnab Kumar Mondal; Parag Singla

arXiv:2407.03216·cs.CV·July 4, 2024

Open Access

TL;DR

This paper introduces a novel unsupervised object-centric model using transformers that learns disentangled representations of objects in videos, improving dynamics prediction accuracy and generalization to unseen attribute combinations.

Contribution

It presents what the authors describe as the first general framework for learning disentangled object representations in videos without assumptions about the kinds of attributes an object might have, improving predictive accuracy and out-of-distribution performance.

Findings

1. Discovered semantically meaningful object blocks
2. Improved dynamics prediction accuracy over SOTA models
3. Achieved better OOD generalization to unseen attribute combinations

Abstract

Recent work has shown that object-centric representations can greatly improve the accuracy of learning dynamics while also bringing interpretability. In this work, we take this idea one step further and ask the following question: "Can learning disentangled representations further improve the accuracy of visual dynamics prediction in object-centric models?" While there have been some attempts to learn such disentangled representations for static images \citep{nsb}, to the best of our knowledge, ours is the first work that tries to do this in a general setting for video, without making any specific assumptions about the kinds of attributes that an object might have. The key building block of our architecture is the notion of a block, where several blocks together constitute an object. Each block is represented as a linear combination of a given number of learnable…
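The abstract's last two statements (an object is made up of several blocks, and each block is a linear combination of a fixed number of learnable items) can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated before it names what is being combined, so the name `basis`, the function names, and the way the coefficients are supplied are all hypothetical placeholders chosen for illustration.

```python
from typing import List

Vector = List[float]


def linear_combination(weights: Vector, basis: List[Vector]) -> Vector:
    """Return sum_k weights[k] * basis[k].

    `basis` is a hypothetical stand-in for the learnable items the
    abstract refers to; the source text is cut off before naming them.
    """
    if not basis or len(weights) != len(basis):
        raise ValueError("need one weight per basis vector")
    dim = len(basis[0])
    if any(len(v) != dim for v in basis):
        raise ValueError("all basis vectors must share one dimension")
    return [sum(w * v[d] for w, v in zip(weights, basis)) for d in range(dim)]


def build_object(block_weights: List[Vector],
                 block_bases: List[List[Vector]]) -> List[Vector]:
    """Represent one object as a list of blocks.

    Each block is a linear combination over its own basis (assumption:
    every block has a separate basis; the abstract does not say).
    """
    if len(block_weights) != len(block_bases):
        raise ValueError("need one weight vector per block")
    return [linear_combination(w, b)
            for w, b in zip(block_weights, block_bases)]


# Toy example: an object with two blocks, each mixing two 2-d vectors.
toy_bases = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
toy_weights = [[0.5, 0.5], [1.0, 0.0]]
toy_object = build_object(toy_weights, toy_bases)
```

In a trained model the basis vectors and the mechanism producing the coefficients would both be learned. Here they are fixed numbers so that the arithmetic is easy to follow.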

Taxonomy

Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition