Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction
Chengmin Gao, Bin Li

TL;DR
This paper introduces a time-conditioned generative model for videos that improves object shape reconstruction and novel view prediction by disentangling object and view representations, using Transformers and Gaussian processes.
Contribution
It presents a novel approach combining Transformers and Gaussian processes to enhance object-centric video modeling without requiring viewpoint annotations.
Findings
Accurately reconstructs complete object shapes even when occluded.
Enables novel view prediction without explicit viewpoint labels.
Demonstrates superior performance on multiple datasets.
Abstract
When perceiving the world from multiple viewpoints, humans have the ability to reason about the complete objects in a compositional manner even when an object is completely occluded from certain viewpoints. Meanwhile, humans are able to imagine novel views after observing multiple viewpoints. Recent remarkable advances in multi-view object-centric learning still leave some unresolved problems: 1) The shapes of partially or completely occluded objects cannot be well reconstructed. 2) The novel viewpoint prediction depends on expensive viewpoint annotations rather than implicit rules in view representations. In this paper, we introduce a time-conditioned generative model for videos. To reconstruct the complete shape of an object accurately, we enhance the disentanglement between the latent representations of objects and views, where the latent representations of time-conditioned views…
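As a rough illustration of what time-conditioned view latents modeled with a Gaussian process could look like, the sketch below draws latent trajectories from a GP prior over frame timestamps and then uses the GP posterior mean to estimate latents at unobserved times. This is not the authors' implementation: the RBF kernel, the latent dimension, the jitter value, and the function names `rbf_kernel`, `sample_view_latents`, and `predict_view_latents` are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(t1, t2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of timestamps."""
    diff = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)

def sample_view_latents(times, latent_dim=4, length_scale=1.0, jitter=1e-6, seed=0):
    """Draw one latent trajectory per dimension from a zero-mean GP prior.

    Hypothetical stand-in for time-conditioned view latents; returns an
    array of shape (len(times), latent_dim).
    """
    rng = np.random.default_rng(seed)
    times = np.asarray(times, dtype=float)
    # Jitter on the diagonal keeps the Cholesky factorization stable.
    K = rbf_kernel(times, times, length_scale) + jitter * np.eye(len(times))
    L = np.linalg.cholesky(K)
    eps = rng.standard_normal((len(times), latent_dim))
    return L @ eps

def predict_view_latents(t_obs, z_obs, t_new, length_scale=1.0, jitter=1e-6):
    """GP posterior mean of the view latents at new timestamps."""
    t_obs = np.asarray(t_obs, dtype=float)
    t_new = np.asarray(t_new, dtype=float)
    K_oo = rbf_kernel(t_obs, t_obs, length_scale) + jitter * np.eye(len(t_obs))
    K_no = rbf_kernel(t_new, t_obs, length_scale)
    return K_no @ np.linalg.solve(K_oo, z_obs)

# Usage: latents at five observed frames, then estimates at two in-between times.
observed_times = [0.0, 1.0, 2.0, 3.0, 4.0]
z_observed = sample_view_latents(observed_times, latent_dim=3)
z_between = predict_view_latents(observed_times, z_observed, [1.5, 2.5])
```

Because the latents depend only on timestamps, querying a new time needs no viewpoint annotation, which is the property the abstract highlights. How the paper actually parameterizes and infers these latents is not recoverable from the truncated abstract.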
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Adam · Position-Wise Feed-Forward Layer · Softmax · Linear Layer · Absolute Position Encodings · Dropout · Label Smoothing
