Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction

Chengmin Gao; Bin Li

arXiv:2301.08951 · cs.CV · October 27, 2023 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces a time-conditioned generative model for videos that improves object shape reconstruction and novel view prediction by disentangling object and view representations, using Transformers and Gaussian processes.

Contribution

It presents a novel approach combining Transformers and Gaussian processes to enhance object-centric video modeling without requiring viewpoint annotations.

Findings

01 Accurately reconstructs complete object shapes even when occluded.

02 Enables novel view prediction without explicit viewpoint labels.

03 Demonstrates superior performance on multiple datasets.

Abstract

When perceiving the world from multiple viewpoints, humans have the ability to reason about the complete objects in a compositional manner even when an object is completely occluded from certain viewpoints. Meanwhile, humans are able to imagine novel views after observing multiple viewpoints. Recent remarkable advances in multi-view object-centric learning still leave some unresolved problems: 1) The shapes of partially or completely occluded objects cannot be well reconstructed. 2) Novel viewpoint prediction depends on expensive viewpoint annotations rather than implicit rules in view representations. In this paper, we introduce a time-conditioned generative model for videos. To reconstruct the complete shape of an object accurately, we enhance the disentanglement between the latent representations of objects and views, where the latent representations of time-conditioned views…
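
The abstract is cut off before it describes how the time-conditioned view latents are modeled, and the TL;DR only says that Gaussian processes are involved. As a rough illustration of the general idea, and not the authors' formulation, the sketch below draws view latents from a zero-mean Gaussian process prior indexed by timestamp. The squared-exponential kernel, the latent dimension, the jitter value, and all function names (`rbf_kernel`, `cholesky`, `sample_view_latents`) are assumptions made for this example.

```python
import math
import random


def rbf_kernel(times, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between every pair of timestamps (assumed kernel)."""
    return [
        [variance * math.exp(-0.5 * ((s - t) / length_scale) ** 2) for t in times]
        for s in times
    ]


def cholesky(matrix, jitter=1e-6):
    """Return lower-triangular L such that L @ L.T equals matrix + jitter * I."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] + jitter - acc)
            else:
                lower[i][j] = (matrix[i][j] - acc) / lower[j][j]
    return lower


def sample_view_latents(times, latent_dim=2, length_scale=1.0, seed=0):
    """Draw one latent vector per timestamp; each dimension is an independent GP sample."""
    rng = random.Random(seed)
    lower = cholesky(rbf_kernel(times, length_scale))
    n = len(times)
    latents = [[0.0] * latent_dim for _ in range(n)]
    for d in range(latent_dim):
        # Correlate i.i.d. standard normal noise across time via the Cholesky factor.
        noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for i in range(n):
            latents[i][d] = sum(lower[i][k] * noise[k] for k in range(i + 1))
    return latents


if __name__ == "__main__":
    frame_times = [0.0, 1.0, 2.0, 3.0]  # hypothetical frame timestamps
    for t, z in zip(frame_times, sample_view_latents(frame_times)):
        print(t, z)
```

Under such a prior, latents at nearby timestamps are correlated, so a view latent is a smooth function of time rather than of an annotated camera pose. This is one plausible reading of "time-conditioned views"; the paper and the repository listed below should be consulted for the actual model.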

Code & Models

Repositories

FudanVI/compositional-scene-representation-toolbox/tree/main/video-decomposition-prediction
PyTorch · Official

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Adam · Position-Wise Feed-Forward Layer · Softmax · Linear Layer · Absolute Position Encodings · Dropout · Label Smoothing