TL;DR
This paper introduces a novel stochastic image-to-video synthesis method using conditional invertible neural networks (cINNs) that models static content and residual dynamics separately, enabling controlled and diverse video generation from a single image.
Contribution
It casts image-to-video synthesis as a bijective mapping, implemented with a cINN: given the initial image, each residual vector corresponds to exactly one video. This allows explicit control over static and dynamic content in generated videos.
Findings
Effective across diverse datasets
Produces high-quality, diverse videos
Enables controlled video synthesis
Abstract
Video understanding calls for a model to learn the characteristic interplay between static scene content and its dynamics: Given an image, the model must be able to predict a future progression of the portrayed scene and, conversely, a video should be explained in terms of its static image content and all the remaining characteristics not present in the initial frame. This naturally suggests a bijective mapping between the video domain and the static content as well as residual information. In contrast to common stochastic image-to-video synthesis, such a model does not merely generate arbitrary videos progressing the initial image. Given this image, it rather provides a one-to-one mapping between the residual vectors and the video with stochastic outcomes when sampling. The approach is naturally implemented using a conditional invertible neural network (cINN) that can explain videos by…
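The one-to-one mapping described in the abstract can be illustrated with a single conditional affine coupling layer, a common building block of invertible networks. This is a minimal sketch under stated assumptions, not the paper's architecture: the names `conditioner`, `make_weights`, `video_to_residual`, and `residual_to_video` are hypothetical, the "video" and "image" are short lists of floats standing in for learned features, and the weights are random rather than trained. A practical cINN would stack many such layers with permutations between them so that every dimension gets transformed.

```python
import math
import random


def make_weights(n_in, n_out, seed=0):
    """Random weights for a toy linear conditioner (hypothetical, untrained)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def conditioner(x1, cond, weights):
    """Map the pass-through half and the image condition to (log-scale, shift)."""
    inp = x1 + cond
    out = [sum(w * v for w, v in zip(row, inp)) for row in weights]
    half = len(out) // 2
    s = [math.tanh(o) for o in out[:half]]  # bounded log-scale keeps exp() stable
    t = out[half:]
    return s, t


def video_to_residual(video, cond, weights):
    """Forward pass: video -> residual, conditioned on the start image."""
    k = len(video) // 2
    x1, x2 = video[:k], video[k:]
    s, t = conditioner(x1, cond, weights)
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    return x1 + y2  # first half passes through unchanged


def residual_to_video(residual, cond, weights):
    """Inverse pass: residual -> video, using the same condition."""
    k = len(residual) // 2
    y1, y2 = residual[:k], residual[k:]
    s, t = conditioner(y1, cond, weights)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return y1 + x2


# Toy stand-ins (assumptions): 3-dim image features, 4-dim video representation.
image = [0.2, -0.1, 0.7]
weights = make_weights(n_in=2 + len(image), n_out=4)
video = [0.5, -1.0, 0.3, 2.0]

# Explaining a video: map it to its residual, then recover it exactly.
residual = video_to_residual(video, image, weights)
reconstructed = residual_to_video(residual, image, weights)

# Stochastic synthesis: different sampled residuals give different videos
# for the same start image.
rng = random.Random(1)
sample_a = residual_to_video([rng.gauss(0, 1) for _ in range(4)], image, weights)
sample_b = residual_to_video([rng.gauss(0, 1) for _ in range(4)], image, weights)
```

Because the scale and shift depend only on the untouched half and on the condition, the inverse can recompute them exactly. That is why the mapping stays bijective for a fixed image, while sampling the residual still yields varied outcomes.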
