Large Motion Video Autoencoding with Cross-modal Video VAE
Yazhou Xing, Yang Fei, Yingqing He, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen

TL;DR
This paper introduces a novel video autoencoder that combines temporal-aware spatial compression, lightweight motion modeling, and text guidance to achieve high-fidelity, stable, and versatile video encoding and reconstruction.
Contribution
It proposes a new video VAE architecture that effectively disentangles spatial and temporal compression, incorporates text guidance, and is trained jointly on images and videos for improved performance.
Findings
Outperforms recent baselines in video reconstruction quality
Achieves high-fidelity and temporally stable video encoding
Enables joint image and video autoencoding
Abstract
Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model…
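The compression-rate point in the abstract can be illustrated with simple arithmetic. The sketch below compares the latent size of a per-frame image VAE with that of a video VAE that also compresses along time. All numbers here (8x spatial downsampling, 4x temporal downsampling, 4 latent channels, a 16-frame 256x256 clip) and both function names are illustrative assumptions, not values or code taken from the paper.

```python
def latent_elements(frames, height, width, latent_channels,
                    spatial_factor, temporal_factor=1):
    """Count latent values for a clip after spatial/temporal downsampling.

    Ceiling division is used so a partial block still maps to one latent position.
    """
    t = -(-frames // temporal_factor)
    h = -(-height // spatial_factor)
    w = -(-width // spatial_factor)
    return t * h * w * latent_channels


def compression_ratio(frames, height, width, latent_size, pixel_channels=3):
    """Ratio of raw pixel values to latent values."""
    return frames * height * width * pixel_channels / latent_size


# Hypothetical clip and downsampling settings (assumptions for illustration only).
frames, height, width = 16, 256, 256

# Image VAE applied to each frame independently: no temporal compression.
per_frame = latent_elements(frames, height, width,
                            latent_channels=4, spatial_factor=8)

# Video VAE that additionally downsamples time by an assumed factor of 4.
video = latent_elements(frames, height, width,
                        latent_channels=4, spatial_factor=8, temporal_factor=4)

print(per_frame, compression_ratio(frames, height, width, per_frame))
print(video, compression_ratio(frames, height, width, video))
```

Under these assumed settings, the video VAE's latent is smaller than the per-frame latent by exactly the temporal factor, which is the gap the abstract attributes to a lack of temporal compression.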
Taxonomy
Topics: Advanced Image Processing Techniques · Image and Video Quality Assessment · Image and Signal Denoising Methods
