Large Motion Video Autoencoding with Cross-modal Video VAE

Yazhou Xing; Yang Fei; Yingqing He; Jingye Chen; Jiaxin Xie; Xiaowei Chi; Qifeng Chen

arXiv:2412.17805·cs.CV·December 24, 2024

TL;DR

This paper introduces a novel video autoencoder that combines temporal-aware spatial compression, lightweight motion modeling, and text guidance to achieve high-fidelity, stable, and versatile video encoding and reconstruction.

Contribution

It proposes a new video VAE architecture that effectively disentangles spatial and temporal compression, incorporates text guidance, and is trained jointly on images and videos for improved performance.

Findings

01 Outperforms recent baselines in video reconstruction quality

02 Achieves high-fidelity and temporally stable video encoding

03 Enables joint image and video autoencoding

Abstract

Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model…
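The abstract's point about per-frame image VAEs lacking temporal compression can be made concrete with some shape arithmetic. The sketch below is illustrative only: the 8x spatial factor, 4x temporal factor, 4 latent channels, and all function names are assumptions chosen for the example, not values or APIs taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentShape:
    frames: int
    height: int
    width: int
    channels: int

    @property
    def numel(self) -> int:
        """Total number of latent values."""
        return self.frames * self.height * self.width * self.channels


def per_frame_image_vae(frames: int, height: int, width: int,
                        spatial_factor: int = 8, channels: int = 4) -> LatentShape:
    """Hypothetical image VAE applied frame by frame: the frame axis is untouched."""
    return LatentShape(frames, height // spatial_factor,
                       width // spatial_factor, channels)


def video_vae(frames: int, height: int, width: int,
              spatial_factor: int = 8, temporal_factor: int = 4,
              channels: int = 4) -> LatentShape:
    """Hypothetical video VAE: the frame axis is also downsampled (assumed factor)."""
    latent_frames = -(-frames // temporal_factor)  # ceiling division
    return LatentShape(latent_frames, height // spatial_factor,
                       width // spatial_factor, channels)


def compression_ratio(frames: int, height: int, width: int,
                      latent: LatentShape, input_channels: int = 3) -> float:
    """Ratio of input values (RGB pixels) to latent values."""
    return frames * height * width * input_channels / latent.numel


if __name__ == "__main__":
    clip = (16, 256, 256)  # frames, height, width of an example clip
    image_latent = per_frame_image_vae(*clip)
    video_latent = video_vae(*clip)
    print("per-frame image VAE:", image_latent, compression_ratio(*clip, image_latent))
    print("video VAE:          ", video_latent, compression_ratio(*clip, video_latent))
```

Under these assumed factors, adding temporal downsampling multiplies the compression ratio by the temporal factor. That is the gain the abstract attributes to video VAEs, and it says nothing about reconstruction quality.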

Taxonomy

Topics: Advanced Image Processing Techniques · Image and Video Quality Assessment · Image and Signal Denoising Methods