Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion
Huaize Liu, Wenzhang Sun, Qiyuan Zhang, Donglin Di, Biao Gong, Hao Li, Chen Wei, Changqing Zou

TL;DR
Hi-VAE introduces a hierarchical video autoencoder that efficiently compresses video dynamics into global and detailed motion representations, achieving high compression rates while maintaining quality and enabling effective downstream tasks.
Contribution
The paper presents Hi-VAE, a novel hierarchical framework that decomposes video motion into global and detailed components for efficient encoding and high compression.
Findings
Achieves a compression factor of 1428×, significantly outperforming baseline methods.
Maintains high-quality video reconstruction at extreme compression rates.
Demonstrates effectiveness in downstream generative tasks.
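To make the compression-factor finding concrete, here is a minimal sketch of how such a ratio is commonly computed. It assumes the factor is the number of input video elements divided by the total number of latent elements, which this summary does not confirm as the paper's exact definition. The function name `compression_factor` and every tensor shape below are hypothetical, chosen only for illustration, and are not meant to reproduce the reported 1428×.

```python
from math import prod


def compression_factor(video_shape, latent_shapes):
    """Ratio of input element count to total latent element count.

    Assumption: this element-count definition is a common convention,
    not necessarily the one used in the paper.
    """
    n_input = prod(video_shape)
    n_latent = sum(prod(shape) for shape in latent_shapes)
    if n_latent == 0:
        raise ValueError("latent representation must be non-empty")
    return n_input / n_latent


# Hypothetical shapes (not taken from the paper).
video = (3, 16, 256, 256)        # channels, frames, height, width
global_motion = (16, 4, 4, 4)    # a small, coarse latent
detailed_motion = (16, 4, 8, 8)  # a larger, finer latent

cf = compression_factor(video, [global_motion, detailed_motion])
print(f"compression factor: {cf:.1f}x")
```

Because both latent spaces count toward the denominator, shrinking either the global or the detailed representation raises the factor.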
Abstract
Recent breakthroughs in video autoencoders (Video AEs) have advanced video generation, but existing methods fail to efficiently model spatio-temporal redundancies in dynamics, resulting in suboptimal compression factors. This shortfall leads to excessive training costs for downstream tasks. To address this, we introduce Hi-VAE, an efficient video autoencoding framework that hierarchically encodes coarse-to-fine motion representations of video dynamics and formulates the decoding process as a conditional generation task. Specifically, Hi-VAE decomposes video dynamics into two latent spaces: Global Motion, capturing overarching motion patterns, and Detailed Motion, encoding high-frequency spatial details. Using separate self-supervised motion encoders, we compress video latents into compact motion representations to reduce redundancy significantly. A conditional diffusion decoder then…
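The coarse-to-fine idea in the abstract can be illustrated with a toy decomposition. This sketch is not Hi-VAE: it swaps the learned self-supervised motion encoders for simple block averaging and the conditional diffusion decoder for plain addition, and it works on a 1-D list rather than video latents. The names `coarse_to_fine` and `reconstruct` are hypothetical. It only shows how a signal can be split into a small coarse part and a residual carrying the high-frequency detail.

```python
def coarse_to_fine(signal, factor):
    """Split a 1-D signal into block means (coarse) and a residual (fine)."""
    if len(signal) % factor != 0:
        raise ValueError("signal length must be divisible by factor")
    # Coarse part: one mean per block of `factor` samples.
    coarse = [sum(signal[i:i + factor]) / factor
              for i in range(0, len(signal), factor)]
    # Fine part: what is left after removing the repeated block means.
    upsampled = [c for c in coarse for _ in range(factor)]
    fine = [s - u for s, u in zip(signal, upsampled)]
    return coarse, fine


def reconstruct(coarse, fine, factor):
    """Invert coarse_to_fine by re-adding the residual to the block means."""
    upsampled = [c for c in coarse for _ in range(factor)]
    return [u + f for u, f in zip(upsampled, fine)]


signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
coarse, fine = coarse_to_fine(signal, factor=4)
restored = reconstruct(coarse, fine, factor=4)
print(coarse)  # block means
print(fine)    # per-sample residual
```

Unlike this toy, which is lossless and leaves the residual uncompressed, the abstract describes both levels as compact learned latents.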
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Human Motion and Animation
