Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion

Huaize Liu; Wenzhang Sun; Qiyuan Zhang; Donglin Di; Biao Gong; Hao Li; Chen Wei; Changqing Zou

arXiv:2506.07136 · cs.CV · June 10, 2025

TL;DR

Hi-VAE introduces a hierarchical video autoencoder that efficiently compresses video dynamics into global and detailed motion representations, achieving high compression rates while maintaining quality and enabling effective downstream tasks.

Contribution

The paper presents Hi-VAE, a novel hierarchical framework that decomposes video motion into global and detailed components for efficient encoding and high compression.

Findings

01

Achieves a compression factor of 1428×, significantly outperforming baseline methods.

02

Maintains high-quality video reconstruction at extreme compression rates.

03

Demonstrates effectiveness in downstream generative tasks.
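To make the compression-factor figure concrete, here is a minimal sketch of how such a ratio is commonly computed: the number of elements in the input video divided by the total number of elements in the latent representation. This listing does not give the paper's exact definition or its latent shapes, so the function name `compression_factor` and every shape below are hypothetical and chosen only for illustration.

```python
from math import prod


def compression_factor(input_shape, latent_shapes):
    """Ratio of input element count to total latent element count.

    Assumption: this is the common element-count definition, not
    necessarily the one used in the paper.
    """
    input_elems = prod(input_shape)
    latent_elems = sum(prod(shape) for shape in latent_shapes)
    if latent_elems == 0:
        raise ValueError("latent representation must be non-empty")
    return input_elems / latent_elems


# Hypothetical shapes, NOT taken from the paper.
video = (3, 16, 256, 256)      # channels, frames, height, width
global_motion = (16, 64)       # illustrative global-motion latent
detailed_motion = (16, 128)    # illustrative detailed-motion latent

factor = compression_factor(video, [global_motion, detailed_motion])
print(f"compression factor: {factor:.0f}x")
```

Under this definition, both latent spaces count toward the denominator, so shrinking either one raises the factor.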

Abstract

Recent breakthroughs in video autoencoders (Video AEs) have advanced video generation, but existing methods fail to efficiently model spatio-temporal redundancies in dynamics, resulting in suboptimal compression factors. This shortfall leads to excessive training costs for downstream tasks. To address this, we introduce Hi-VAE, an efficient video autoencoding framework that hierarchically encodes coarse-to-fine motion representations of video dynamics and formulates the decoding process as a conditional generation task. Specifically, Hi-VAE decomposes video dynamics into two latent spaces: Global Motion, capturing overarching motion patterns, and Detailed Motion, encoding high-frequency spatial details. Using separate self-supervised motion encoders, we compress video latents into compact motion representations to reduce redundancy significantly. A conditional diffusion decoder then…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Human Motion and Animation