VidTwin: Video VAE with Decoupled Structure and Dynamics

Yuchi Wang; Junliang Guo; Xinyi Xie; Tianyu He; Xu Sun; Jiang Bian

arXiv:2412.17726 · cs.CV · March 31, 2025

TL;DR

VidTwin is a video autoencoder that decouples video into separate Structure and Dynamics latent spaces, enabling high compression, high-quality reconstruction, and scalable video generation.

Contribution

The paper proposes a compact, explainable video VAE whose latent space is split into Structure and Dynamics components, improving both video compression and downstream generation.

Findings

01

Achieves a compression rate of 0.20% with a PSNR of 28.14 on the MCL-JCV dataset

02

Performs effectively on downstream video generation tasks

03

Shows scalability and explainability in latent video representation
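
To make the compression-rate figure concrete, the sketch below computes a compression rate as the ratio of latent elements to input video elements. That definition is an assumption here, and the function name `compression_rate` and all tensor shapes are hypothetical values chosen for illustration; they are not VidTwin's actual latent or input sizes.

```python
from math import prod


def compression_rate(latent_shapes, video_shape):
    """Return (total latent elements) / (input video elements).

    Assumption: compression rate is measured as an element-count ratio.
    latent_shapes is a list of shapes, one per latent space (for example
    one Structure latent and one Dynamics latent).
    """
    latent_elems = sum(prod(shape) for shape in latent_shapes)
    return latent_elems / prod(video_shape)


# Hypothetical shapes, for illustration only (not taken from the paper).
video_shape = (8, 64, 64, 3)       # frames, height, width, RGB channels
structure_shape = (2, 4, 4, 4)     # illustrative Structure latent
dynamics_shape = (8, 8)            # illustrative Dynamics latent

rate = compression_rate([structure_shape, dynamics_shape], video_shape)
print(f"compression rate: {rate:.4%}")
```

Under this definition, a lower percentage means a more compact latent representation for the same input video.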

Abstract

Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules for extracting these latent spaces, respectively. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks to remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high…
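
The spatial-averaging step described for the second submodule can be sketched as follows. This is a minimal illustration using NumPy, assuming a latent tensor laid out as (frames, height, width, channels); the layout, the function name `spatial_average`, and the example sizes are assumptions, not the authors' implementation.

```python
import numpy as np


def spatial_average(latent: np.ndarray) -> np.ndarray:
    """Average a (T, H, W, C) latent over its spatial axes.

    The result has shape (T, C): one vector per frame, which keeps
    frame-to-frame variation while discarding spatial layout.
    """
    if latent.ndim != 4:
        raise ValueError("expected a latent of shape (T, H, W, C)")
    return latent.mean(axis=(1, 2))


# Hypothetical latent: 2 frames, a 2x2 spatial grid, 3 channels.
latent = np.arange(2 * 2 * 2 * 3, dtype=float).reshape(2, 2, 2, 3)
per_frame = spatial_average(latent)
print(per_frame.shape)
```

Collapsing the spatial axes leaves only per-frame vectors, which is one plausible reading of how averaging isolates temporal change from spatial content.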

Code & Models

Repositories

microsoft/vidtok
PyTorch · Official

Models

🤗 microsoft/vidtwin

Taxonomy

Topics: Video Analysis and Summarization · Computer Graphics and Visualization Techniques · Image and Video Quality Assessment