DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation

Xiangchen Yin; Jiahui Yuan; Zhangchi Hu; Wenzhang Sun; Jie Chen; Xiaozhen Qiao; Hao Li; Xiaoyan Sun

arXiv:2511.14530·cs.CV·November 19, 2025

TL;DR

DeCo-VAE introduces a decoupled approach to video VAEs, decomposing content into keyframes, motion, and residuals, leading to more compact latent representations and improved reconstruction quality.

Contribution

The paper proposes a decoupled VAE architecture that uses a dedicated encoder for each video component (keyframe, motion, and residual) and a shared 3D decoder, with the aim of more compact latents and more accurate reconstruction.

Findings

01. Achieves superior video reconstruction performance

02. Effectively decomposes video content into distinct components

03. Ensures stable training through decoupled adaptation strategy

Abstract

Existing video Variational Autoencoders (VAEs) generally overlook the similarity between frame contents, leading to redundant latent modeling. In this paper, we propose a decoupled VAE (DeCo-VAE) to achieve a compact latent representation. Instead of encoding RGB pixels directly, we decompose video content into distinct components via explicit decoupling (keyframe, motion, and residual) and learn a dedicated latent representation for each. To avoid cross-component interference, we design dedicated encoders for each decoupled component and adopt a shared 3D decoder to maintain spatiotemporal consistency during reconstruction. We further utilize a decoupled adaptation strategy that freezes some of the encoders while training the others sequentially, ensuring stable training and accurate learning of both static and dynamic features. Extensive quantitative and qualitative experiments demonstrate that…
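
The abstract does not say how the keyframe, motion, and residual components are computed, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a grayscale clip of shape (T, H, W), takes the first frame as the keyframe, models motion as a single global integer shift per frame found by brute-force search, and stores whatever the shifted keyframe fails to explain as the residual. The function names `decompose` and `reconstruct` and the parameter `max_shift` are hypothetical.

```python
import numpy as np


def decompose(video, max_shift=2):
    """Split a (T, H, W) clip into keyframe, per-frame motion, and residuals.

    Assumptions (not from the paper): the keyframe is frame 0, and motion
    is one global (dy, dx) shift per frame, applied with wrap-around.
    """
    keyframe = video[0]
    candidates = [(dy, dx)
                  for dy in range(-max_shift, max_shift + 1)
                  for dx in range(-max_shift, max_shift + 1)]
    motion, residual = [], []
    for frame in video:
        # Choose the shift whose warped keyframe is closest to this frame.
        best = min(
            candidates,
            key=lambda s: np.abs(frame - np.roll(keyframe, s, axis=(0, 1))).sum(),
        )
        motion.append(best)
        # The residual is whatever the warped keyframe does not explain.
        residual.append(frame - np.roll(keyframe, best, axis=(0, 1)))
    return keyframe, np.array(motion), np.stack(residual)


def reconstruct(keyframe, motion, residual):
    """Invert decompose: warp the keyframe per frame and add the residual."""
    frames = [
        np.roll(keyframe, (int(dy), int(dx)), axis=(0, 1)) + r
        for (dy, dx), r in zip(motion, residual)
    ]
    return np.stack(frames)
```

For signed integer arrays, `reconstruct(*decompose(video))` returns the input exactly; for floating-point input it matches up to rounding. When consecutive frames are similar, the residual is mostly zeros, which is the kind of redundancy a decoupled representation could exploit. In DeCo-VAE each component would then go to its own learned encoder rather than being stored raw as it is here.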

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Face recognition and analysis