LayerT2V: A Unified Multi-Layer Video Generation Framework
Guangzhao Li, Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai, Lei Zhang, Xiaohong Liu

TL;DR
LayerT2V is a multi-layer text-to-video generation framework that outputs the full video together with editable background and foreground layers, improving semantic and temporal coherence across layers so the results fit professional editing workflows.
Contribution
LayerT2V introduces a unified multi-layer video generation approach built on a shared backbone with LayerAdaLN and layer-aware cross-attention, and contributes the VidLayer dataset for training and evaluation.
Findings
Outperforms prior methods in visual fidelity and temporal coherence
Produces semantically consistent background and foreground layers
Enhances cross-layer coherence and editing flexibility
Abstract
Text-to-video generation has advanced rapidly, but existing methods typically output only the final composited video and lack editable layered representations, limiting their use in professional workflows. We propose LayerT2V, a unified multi-layer video generation framework that produces multiple semantically consistent outputs in a single inference pass: the full video, an independent background layer, and multiple foreground RGB layers with corresponding alpha mattes. Our key insight is that recent video generation backbones use high compression in both time and space, enabling us to serialize multiple layer representations along the temporal dimension and jointly model them on a shared generation trajectory. This turns cross-layer consistency into an intrinsic objective, improving semantic alignment and temporal coherence. To mitigate layer ambiguity and conditional…
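The idea of serializing layers along the temporal dimension can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the layer order, the use of a single foreground layer, and the per-frame layer index are all assumptions made for illustration, and each latent frame is treated as an opaque object.

```python
from typing import Dict, List, Sequence

# Hypothetical layer order; the paper's actual ordering and layer count may differ.
LAYER_ORDER = ["full", "background", "foreground_rgb", "foreground_alpha"]


def serialize_layers(layers: Dict[str, Sequence[object]],
                     order: Sequence[str]) -> List[object]:
    """Concatenate per-layer latent frames along the time axis.

    With K layers of T frames each, the result is one sequence of K * T frames
    that a single backbone could model jointly.
    """
    lengths = {len(layers[name]) for name in order}
    if len(lengths) != 1:
        raise ValueError("every layer must have the same number of latent frames")
    sequence: List[object] = []
    for name in order:
        sequence.extend(layers[name])
    return sequence


def split_layers(sequence: Sequence[object],
                 order: Sequence[str]) -> Dict[str, List[object]]:
    """Invert serialize_layers: cut the joint sequence back into named layers."""
    if not order or len(sequence) % len(order) != 0:
        raise ValueError("sequence length must be a positive multiple of the layer count")
    frames_per_layer = len(sequence) // len(order)
    return {
        name: list(sequence[i * frames_per_layer:(i + 1) * frames_per_layer])
        for i, name in enumerate(order)
    }


def layer_ids(order: Sequence[str], frames_per_layer: int) -> List[int]:
    """Per-frame layer index (assumed input for any layer-aware conditioning)."""
    ids: List[int] = []
    for index in range(len(order)):
        ids.extend([index] * frames_per_layer)
    return ids


# Toy example: 4 layers, 3 latent frames each, frames represented by labels.
example = {name: [f"{name}_t{t}" for t in range(3)] for name in LAYER_ORDER}
sequence = serialize_layers(example, LAYER_ORDER)
restored = split_layers(sequence, LAYER_ORDER)
```

In this sketch, frames of the same layer stay contiguous, so a model attending over the joint sequence sees every layer on one shared trajectory, and the split step recovers the individual layers after generation.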
