LayerT2V: A Unified Multi-Layer Video Generation Framework

Guangzhao Li; Kangrui Cen; Baixuan Zhao; Yi Xin; Siqi Luo; Guangtao Zhai; Lei Zhang; Xiaohong Liu

arXiv:2508.04228 · cs.CV · February 27, 2026

TL;DR

LayerT2V is a multi-layer video generation framework that produces layered, editable videos with improved semantic and temporal coherence, making the output usable in professional, layer-based editing workflows.

Contribution

The paper introduces a unified multi-layer video generation approach that combines a shared backbone, LayerAdaLN, and layer-aware cross-attention, and contributes the VidLayer dataset for training and evaluation.

Findings

01. Outperforms prior methods in visual fidelity and temporal coherence

02. Produces semantically consistent background and foreground layers

03. Enhances cross-layer coherence and editing flexibility

Abstract

Text-to-video generation has advanced rapidly, but existing methods typically output only the final composited video and lack editable layered representations, limiting their use in professional workflows. We propose LayerT2V, a unified multi-layer video generation framework that produces multiple semantically consistent outputs in a single inference pass: the full video, an independent background layer, and multiple foreground RGB layers with corresponding alpha mattes. Our key insight is that recent video generation backbones use high compression in both time and space, enabling us to serialize multiple layer representations along the temporal dimension and jointly model them on a shared generation trajectory. This turns cross-layer consistency into an intrinsic objective, improving semantic alignment and temporal coherence. To mitigate layer ambiguity and conditional…
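The idea of serializing layers along the temporal dimension can be illustrated with a toy sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the function names (`serialize_layers`, `deserialize_layers`, `composite_over`) are hypothetical, the layer ordering is arbitrary, frames are small grayscale grids instead of compressed latents, and the recombination step uses the standard alpha "over" operator, which the abstract does not confirm as the method used.

```python
from typing import List, Sequence

Frame = List[List[float]]  # H x W grid with values in [0, 1] (toy stand-in for a latent or RGB frame)
Video = List[Frame]        # T frames


def serialize_layers(layers: Sequence[Video]) -> Video:
    """Concatenate equal-length layer videos along the time axis (assumed ordering)."""
    if len({len(video) for video in layers}) != 1:
        raise ValueError("all layers must have the same number of frames")
    return [frame for video in layers for frame in video]


def deserialize_layers(sequence: Video, num_layers: int) -> List[Video]:
    """Split a serialized sequence back into num_layers equal-length videos."""
    if num_layers <= 0 or len(sequence) % num_layers != 0:
        raise ValueError("sequence length must be a multiple of num_layers")
    frames_per_layer = len(sequence) // num_layers
    return [
        sequence[i * frames_per_layer:(i + 1) * frames_per_layer]
        for i in range(num_layers)
    ]


def composite_over(
    background: Video,
    foregrounds: Sequence[Video],
    alphas: Sequence[Video],
) -> Video:
    """Blend foreground layers onto the background, back to front, with the 'over' operator."""
    out = [[row[:] for row in frame] for frame in background]  # copy, leave input untouched
    for fg, alpha in zip(foregrounds, alphas):
        for t, frame in enumerate(out):
            for y, row in enumerate(frame):
                for x, value in enumerate(row):
                    a = alpha[t][y][x]
                    row[x] = a * fg[t][y][x] + (1.0 - a) * value
    return out


# Toy example: 2 frames, each 1 x 2 pixels.
bg = [[[0.0, 0.0]], [[0.0, 0.0]]]
fg = [[[1.0, 1.0]], [[1.0, 1.0]]]
alpha = [[[1.0, 0.0]], [[0.5, 0.5]]]

sequence = serialize_layers([bg, fg, alpha])    # one 6-frame sequence
layers = deserialize_layers(sequence, 3)        # back to [bg, fg, alpha]
full_video = composite_over(layers[0], [layers[1]], [layers[2]])
```

Because all layers sit in one sequence, a model that attends across time also attends across layers, which is one way to read the abstract's claim that cross-layer consistency becomes part of the generation objective.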
