Layer-Aware Video Composition via Split-then-Merge

Ozgur Kara; Yujia Chen; Ming-Hsuan Yang; James M. Rehg; Wen-Sheng Chu; Du Tran

arXiv:2511.20809·cs.CV·November 27, 2025

Open Access

TL;DR

The paper introduces Split-then-Merge (StM), a novel framework for controllable, realistic video composition that learns from unlabeled videos by separating and recombining foreground and background layers.

Contribution

The authors present StM as the first framework to split unlabeled videos into layers and self-compose them, which lets a model learn dynamic scene interactions without annotated datasets.

Findings

01. Outperforms state-of-the-art methods in quantitative benchmarks.

02. Achieves more realistic and controllable video generation.

03. Maintains foreground fidelity during blending.

Abstract

We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods that rely on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that uses multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show that StM outperforms state-of-the-art (SoTA) methods in both quantitative benchmarks and human- and VLLM-based qualitative evaluations.…
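The split-then-merge idea in the abstract can be illustrated with a toy sketch. This is not the authors' pipeline: it assumes a binary foreground mask is already available for every frame, uses plain mask-based compositing in place of the learned generative model, and stands in a masked L1 distance for the identity-preservation loss, whose actual form the abstract does not specify. The function names `split_layers`, `merge_layers`, and `masked_l1` are hypothetical.

```python
import numpy as np


def split_layers(video, mask):
    """Split a video (T, H, W, C) into foreground and background layers.

    `mask` is a binary array of shape (T, H, W); 1 marks foreground pixels.
    Assumption: the mask is given. How StM obtains its layers is not
    described in the abstract.
    """
    m = mask[..., None].astype(video.dtype)  # broadcast over channels
    foreground = video * m
    background = video * (1.0 - m)
    return foreground, background


def merge_layers(foreground, mask, new_background):
    """Composite a foreground layer onto a (possibly different) background."""
    m = mask[..., None].astype(new_background.dtype)
    return foreground * m + new_background * (1.0 - m)


def masked_l1(pred, target, mask):
    """Mean absolute error restricted to foreground pixels.

    Hypothetical stand-in for an identity-preservation term: it is zero
    when the foreground region of `pred` matches `target` exactly.
    """
    m = mask[..., None].astype(pred.dtype)
    denom = max(float(m.sum()) * pred.shape[-1], 1.0)
    return float(np.abs((pred - target) * m).sum() / denom)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.random((4, 8, 8, 3))        # source clip
    other_scene = rng.random((4, 8, 8, 3))  # background from another clip
    mask = np.zeros((4, 8, 8))
    mask[:, 2:6, 2:6] = 1.0                 # square "subject" region

    fg, bg = split_layers(video, mask)
    recomposed = merge_layers(fg, mask, bg)          # self-composition
    swapped = merge_layers(fg, mask, other_scene)    # new scene

    print("self-composition recovers source:", np.allclose(recomposed, video))
    print("foreground error after swap:", masked_l1(swapped, video, mask))
```

In this sketch, merging a clip's own layers reproduces the clip, which is what makes self-composition usable as supervision without labels; a learned model would replace the hard mask blend with generated, scene-aware blending.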

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Human Pose and Action Recognition