SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation

Kehong Gong; Zhengyu Wen; Mingxi Xu; Weixia He; Qi Wang; Ning Zhang; Zhengyu Li; Chenbin Li; Dongze Lian; Wei Zhao; Xiaoyu He; and Mingyuan Zhang

arXiv:2512.10860·cs.CV·December 12, 2025

TL;DR

SWiT-4D is a lossless, parameter-free sliding-window transformer framework that converts monocular videos into high-quality 4D meshes. It builds on image-to-3D priors and needs only minimal 4D supervision.

Contribution

The paper presents SWiT-4D, a sliding-window transformer that integrates with existing image-to-3D models to enable lossless, parameter-free temporal 4D mesh reconstruction from videos.

Findings

01. Achieves high-fidelity 4D meshes with only short video fine-tuning.
02. Outperforms existing methods in temporal smoothness and stability.
03. Demonstrates strong data efficiency and generalization across benchmarks.

Abstract

Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling…
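The abstract is cut off before it describes the mechanism, so the following is only a generic illustration of how sliding-window attention across frames can reuse a single-image model's weights. It is not the authors' implementation. It assumes that "parameter-free" means the temporal step reuses the existing query/key/value projections, and that "lossless" means a window of one frame reproduces the original single-image forward pass. The function names, tensor shapes, and the `radius` parameter are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frame_attention(x, wq, wk, wv):
    """Ordinary self-attention over one frame's tokens. x: (N, D)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def sliding_window_attention(frames, wq, wk, wv, radius):
    """Attention where frame t also attends to frames within `radius`.

    frames: (T, N, D). No new weights are introduced (assumption: this is
    what "parameter-free" refers to); the same wq, wk, wv are reused.
    """
    num_frames, _, dim = frames.shape
    outputs = []
    for t in range(num_frames):
        lo, hi = max(0, t - radius), min(num_frames, t + radius + 1)
        # Keys and values come from all tokens of the frames in the window.
        context = frames[lo:hi].reshape(-1, dim)
        q = frames[t] @ wq
        k = context @ wk
        v = context @ wv
        scores = q @ k.T / np.sqrt(q.shape[-1])
        outputs.append(softmax(scores) @ v)
    return np.stack(outputs)


rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 5, 8))  # 4 frames, 5 tokens each, width 8
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))

per_frame = np.stack([frame_attention(f, wq, wk, wv) for f in frames])
same = sliding_window_attention(frames, wq, wk, wv, radius=0)
mixed = sliding_window_attention(frames, wq, wk, wv, radius=1)
```

With `radius=0` the window contains only the current frame, so `same` matches `per_frame`. With `radius=1` each frame also draws on its neighbours, which is one way temporal context could be added without changing the weights.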

Taxonomy

Topics: 3D Shape Modeling and Analysis · Advanced Vision and Imaging · Computer Graphics and Visualization Techniques