PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation

Yaofang Liu; Yumeng Ren; Aitor Artola; Yuxuan Hu; Xiaodong Cun; Xiaotong Zhao; Alan Zhao; Raymond H. Chan; Suiyun Zhang; Rui Liu; Dandan Tu; Jean-Michel Morel

arXiv:2507.16116·cs.CV·July 23, 2025

TL;DR

Pusa introduces a vectorized timestep adaptation method that significantly improves video diffusion models' efficiency and capabilities, enabling high-quality video generation with minimal training cost and dataset size, while preserving the base model's strengths.

Contribution

The paper presents VTA, a non-destructive, scalable approach that enhances temporal control in video diffusion models, surpassing previous methods in efficiency and versatility.

Findings

01. Surpassed Wan-I2V-14B at no more than 1/200 of its training cost.

02. Enabled zero-shot multi-task video generation.

03. Set new standards in image-to-video generation quality.

Abstract

The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa, a groundbreaking paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Besides, VTA is a non-destructive adaptation, which means it fully preserves the capabilities of the base model. By finetuning the SOTA Wan2.1-T2V-14B model with VTA, we achieve unprecedented efficiency -- surpassing the performance of Wan-I2V-14B with $\leq$ 1/200 of the…
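To make the contrast between a scalar and a vectorized timestep concrete, here is a minimal sketch. It is an illustration only, not the authors' implementation: the helper `noise_frames`, the linear interpolation toward Gaussian noise, and the toy "video" of constant frames are all assumptions, since the abstract does not specify the noise schedule or the training procedure.

```python
import random


def noise_frames(frames, timesteps, rng):
    """Blend each frame with Gaussian noise using its own timestep.

    Hypothetical helper for illustration. Assumes a linear interpolation
    where t = 0 keeps the frame clean and t = 1 yields pure noise.
    frames: list of frames, each a list of floats.
    timesteps: one value in [0, 1] per frame.
    """
    if len(frames) != len(timesteps):
        raise ValueError("need exactly one timestep per frame")
    noisy = []
    for frame, t in zip(frames, timesteps):
        noisy.append([(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in frame])
    return noisy


rng = random.Random(0)
num_frames, frame_size = 4, 3
video = [[1.0] * frame_size for _ in range(num_frames)]  # toy stand-in for latents

# Conventional scalar timestep: a single value is shared by every frame,
# so all frames sit at the same noise level.
scalar_t = 0.8
synced = noise_frames(video, [scalar_t] * num_frames, rng)

# Vectorized timestep: one value per frame. Keeping the first frame at
# t = 0 leaves it clean, which is one way an image-to-video style
# conditioning could be expressed (an assumption, not taken from the abstract).
vector_t = [0.0, 0.8, 0.8, 0.8]
i2v_like = noise_frames(video, vector_t, rng)
```

With the scalar setting every frame is perturbed to the same degree, while the vectorized setting leaves `i2v_like[0]` identical to the input frame and noises only the remaining frames. This per-frame freedom is the kind of fine-grained temporal control the abstract attributes to VTA.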

Taxonomy

Topics: Cloud Computing and Resource Management · Advanced Neural Network Applications · Industrial Vision Systems and Defect Detection