Uniform Discrete Diffusion with Metric Path for Video Generation

Haoge Deng; Ting Pan; Fan Zhang; Yang Liu; Zhuoyan Luo; Yufeng Cui; Wenxuan Wang; Chunhua Shen; Shiguang Shan; Zhaoxiang Zhang; Xinlong Wang

arXiv:2510.24717·cs.CV·October 29, 2025

TL;DR

URSA is a discrete diffusion framework that combines a linearized metric path with resolution-dependent timestep shifting. It scales to high-resolution images and long-duration videos with fewer inference steps, reaching performance comparable to continuous diffusion methods.

Contribution

It presents URSA, a new discrete video generation method with innovative designs that improve scalability and efficiency, bridging the gap with continuous approaches.

Findings

01. Outperforms existing discrete methods on benchmarks.

02. Achieves comparable performance to state-of-the-art continuous diffusion models.

03. Supports versatile tasks like interpolation and image-to-video generation.

Abstract

Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within…
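The abstract names a Resolution-dependent Timestep Shifting mechanism but this page does not give its formula. As a rough illustration only, the sketch below uses the shift commonly applied in continuous flow-matching image models, where a noise level in [0, 1] is warped by a factor that grows with the number of tokens. The function names (`shift_noise_level`, `resolution_to_shift`, `make_schedule`), the square-root scaling rule, and the `base_tokens` default are all assumptions made for this example and are not taken from the paper. The code treats 1.0 as pure noise; a formulation that places data at t = 1 would use `1 - t` as the noise level.

```python
import math


def shift_noise_level(sigma: float, shift: float) -> float:
    """Warp a noise level sigma in [0, 1] (1.0 = pure noise, an assumed convention).

    shift > 1 keeps the sampler at high noise levels for longer;
    shift == 1 leaves the schedule unchanged. Endpoints 0 and 1 are fixed.
    """
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    if shift <= 0.0:
        raise ValueError("shift must be positive")
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def resolution_to_shift(num_tokens: int, base_tokens: int = 256) -> float:
    """Hypothetical rule: the shift grows with the square root of the token-count ratio."""
    if num_tokens <= 0 or base_tokens <= 0:
        raise ValueError("token counts must be positive")
    return math.sqrt(num_tokens / base_tokens)


def make_schedule(num_steps: int, shift: float) -> list[float]:
    """Noise levels from 1.0 down to 0.0, uniformly spaced and then warped."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [shift_noise_level(1.0 - i / num_steps, shift) for i in range(num_steps + 1)]


if __name__ == "__main__":
    shift = resolution_to_shift(num_tokens=1024)  # 4x the base token count
    for sigma in make_schedule(num_steps=5, shift=shift):
        print(f"{sigma:.3f}")
```

With a larger shift, more of the sampling steps fall at high noise levels, which is the usual motivation for tying the shift to resolution: longer token sequences need more noise to destroy the same amount of signal.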

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

This work represents a solid application of discrete diffusion models to video generation. The experiments comprehensively show UDM's advantages on multiple generation benchmarks, achieving competitive results with both discrete and continuous baselines. The proposed framework demonstrates the potential for scalable and unified visual generation.

Weaknesses

The paper's claimed key innovations lack sufficient novelty and rigorous justification.

* **Metric Path Novelty:** The core concept of a metric path has been previously proposed and applied to multimodal tasks [1,2]. While cited, the discussion of the relationship to these works is inadequate. The primary novelty lies in the "linear relationship" established by Eq. 4. However, the motivation and precise meaning of preserving a "linear relationship between $t$ and $d(x_t, x_1)$" in lines 229-230

Reviewer 02 · Rating 4 · Confidence 4

Strengths

* Clarity & focus: Clean objective and sampling recipe; equations and schedules are easy to implement.
* Unification: One model handles multiple video tasks via per-frame timesteps.
* Practicality: Simple schedules; no exotic losses or architectures required.
* Results: Broad benchmarks indicate strong quality and temporal stability with modest steps.
* Positioning: Bridges discrete tokenization with diffusion-style global refinement, relevant for LLM-aligned video generation.

Weaknesses

* Novelty overlap: The *frame-level/asynchronous timestep* contribution is close to prior video schedulers (e.g., SkyReels-V2) and essentially the same idea as Pusa & FVDM; the paper should clarify what is new beyond this reuse.
* Relation to DFM/schedulers: The metric path and schedule resemble discrete flow matching/kinetic schedules; limited theory for superiority beyond heuristic tuning.
* Ablation depth: Missing systematic sweeps for the shift parameter $\lambda$; marginal gains of asynchrono

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The linearized metric path and timestep shifting mechanism look sound to me. The motivation is clear and the theoretical explanation is reasonable. I think the authors have found an effective way to make the discrete diffusion model more powerful.
2. The experimental analysis is comprehensive. The authors compare against many state-of-the-art methods on three tasks: text-to-image generation, text-to-video generation, and image-to-video generation on several benchmarks. The results show that the propos

Weaknesses

1. The scalability of the proposed uniform discrete diffusion method is unclear. Since the authors claim scalability as one of the key highlights of the proposed method, it is crucial to conduct experiments on different model sizes to probe it. This part of the experiment is completely missing from both the main manuscript and the attachment.
2. Some model details are missing. What is the size of the tokenizer and the Qwen LLM? What is the rationale for choosing Qwe
