TL;DR
URSA introduces a novel discrete diffusion framework with metric path and resolution-dependent timestep shifting, enabling efficient high-resolution and long-duration video generation with fewer inference steps, matching continuous methods' performance.
Contribution
It presents URSA, a new discrete video generation method with innovative designs that improve scalability and efficiency, bridging the gap with continuous approaches.
Findings
Outperforms existing discrete methods on benchmarks.
Achieves comparable performance to state-of-the-art continuous diffusion models.
Supports versatile tasks like interpolation and image-to-video generation.
Abstract
Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within…
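The asynchronous, per-frame timestep idea mentioned in the abstract and in the reviews can be illustrated with a small sketch. This is not the authors' code: the function name `frame_timesteps`, the task names, and the convention that t = 1 means clean data and t = 0 means pure noise are all assumptions made for illustration. It only shows how giving each frame its own timestep could let one model cover text-to-video, image-to-video, and interpolation.

```python
from typing import List

CLEAN = 1.0  # assumed convention: t = 1 is clean data, t = 0 is pure noise


def frame_timesteps(task: str, num_frames: int, t: float) -> List[float]:
    """Return one timestep per frame (hypothetical helper, not from the paper)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if num_frames < 1:
        raise ValueError("num_frames must be positive")

    steps = [t] * num_frames  # default: every frame shares the same timestep
    if task == "text_to_video":
        return steps
    if task == "image_to_video":
        steps[0] = CLEAN  # the first frame is given, so it stays clean
        return steps
    if task == "interpolation":
        steps[0] = CLEAN  # both end frames are given
        steps[-1] = CLEAN
        return steps
    raise ValueError(f"unknown task: {task}")


print(frame_timesteps("image_to_video", 4, 0.25))  # [1.0, 0.25, 0.25, 0.25]
```

Under these assumptions, conditioning frames are simply held at the clean timestep while the remaining frames are refined, so the task is selected by the timestep pattern rather than by a separate model.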
Peer Reviews
Decision·ICLR 2026 Poster
This work represents a solid application of discrete diffusion models to video generation. The experiments comprehensively show UDM's advantages on multiple generation benchmarks, achieving competitive results with both discrete and continuous baselines. The proposed framework demonstrates the potential for scalable and unified visual generation.
The paper's claimed key innovations lack sufficient novelty and rigorous justification.
* **Metric Path Novelty:** The core concept of a metric path has been previously proposed and applied to multimodal tasks [1,2]. While cited, the discussion of the relationship to these works is inadequate. The primary novelty lies in the "linear relationship" established by Eq. 4. However, the motivation and precise meaning of preserving a "linear relationship between $t$ and $d(x_t, x_1)$" in lines 229-230
* Clarity & focus: Clean objective and sampling recipe; equations and schedules are easy to implement.
* Unification: One model handles multiple video tasks via per-frame timesteps.
* Practicality: Simple schedules; no exotic losses or architectures required.
* Results: Broad benchmarks indicate strong quality and temporal stability with modest steps.
* Positioning: Bridges discrete tokenization with diffusion-style global refinement, relevant for LLM-aligned video generation.
* Novelty overlap: The *frame-level/asynchronous timestep* contribution is close to prior video schedulers (e.g., SkyReels-V2) and essentially the same idea as Pusa & FVDM; the paper should clarify what is new beyond this reuse.
* Relation to DFM/schedulers: The metric path and schedule resemble discrete flow matching/kinetic schedules; limited theory for superiority beyond heuristic tuning.
* Ablation depth: Missing systematic sweeps for the shift parameter $\lambda$; marginal gains of asynchrono
1. The linearized metric path and timestep shifting mechanism look sound to me. The motivation is clear and the theoretical explanation is reasonable. I think the authors have found an effective way to make the discrete diffusion model more powerful.
2. The experimental analysis is comprehensive. The authors compare against many state-of-the-art methods on three tasks (text-to-image generation, text-to-video generation, and image-to-video generation) across several benchmarks. The results show that the propos
1. The scalability of the proposed uniform discrete diffusion method is unclear. Since the authors claim scalability as one of the key highlights of the proposed method, it is crucial to conduct experiments on different model sizes to probe it. These experiments are completely missing from both the main manuscript and the attachment.
2. Some model details are missing. What is the size of the tokenizer and the Qwen LLM? What is the rationale for choosing Qwe
