FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

Jin Wang; Yao Lai; Aoxue Li; Shifeng Zhang; Jiacheng Sun; Ning Kang; Chengyue Wu; Zhenguo Li; Ping Luo

arXiv:2505.20147 · cs.CV · July 25, 2025

TL;DR

FUDOKI introduces a discrete flow-based unified multimodal model that surpasses autoregressive limitations, enabling iterative refinement and bidirectional context integration for visual understanding and image generation.

Contribution

The paper presents FUDOKI as the first discrete flow matching approach for unified multimodal models, replacing autoregressive architectures with a more flexible, iterative, and self-correcting framework.

Findings

01 Achieves performance comparable to state-of-the-art AR-based models on visual understanding and image generation.

02 Enables iterative refinement and bidirectional context integration during generation.

03 Improves significantly when test-time scaling is applied.

Abstract

The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context…

Code & Models

Models

🤗 LucasJinWang/FUDOKI · model · 211 downloads · ♡ 4

Videos

FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics