TL;DR
Dynin-Omni is a masked-diffusion-based model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture, enabling iterative refinement across modalities.
Contribution
It introduces what the authors describe as the first omnimodal foundation model built on masked diffusion over a shared discrete token space, trained with a multi-stage strategy that uses model-merging-based modality expansion and omnimodal alignment.
Findings
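Evaluated on 19 multimodal benchmarks spanning language reasoning, image generation and editing, video understanding, and speech recognition and synthesis; the abstract reports 87.6 on GSM8K.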
Outperforms existing open-source unified models across multiple tasks.
Demonstrates the effectiveness of masked diffusion as a universal modeling paradigm.
Abstract
We present Dynin-Omni, the first masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture. Unlike autoregressive unified models that serialize heterogeneous modalities, or compositional unified models that require orchestration with external modality-specific decoders, Dynin-Omni natively formulates omnimodal modeling as masked diffusion over a shared discrete token space, enabling iterative refinement under bidirectional context. Dynin-Omni adopts a multi-stage training strategy with model-merging-based modality expansion and omnimodal alignment. We evaluate Dynin-Omni across 19 multimodal benchmarks spanning language reasoning, image generation and editing, video understanding, and speech recognition and synthesis. Dynin-Omni achieves 87.6 on GSM8K, 1733.6 on…
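To make "masked diffusion over a shared discrete token space" concrete, here is a minimal sketch of iterative unmasking. It is an illustration under stated assumptions, not Dynin-Omni's implementation: `toy_denoiser` is a hypothetical stand-in that returns random tokens and confidences, `MASK` is a hypothetical mask id, and the confidence-based schedule is a common choice in masked generative models that the abstract does not confirm for this paper.

```python
import random

MASK = -1  # hypothetical id for the mask token (assumption, not from the paper)


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for a bidirectional network (assumption).

    A real model would condition on every position at once; this toy
    version proposes a random token and a random confidence for each
    masked position and leaves committed tokens unchanged.
    """
    preds = []
    for t in tokens:
        if t != MASK:
            preds.append((t, 1.0))
        else:
            preds.append((rng.randrange(vocab_size), rng.random()))
    return preds


def masked_diffusion_decode(length, vocab_size, steps, seed=0):
    """Start fully masked, then commit the most confident tokens each step."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_denoiser(tokens, vocab_size, rng)
        # Spread the remaining masked positions evenly over the remaining
        # steps (ceiling division), so the last step unmasks everything left.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)
        # Commit the k masked positions with the highest confidence.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    # One vocabulary of 100 ids stands in for a token space shared by modalities.
    print(masked_diffusion_decode(length=8, vocab_size=100, steps=4))
```

Unlike left-to-right autoregressive decoding, each step here can fill positions anywhere in the sequence, which is what allows refinement under bidirectional context. In a real system the shared vocabulary would hold text, image, and speech tokens, and the denoiser would be the trained network.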
