Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning

Dongjie Cheng; Yongqi Li; Zhixin Ma; Hongru Cai; Yupeng Hu; Wenjie Wang; Liqiang Nie; Wenjie Li

arXiv:2601.09536·cs.AI·April 21, 2026

TL;DR

This paper introduces Omni-R1, a unified generative framework for multimodal reasoning that generates intermediate images during reasoning to handle diverse tasks, along with a zero-shot variant, Omni-R1-Zero, that does not require multimodal annotations.

Contribution

The paper proposes a novel unified generative paradigm for multimodal reasoning and instantiates it with Omni-R1, including a zero-shot version that bypasses multimodal annotations.

Findings

01. Omni-R1 achieves unified reasoning across various multimodal tasks.

02. Omni-R1-Zero matches or surpasses Omni-R1 without multimodal annotations.

03. Empirical results demonstrate the effectiveness of the proposed approach.

Abstract

Multimodal Large Language Models (MLLMs) are making significant progress in multimodal reasoning. Early approaches focus on pure text-based reasoning. More recent studies have incorporated multimodal information into the reasoning steps; however, they often follow a single task-specific reasoning pattern, which limits their generalizability across various multimodal tasks. In fact, there are numerous multimodal tasks requiring diverse reasoning skills, such as zooming in on a specific region or marking an object within an image. To address this, we propose unified generative multimodal reasoning, which unifies diverse multimodal reasoning skills by generating intermediate images during the reasoning process. We instantiate this paradigm with Omni-R1, a two-stage SFT+RL framework featuring perception alignment loss and perception reward, thereby enabling functional image generation.…

Code & Models

Repositories

modalitydance/Omni-R1 (GitHub)

Datasets

ModalityDance/Omni-Bench
dataset · 33 dl
