Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
Dongjie Cheng, Yongqi Li, Zhixin Ma, Hongru Cai, Yupeng Hu, Wenjie Wang, Liqiang Nie, Wenjie Li

TL;DR
This paper introduces Omni-R1, a unified generative framework for multimodal reasoning that generates intermediate images to handle diverse tasks, along with Omni-R1-Zero, a variant that requires no multimodal annotations.
Contribution
The paper proposes a unified generative paradigm for multimodal reasoning and instantiates it with Omni-R1, together with Omni-R1-Zero, a variant that bypasses the need for multimodal annotations.
Findings
Omni-R1 achieves unified reasoning across various multimodal tasks.
Omni-R1-Zero matches or surpasses Omni-R1 without multimodal annotations.
Empirical results demonstrate the effectiveness of the proposed approach.
Abstract
Multimodal Large Language Models (MLLMs) are making significant progress in multimodal reasoning. Early approaches focus on pure text-based reasoning. More recent studies have incorporated multimodal information into the reasoning steps; however, they often follow a single task-specific reasoning pattern, which limits their generalizability across various multimodal tasks. In fact, there are numerous multimodal tasks requiring diverse reasoning skills, such as zooming in on a specific region or marking an object within an image. To address this, we propose unified generative multimodal reasoning, which unifies diverse multimodal reasoning skills by generating intermediate images during the reasoning process. We instantiate this paradigm with Omni-R1, a two-stage SFT+RL framework featuring perception alignment loss and perception reward, thereby enabling functional image generation.…
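To make the idea of "generating intermediate images during the reasoning process" concrete, the following is a minimal sketch of what an interleaved reasoning trace could look like as a data structure. It is not the authors' implementation: the names `TextStep`, `ImageStep`, `zoom`, `mark`, and `build_trace` are hypothetical, images are stood in for by small integer grids, and the deterministic crop and box-drawing helpers only imitate the two skills the abstract names (zooming in on a region, marking an object). In the paradigm described above, a model would generate such images itself rather than compute them.

```python
from dataclasses import dataclass
from typing import List, Union

Image = List[List[int]]  # assumption: a tiny grayscale grid stands in for a real image


@dataclass
class TextStep:
    """A textual reasoning step."""
    text: str


@dataclass
class ImageStep:
    """An intermediate image in the trace (hypothetical structure)."""
    pixels: Image
    purpose: str  # e.g. "zoom" or "mark"


Step = Union[TextStep, ImageStep]


def zoom(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Stand-in for a generated zoomed-in view: crop the given region."""
    return [row[left:left + width] for row in image[top:top + height]]


def mark(image: Image, top: int, left: int, height: int, width: int,
         value: int = -1) -> Image:
    """Stand-in for a generated marked view: outline a box on a copy."""
    out = [row[:] for row in image]  # copy so the input image is not modified
    for r in range(top, top + height):
        for c in range(left, left + width):
            on_border = r in (top, top + height - 1) or c in (left, left + width - 1)
            if on_border:
                out[r][c] = value
    return out


def build_trace(image: Image) -> List[Step]:
    """Assemble a toy trace that interleaves text steps with image steps."""
    return [
        TextStep("The region of interest is small; look closer at the upper left."),
        ImageStep(zoom(image, 0, 0, 2, 2), purpose="zoom"),
        TextStep("Mark the candidate object in the full image."),
        ImageStep(mark(image, 1, 1, 3, 3), purpose="mark"),
        TextStep("Give the final answer based on the views above."),
    ]
```

The point of the sketch is only that a single trace format can carry different visual operations as ordinary steps, which is the sense in which one generative interface can cover several task-specific reasoning patterns.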
