Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

Hao Zhong; Muzhi Zhu; Zongze Du; Zheng Huang; Canyu Zhao; Mingyu Liu; Wen Wang; Hao Chen; Chunhua Shen

arXiv:2505.20256·cs.CV·May 27, 2025

TL;DR

This paper introduces Omni-R1, a reinforcement learning framework with a two-system architecture for efficient and accurate omnimodal reasoning in video-audio tasks, addressing the trade-off between temporal coverage and pixel-level detail.

Contribution

It presents a novel RL-based approach for joint keyframe selection and pixel grounding, enabling scalable and generalizable omnimodal reasoning models.

Findings

01

Outperforms strong supervised baselines and state-of-the-art models on the RefAVS and REVOS benchmarks.

02

Enhances out-of-domain generalization and reduces multimodal hallucination.

03

Requires only one epoch of RL training on small task splits.

Abstract

Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because "optimal" keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System,…
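The abstract names Group Relative Policy Optimization as the RL backbone. As a rough illustration of the group-relative idea only, the sketch below normalizes each sampled rollout's reward against the mean and standard deviation of its own group. It is not taken from the Omni-R1 code: the function name, the example rewards, the `eps` stabilizer, and the choice of population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to the other rollouts in its group.

    Hypothetical helper: subtracts the group mean and divides by the
    group's population standard deviation (plus eps to avoid dividing
    by zero when every reward in the group is identical).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up scalar rewards for four rollouts sampled for the same input.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above their group average receive a positive advantage and those below receive a negative one, so no separate learned value function is needed to form the baseline.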

Code & Models

Repositories

aim-uofa/omni-r1
PyTorch · Official

Models

🤗 Haoz0206/Omni-R1
model · 72 downloads · 23 likes

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Semantic Web and Ontologies