Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning

Zhicheng Zhang; Wentao Gu; Weicheng Wang; Yongjie Zhu; Wenyu Qin; Meng Wang; Pengfei Wan; Jufeng Yang

arXiv:2604.24191·cs.CV·April 28, 2026

TL;DR

Omni-o3 formulates audio-visual reasoning as a dynamic recursive search driven by a deep nested deduction policy, so that branches share reasoning prefixes and exploration becomes more efficient on complex tasks.

Contribution

The paper presents a recursive, search-based reasoning paradigm together with a two-stage training process for multimodal understanding.

Findings

01. Achieves competitive results on 11 benchmarks.

02. Enables advanced audio-visual and multi-turn reasoning.

03. Shares reasoning prefixes to improve exploration efficiency.

Abstract

Omnimodal understanding entails a massive, highly redundant search space of cross-modal interactions, demanding focused and deliberative reasoning. Current reasoning paradigms rely on either sequential step-by-step generation or parallel sample-by-sample rollouts, leading to isolated reasoning trajectories. This inability to share promising intermediate paths severely limits exploration efficiency and causes compounding errors in complex audio-visual tasks. To break this bottleneck, we introduce Omni-o3, a novel framework driven by a deep nested deduction policy. By formulating reasoning as a dynamic recursive search, Omni-o3 inherently shares reasoning prefixes across branches, enabling the iterative execution of four atomic cognitive actions: expansion, selection, simulation, and backpropagation. To empower this framework, we propose a robust two-stage training paradigm: (1)…
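
The four actions named in the abstract (expansion, selection, simulation, backpropagation) correspond to the phases of a generic Monte Carlo tree search. As a rough illustration only, and not the authors' implementation, the Python sketch below runs such a search on a toy problem in which every tree node stores a reasoning prefix shared by all of its descendants. The `Node` class, the UCB1 selection rule, the step vocabulary, and `toy_reward` are assumptions made for this sketch; the paper's actual policy, reward, and training procedure are not described in the text above.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple

Steps = Tuple[str, ...]


@dataclass
class Node:
    prefix: Steps                       # reasoning steps shared by all descendants
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0                  # sum of rewards backed up through this node


def select(node: Node, c: float = 1.4) -> Node:
    """Selection: descend with UCB1 (an assumed rule) until a leaf is reached."""
    while node.children:
        parent = node
        node = max(
            parent.children,
            key=lambda ch: float("inf") if ch.visits == 0
            else ch.value / ch.visits
            + c * math.sqrt(math.log(parent.visits) / ch.visits),
        )
    return node


def expand(node: Node, actions: Sequence[str], max_depth: int) -> Node:
    """Expansion: add one child per action; each child extends the shared prefix."""
    if len(node.prefix) >= max_depth:
        return node                     # terminal node, nothing to expand
    node.children = [Node(prefix=node.prefix + (a,), parent=node) for a in actions]
    return node.children[0]


def simulate(node: Node, actions: Sequence[str],
             reward_fn: Callable[[Steps], float],
             max_depth: int, rng: random.Random) -> float:
    """Simulation: complete the prefix with random steps and score the result."""
    steps = list(node.prefix)
    while len(steps) < max_depth:
        steps.append(rng.choice(actions))
    return reward_fn(tuple(steps))


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Backpropagation: update statistics on every node along the shared prefix."""
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent


def search(actions: Sequence[str], reward_fn: Callable[[Steps], float],
           max_depth: int, iterations: int, seed: int = 0) -> Tuple[Steps, Node]:
    rng = random.Random(seed)
    root = Node(prefix=())
    for _ in range(iterations):
        leaf = select(root)
        child = expand(leaf, actions, max_depth)
        reward = simulate(child, actions, reward_fn, max_depth, rng)
        backpropagate(child, reward)
    # Read out the most-visited path from the root.
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
    return node.prefix, root


# Hypothetical toy task: reward is the fraction of steps matching a target order.
TARGET: Steps = ("listen", "look", "answer")


def toy_reward(steps: Steps) -> float:
    return sum(s == t for s, t in zip(steps, TARGET)) / len(TARGET)


best, root = search(TARGET, toy_reward, max_depth=3, iterations=200)
print(best, root.visits)
```

Because each child is built as `node.prefix + (a,)`, sibling branches reuse their parent's steps and statistics instead of regenerating them, which is the kind of prefix sharing that independent sample-by-sample rollouts lack.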
