OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

Yiduo Jia; Muzhi Zhu; Hao Zhong; Mingyu Liu; Yuling Xi; Hao Chen; Bin Qin; Yongjie Yang; Zhenbo Luo; Chunhua Shen

arXiv:2604.08209·cs.CV·April 10, 2026


TL;DR

OmniJigsaw is a self-supervised framework that improves omni-modal reasoning by using a novel temporal reordering proxy task with strategic modality orchestration and data filtering.

Contribution

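It introduces a generic self-supervised approach with three modality orchestration strategies (Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking) and a two-stage coarse-to-fine data filtering pipeline for training omni-modal models.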

Findings

01. Significant improvements on 15 benchmarks across video, audio, and reasoning tasks.

02. Mitigates the bi-modal shortcut phenomenon with clip-level modality masking.

03. Outperforms existing methods in cross-modal understanding and reasoning.

Abstract

To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint…
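The temporal reordering proxy task described in the abstract can be illustrated with a toy puzzle generator. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Clip` class, the `make_puzzle` function, the string stand-ins for video and audio content, and the rule that clip-level masking keeps exactly one randomly chosen modality per clip are all hypothetical choices made for illustration. The abstract does not specify how clips are cut, how modalities are masked, or how answers are scored.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Clip:
    index: int             # true chronological position, used only to build the label
    video: Optional[str]   # stand-in for visual frames; None means masked
    audio: Optional[str]   # stand-in for the audio track; None means masked


def make_puzzle(
    num_clips: int,
    rng: random.Random,
    mask_clip_level: bool = False,
) -> Tuple[List[Clip], List[int]]:
    """Return (shuffled clips, answer); answer lists shuffled slots in chronological order."""
    clips = [Clip(i, f"video_{i}", f"audio_{i}") for i in range(num_clips)]

    if mask_clip_level:
        # Assumed masking rule (not from the source): each clip keeps
        # exactly one modality, chosen uniformly at random.
        for clip in clips:
            if rng.random() < 0.5:
                clip.audio = None
            else:
                clip.video = None

    shuffled = clips[:]
    rng.shuffle(shuffled)

    # answer[k] is the slot in `shuffled` that holds the k-th clip in time.
    answer = sorted(range(num_clips), key=lambda slot: shuffled[slot].index)
    return shuffled, answer


# Usage: reading the shuffled clips in the order given by `answer`
# restores the original chronology.
shuffled, answer = make_puzzle(6, random.Random(0), mask_clip_level=True)
restored = [shuffled[slot].index for slot in answer]
assert restored == list(range(6))
```

Because the label is derived from the shuffle itself, puzzles of this kind need no human annotation, which is consistent with the abstract's framing of the task as self-supervised. Under the assumed masking rule, no single clip exposes both modalities, so ordering the sequence requires relating audio-only clips to video-only ones.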


Code & Models

Repositories

aim-uofa/OmniJigsaw (GitHub)
