TL;DR
OmniJigsaw is a self-supervised framework that improves omni-modal reasoning through a temporal reordering proxy task, combined with strategic orchestration of the visual and audio modalities and data filtering.
Contribution
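It introduces a generic self-supervised approach with three modality strategies (Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking) and a two-stage coarse-to-fine data filtering pipeline for training omni-modal models.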
Findings
Significant improvements on 15 benchmarks across video, audio, and reasoning tasks.
Mitigates the bi-modal shortcut phenomenon with clip-level modality masking.
Outperforms existing methods in cross-modal understanding and reasoning.
Abstract
To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint…
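To make the temporal reordering task concrete, here is a minimal Python sketch of how one such puzzle might be built. It is not the authors' implementation. The `Clip` class, the `make_puzzle` function, the strategy names `"joint"`, `"sample"`, and `"clip"`, and the 50/50 masking choice are all hypothetical, and the way each strategy is interpreted here (keep both modalities, keep one modality for the whole sample, keep one modality per clip) is an assumption drawn only from the strategy names in the abstract.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Clip:
    index: int            # original chronological position
    video: Optional[str]  # placeholder for visual data (None = masked)
    audio: Optional[str]  # placeholder for audio data (None = masked)


def make_puzzle(num_clips: int, strategy: str = "joint",
                seed: int = 0) -> Tuple[List[Clip], List[int]]:
    """Build one shuffled puzzle and its ground-truth answer.

    Strategy semantics are assumptions for illustration only:
      "joint"  - every clip keeps both video and audio
      "sample" - one modality is kept for the whole sample
      "clip"   - each clip independently keeps only one modality
    """
    rng = random.Random(seed)
    clips = [Clip(i, f"video_{i}", f"audio_{i}") for i in range(num_clips)]

    if strategy == "sample":
        keep = rng.choice(["video", "audio"])
        for c in clips:
            if keep == "video":
                c.audio = None
            else:
                c.video = None
    elif strategy == "clip":
        for c in clips:
            if rng.random() < 0.5:
                c.audio = None
            else:
                c.video = None
    elif strategy != "joint":
        raise ValueError(f"unknown strategy: {strategy}")

    # Shuffle until the order differs from the chronological one.
    order = list(range(num_clips))
    while num_clips > 1 and order == sorted(order):
        rng.shuffle(order)
    shuffled = [clips[i] for i in order]

    # Answer: shuffled positions listed in chronological order.
    answer = sorted(range(num_clips), key=lambda pos: shuffled[pos].index)
    return shuffled, answer


shuffled, answer = make_puzzle(6, strategy="clip", seed=1)
restored = [shuffled[pos].index for pos in answer]
```

Reading the shuffled clips in the order given by `answer` restores the original timeline, so `restored` equals `[0, 1, 2, 3, 4, 5]`. Because the answer comes from the shuffle itself, no human annotation is needed, which is what makes the task self-supervised.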
