Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

Che Liu, Lichao Ma, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Xuerui Yang, and Fei Tian

arXiv:2605.12034 · cs.MM · May 15, 2026

TL;DR

This paper introduces OmniClean, a visually debiased evaluation view built from nine omni-modal benchmarks, and shows that OmniBoost, a staged post-training recipe ending in SFT on self-distilled data, improves model performance when visually solvable queries are filtered out of the evaluation.

Contribution

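The paper presents OmniClean, a cleaned evaluation view of 8,551 queries retained from 16,968 audited queries across nine benchmarks, and OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data.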

Findings

01. OmniClean removes visually solvable queries, so scores reflect audio-visual-language evidence integration rather than visual shortcuts.

02. Self-distilled omni-query supervision improves model performance.

03. Small models benefit from staged post-training with self-distillation.

Abstract

Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and…
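The filtering rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Query` record, the `debias` function, and the `min_retained` threshold are all hypothetical, and the sketch assumes that "visually solvable" means a visual-only probe answered the query correctly, that "filtering is undefined" means a subset has no usable visual-only result, and that "unstable" means too few queries would survive.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Query:
    """Hypothetical record for one audited benchmark query."""
    benchmark: str
    query_id: str
    # True/False: whether a visual-only probe answered correctly.
    # None: the visual-only probe is undefined for this query (assumption).
    visual_only_correct: Optional[bool]


def debias(queries, min_retained=2):
    """Drop visually solvable queries, benchmark by benchmark.

    A benchmark's full subset is retained when filtering is undefined
    (any probe result is None) or when fewer than `min_retained` queries
    would survive. `min_retained` is an illustrative stand-in for the
    paper's stability criterion, which the abstract does not specify.
    """
    by_benchmark = defaultdict(list)
    for q in queries:
        by_benchmark[q.benchmark].append(q)

    retained = []
    for subset in by_benchmark.values():
        if any(q.visual_only_correct is None for q in subset):
            retained.extend(subset)  # filtering undefined: keep everything
            continue
        kept = [q for q in subset if not q.visual_only_correct]
        if len(kept) < min_retained:
            retained.extend(subset)  # too few survivors: keep everything
        else:
            retained.extend(kept)
    return retained
```

For scale, the counts reported in the abstract (8,551 retained of 16,968 audited) correspond to a retained fraction of roughly 50.4%.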

Code & Models

Repositories

https://cheliu-computation.github.io/omni