OmniBench: Towards The Future of Universal Omni-Language Models

Yizhi Li; Yinghao Ma; Ge Zhang; Ruibin Yuan; Kang Zhu; Hangyu Guo; Yiming Liang; Jiaheng Liu; Zekun Wang; Jian Yang; Siwei Wu; Xingwei Qu; Jinjie Shi; Xinyue Zhang; Zhenzhu Yang; Yidan Wen; Yanghai Wang; Shihao Li; Zhaoxiang Zhang; Zachary Liu; Emmanouil Benetos; Wenhao Huang; Chenghua Lin

arXiv:2409.15272·cs.CL·January 1, 2026

TL;DR

OmniBench is a comprehensive benchmark for evaluating models' ability to process and reason across visual, acoustic, and textual modalities simultaneously. It reveals the limitations of current models and is intended to guide future research in tri-modal AI systems.

Contribution

The paper presents OmniBench, a new benchmark with high-quality annotations for tri-modal understanding, and curates OmniInstruct, an instruction-tuning dataset intended to improve omni-language models.

Findings

01. Open-source OLMs show limited reasoning in tri-modal tasks.

02. Most models perform below 50% accuracy on tri-modal benchmarks.

03. Existing training paradigms often overlook multi-modal context construction.

Abstract

Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains underexplored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as the omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) open-source OLMs exhibit critical limitations in instruction-following and reasoning…

Code & Models

Repositories

multimodal-art-projection/omnibench (Official)

Taxonomy

Topics: Transportation and Mobility Innovations

Methods: Focus