OmniBench: Towards The Future of Universal Omni-Language Models
Yizhi Li, Yinghao Ma, Ge Zhang, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Zekun Wang, Jian Yang, Siwei Wu, Xingwei Qu, Jinjie Shi, Xinyue Zhang, Zhenzhu Yang, Yidan Wen, Yanghai Wang, Shihao Li, Zhaoxiang Zhang, Zachary Liu, Emmanouil Benetos, Wenhao Huang

TL;DR
OmniBench introduces a comprehensive benchmark for evaluating models' ability to process and reason across visual, acoustic, and textual modalities simultaneously, revealing current limitations and guiding future research in tri-modal AI systems.
Contribution
The paper presents OmniBench, a new benchmark with high-quality annotations for tri-modal understanding, and curates OmniInstruct, an instruction tuning dataset to improve omni-language models.
Findings
Open-source OLMs show critical limitations in instruction-following and reasoning on tri-modal tasks.
Most models perform below 50% accuracy on tri-modal benchmarks.
Existing training paradigms often overlook multi-modal context construction.
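To make the "below 50% accuracy" finding concrete, here is a minimal sketch of how accuracy on a multiple-choice benchmark might be scored and compared with a random-guess baseline. It is an illustration only, not the authors' evaluation code: the four-option format, the `accuracy` helper, and the toy answers and predictions are all assumptions and are not taken from the page above.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of items whose predicted option matches the reference."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty set of items")
    # Compare option letters case-insensitively, ignoring surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical toy data: reference options and one model's chosen options.
answers = ["A", "C", "B", "D", "A", "B", "C", "D"]
predictions = ["A", "C", "D", "D", "B", "A", "B", "A"]

NUM_OPTIONS = 4  # assumed number of choices per question
chance = 1 / NUM_OPTIONS  # expected accuracy of uniform random guessing

score = accuracy(predictions, answers)
print(f"accuracy={score:.3f}, chance={chance:.3f}, below 50%: {score < 0.5}")
```

Under this assumed four-option setup, a score under 50% can still sit above the 25% chance level, so the gap to chance matters as much as the raw number.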
Abstract
Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains underexplored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) open-source OLMs exhibit critical limitations in instruction-following and reasoning…
Taxonomy
Topics: Transportation and Mobility Innovations
Methods: Focus
