OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities

Lichang Chen; Hexiang Hu; Mingda Zhang; Yiwen Chen; Zifeng Wang; Yandong Li; Pranav Shyam; Tianyi Zhou; Heng Huang; Ming-Hsuan Yang; Boqing Gong

arXiv:2410.12219·cs.AI·October 17, 2024

Open Access

TL;DR

OmnixR is a comprehensive evaluation suite designed to benchmark state-of-the-art omni-modality language models, assessing their reasoning capabilities across multiple modalities like text, vision, and audio in synthetic and real-world scenarios.

Contribution

The paper introduces OmnixR, which it presents as the first benchmark to evaluate reasoning jointly across diverse modalities, addressing the restriction of existing benchmarks to single- or dual-modality tasks.

Findings

01. State-of-the-art omni-modality language models (OLMs) struggle with multi-modal reasoning tasks.

02. OmnixR reveals significant gaps in current models' cross-modal understanding.

03. Analysis highlights challenges in omni-modal AI alignment.

Abstract

We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models (OLMs), such as GPT-4o and Gemini. Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges. In particular, a user message often consists of multiple modalities, so OLMs have to establish holistic understanding and reasoning across modalities to accomplish the task. Existing benchmarks are limited to single-modality or dual-modality tasks, overlooking comprehensive multi-modal assessments of model reasoning. To address this, OmnixR offers two evaluation variants: (1) a synthetic subset: a synthetic dataset generated automatically by translating text into multiple modalities: audio, images, video, and hybrids (Omnify). (2) a realistic subset: a real-world dataset, manually curated and annotated by experts, for evaluating cross-modal…
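The synthetic subset is described as text questions translated into other modalities. The sketch below is only an illustration of that idea, not the authors' Omnify pipeline: the names `OmnifiedSample`, `placeholder_renderer`, and `omnify` are hypothetical, and the renderers are stand-ins that tag the text rather than producing real audio, images, or video. It shows one way to keep the gold answer fixed while the same question is carried by different modalities.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: a function that turns a text question into raw bytes
# for some modality (a real system would call a TTS or image renderer here).
Renderer = Callable[[str], bytes]


@dataclass(frozen=True)
class OmnifiedSample:
    source_text: str  # original text question
    modality: str     # e.g. "audio", "image", "video"
    payload: bytes    # rendered content (placeholder bytes in this sketch)
    answer: str       # gold answer, carried over unchanged


def placeholder_renderer(modality: str) -> Renderer:
    """Assumed stand-in for a real text-to-<modality> converter."""
    def render(text: str) -> bytes:
        # No real rendering happens; the text is only tagged with the modality.
        return f"[{modality}] {text}".encode("utf-8")
    return render


def omnify(text: str, answer: str,
           renderers: Dict[str, Renderer]) -> List[OmnifiedSample]:
    """Produce one sample per modality from a single text question."""
    return [
        OmnifiedSample(text, modality, render(text), answer)
        for modality, render in renderers.items()
    ]


renderers = {m: placeholder_renderer(m) for m in ("audio", "image", "video")}
samples = omnify("What is 2 + 3?", "5", renderers)
for sample in samples:
    print(sample.modality, sample.payload)
```

Because every variant shares the same `answer`, a model's accuracy on each modality can be compared against its accuracy on the original text question.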

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems