OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs
Yiman Zhang, Ziheng Luo, Qiangyu Yan, Wei He, Borui Jiang, Xinghao Chen, Kai Han

TL;DR
OmniEval is a comprehensive benchmark designed to evaluate models that process visual, auditory, and textual data, emphasizing multimodal collaboration, diversity, and detailed task types to advance omni-modal understanding.
Contribution
We introduce OmniEval, a new benchmark with diverse, multi-modal tasks and datasets to evaluate omni-modal models comprehensively.
Findings
Models show varied performance across different modalities.
OmniEval reveals strengths and weaknesses of current omni-modal models.
The benchmark facilitates future development of integrated multi-modal AI systems.
Abstract
In this paper, we introduce OmniEval, a benchmark for evaluating omni-modality models like MiniCPM-O 2.6, which encompasses visual, auditory, and textual inputs. Compared with existing benchmarks, our OmniEval has several distinctive features: (i) Full-modal collaboration: We design evaluation tasks that highlight the strong coupling between audio and video, requiring models to effectively leverage the collaborative perception of all modalities; (ii) Diversity of videos: OmniEval includes 810 audio-visual synchronized videos, 285 Chinese videos and 525 English videos; (iii) Diversity and granularity of tasks: OmniEval contains 2617 question-answer pairs, comprising 1412 open-ended questions and 1205 multiple-choice questions. These questions are divided into 3 major task types and 12 sub-task types to achieve comprehensive evaluation. Among them, we introduce a more granular video…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This paper accurately identifies the "blind spot" in current omni-modal evaluation—namely, the lack of assessment for synergistic understanding. The paper's core, original concept is its "full-modal collaboration" evaluation philosophy. 2. The inclusion of bilingual (CN/EN) support and the fine-grained "Grounding" (temporal localization) task effectively fills gaps left by existing benchmarks. 3. The hybrid pipeline described in Section 3.3 is excellent. The "Judgment" step (to ensure multi-
1. The benchmark contains a total of 2,617 Q&A pairs derived from 810 videos. When these are divided among 3 major categories, 12 sub-task types, and 2 languages, the number of samples for each fine-grained task (e.g., "Grounding" in Chinese) may be very small. This raises concerns about the statistical significance of the evaluation results. For example, the "Grounding" task has only 342 pairs in total; further subdivision by language and format (OE/MC) may result in an insufficient sample size
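The sample-size concern above can be made concrete with a confidence interval on accuracy. The following is a minimal sketch, not part of the paper or the review: it uses the standard Wilson score interval, and the function name `wilson_interval`, the 50% accuracy, and the subset size of 40 are hypothetical values chosen for illustration. Only the figure of 342 Grounding pairs comes from the review.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical comparison: the same 50% accuracy on all 342 Grounding pairs
# versus on an assumed fine-grained subset of 40 pairs.
full_lo, full_hi = wilson_interval(171, 342)
sub_lo, sub_hi = wilson_interval(20, 40)
print(f"n=342: [{full_lo:.3f}, {full_hi:.3f}]")
print(f"n=40:  [{sub_lo:.3f}, {sub_hi:.3f}]")
```

The interval for the smaller subset is considerably wider, which is the practical meaning of the reviewer's point: differences between models on a small sub-task may fall inside the noise.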
1) Unlike vision-only or audio-text setups, OmniEval evaluates joint A+V+T reasoning, with both English and Chinese coverage, which is still underexplored. 2) The mix of OE (1,278) and MC (1,133), distributed across 12 sub-tasks, supports both generative analysis and standardized accuracy comparisons; the Grounding category (moment/time-span) is a good addition. 3) The adaptive timestamp tolerance and IoU≥0.5 criteria for OE grounding are explicit and easy to re-implement, which helps reproduci
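The IoU≥0.5 criterion mentioned above can be sketched as follows. This is an illustration under assumptions, not the authors' scoring code: spans are assumed to be `(start, end)` pairs in seconds, and the names `temporal_iou` and `span_is_correct` are hypothetical. The adaptive timestamp tolerance for single-moment answers is not reproduced, because this excerpt does not state its rule.

```python
def temporal_iou(pred, gold):
    """Intersection over union of two time spans given as (start, end) in seconds."""
    pred_start, pred_end = pred
    gold_start, gold_end = gold
    # Overlap is zero when the spans do not intersect.
    inter = max(0.0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - inter
    return inter / union if union > 0 else 0.0


def span_is_correct(pred, gold, threshold=0.5):
    """Count a predicted span as correct when its IoU reaches the threshold."""
    return temporal_iou(pred, gold) >= threshold


print(temporal_iou((10.0, 20.0), (12.0, 20.0)))     # 8 s overlap over a 10 s union
print(span_is_correct((10.0, 20.0), (15.0, 25.0)))  # 5 s overlap over a 15 s union
```

With a threshold of 0.5, the first prediction would count as correct and the second would not.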
1) The pipeline excludes low-speech videos (ASR subdensity < 0.5), which systematically under-samples silent, music-dominant, or non-verbal soundscapes. This may bias the benchmark towards text-anchored items and may under-stress purely audio-visual fusion. A short analysis of discarded vs kept videos (content type, duration, genre) would help to clarify the bias. 2) OE scoring uses an LLM-as-judge/extractor but the paper provides no agreement statistics (e.g., κ with CIs) or dual-judge disagre
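The agreement statistic the reviewer asks for (κ) could be computed along these lines. This is a minimal sketch of Cohen's kappa for two raters, assuming each rater (for example, the LLM judge and a human annotator) assigns one categorical label per open-ended answer. The function name `cohens_kappa` and the sample labels are hypothetical, and confidence intervals are not covered.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treated as full agreement
        # here by convention (an assumption, since kappa is undefined).
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = answer judged correct, 0 = judged incorrect.
llm_judge = [1, 1, 0, 0]
human = [1, 0, 0, 0]
print(cohens_kappa(llm_judge, human))
```

In this toy case the raters agree on 3 of 4 items, but half of that agreement is expected by chance, so kappa is lower than the raw agreement rate.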
1. The proposed benchmark spans text, video, and audio, making it reflective of real-world multimodal task needs. 2. The proposed benchmark includes both English and Chinese videos, posing additional challenges for omni-foundation models.
1. A detailed comparison of different MLLMs on 12 tasks is missing. 2. Qualitative comparison of different MLLMs on the proposed benchmark is missing. 3. The authors highlighted temporal grounding as a key feature of this benchmark, but I could not find how MLLMs perform on this task in the experiments.
- Bilingual Benchmark: OmniEval is a bilingual video understanding benchmark that includes both English and Chinese videos and questions, which is valuable for evaluating multilingual models. - Audio-Visual Grounding Task: OmniEval introduces the "Grounding" task, which is an important capability for audio-visual understanding.
- Missing Comparison with Existing Benchmarks: For the evaluation of audio-visual video understanding, there are already established benchmarks (e.g., AVUT, DailyOmni), which are not discussed or compared in the paper. - Limitations of the Data Generation Methodology: The method of using an LLM to generate questions based on video captions and audio subtitles, while cost-effective, has several critical limitations: - Based on empirical evidence, Q&A pairs generated by LLMs tend to be of limite
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
