SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

Ahmed Y. Radwan; Christos Emmanouilidis; Hina Tabassum; Deval Pandya; Shaina Raza

arXiv:2601.21666·cs.AI·January 30, 2026

Open Access · 1 dataset

TL;DR

SONIC-O1 is a new benchmark designed to evaluate multimodal large language models on real-world audio-video understanding tasks, highlighting current limitations and disparities across models and demographics.

Contribution

This paper introduces SONIC-O1, a comprehensive benchmark for assessing MLLMs on sequential audio-video data in real-world scenarios, filling a critical evaluation gap.

Findings

01. Significant performance gap in temporal localization between model types.
02. Model performance varies across demographic groups.
03. Current models show limitations in real-world audio-video understanding.

Abstract

Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6%…
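The abstract names MCQ answering and temporal localization as evaluation tasks but, as shown here, does not say how either is scored. The sketch below illustrates two common choices, plain accuracy for MCQs and temporal intersection-over-union (IoU) for predicted time spans. Both function names and the choice of IoU are assumptions made for illustration, not the benchmark's documented scoring.

```python
# Hypothetical scoring helpers; not taken from the SONIC-O1 paper or code.

def mcq_accuracy(predictions, answers):
    """Fraction of multiple-choice predictions that match the answer key."""
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def temporal_iou(pred, gold):
    """IoU of two (start, end) intervals, both given in seconds."""
    (pred_start, pred_end), (gold_start, gold_end) = pred, gold
    # Overlap is zero when the intervals do not intersect.
    intersection = max(0.0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    print(mcq_accuracy(["A", "B", "C"], ["A", "B", "D"]))
    print(temporal_iou((10.0, 20.0), (15.0, 25.0)))
```

Under this assumed metric, a prediction that overlaps half of a 10-second reference span while running 5 seconds past it scores one third, which shows why localization can lag well behind MCQ accuracy even when a model picks roughly the right moment.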

Code & Models

Datasets

vector-institute/sonic-o1
dataset · 269 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning