TL;DR
Daily-Omni is an audio-visual reasoning benchmark built from real-world videos. It emphasizes temporal alignment across modalities and shows that current models struggle with synchronization tasks.
Contribution
The paper presents a new benchmark for cross-modal temporal reasoning, a semi-automatic annotation pipeline, and extensive evaluation of models' ability to handle audio-visual alignment.
Findings
Many models struggle with alignment-critical questions.
Explicit temporal signals improve model performance.
End-to-end models still face challenges in synchronization tasks.
Abstract
Recent Multimodal Large Language Models (MLLMs) achieve promising performance on visual and audio benchmarks independently. However, the ability of these models to process cross-modal information synchronously remains largely unexplored. We introduce Daily-Omni, a multiple-choice Audio-Visual QA benchmark featuring 684 real-world videos and 1,197 questions spanning 6 task families that explicitly require cross-modal temporal reasoning. To support scalable benchmark construction, we develop a semi-automatic pipeline for annotation, cross-modal consistency refinement, temporal alignment elicitation, and text-only leakage filtering, followed by human verification. We further provide a diagnostic evaluation suite and extensively evaluate 24 foundation models under 37 model-modality settings (Audio+Video / Audio-only / Video-only / Text-only). Finally, we include a training-free modular…
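As a rough illustration of the evaluation described in the abstract, the sketch below scores multiple-choice answers separately for each modality setting. It is not the authors' evaluation code: the record layout (`setting`, `predicted`, `answer`) and the function name `accuracy_by_setting` are hypothetical, and the sample records are made up for the example.

```python
from collections import defaultdict


def accuracy_by_setting(records):
    """Return multiple-choice accuracy per modality setting.

    `records` is an iterable of dicts with the hypothetical keys
    'setting' (e.g. "Audio+Video"), 'predicted' and 'answer'
    (option letters such as "A".."D").
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        setting = record["setting"]
        total[setting] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        if record["predicted"].strip().upper() == record["answer"].strip().upper():
            correct[setting] += 1
    return {setting: correct[setting] / total[setting] for setting in total}


# Made-up predictions, for illustration only.
records = [
    {"setting": "Audio+Video", "predicted": "B", "answer": "B"},
    {"setting": "Audio+Video", "predicted": "a", "answer": "C"},
    {"setting": "Video-only", "predicted": "C", "answer": "C"},
    {"setting": "Text-only", "predicted": "D", "answer": "A"},
]
scores = accuracy_by_setting(records)
for setting, score in scores.items():
    print(f"{setting}: {score:.2f}")
```

Reporting accuracy per setting rather than as a single number makes it possible to compare what a model gains from each modality on the same questions.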
Peer Reviews
Decision·Submitted to ICLR 2026
1. The manuscript clearly identifies the problem and presents a well-defined motivation.
2. The proposed Daily-Omni benchmark contributes to advancing research in audio-visual reasoning.
3. The Daily-Omni QA Generation framework shows strong potential for extensibility and future development.
4. The writing is clear and well-organized, making the paper easy to read and understand.
1. It is recommended to discuss the differences between this work and related studies such as *MMAU*, *AURA*, and *OmniVideoBench*.
2. The generalization experiments for Daily-Omni are insufficient. It remains unclear whether the model demonstrates a truly generalizable multimodal reasoning ability or if its effectiveness is limited to the constructed dataset.
3. The validity of the results may largely stem from the inherent capabilities of the large models used. How are the spatio-temporal ass
- The paper is well-written and easy to understand, the designed pipeline is reasonably structured, and the performance achieved by the proposed Daily-Omni-Agent reflects the current limitations of existing MLLMs.
- The comparative analysis across different models validates the importance of audio for video comprehension, thereby underscoring the necessity of developing an appropriate benchmark.
- I noticed that an earlier work, AVUT [1], is highly relevant to the research presented in this paper. The authors should provide a comparative analysis in the manuscript to clearly distinguish the contributions of this work from the prior study.
- This benchmark's reliance on data drawn solely from existing datasets is problematic for two reasons. 1) It potentially undermines the validity of the test, as the data may have been seen or is out of its original context. 2) It bypasses the critical,
1. Compared with WorldSense, this paper introduces a new benchmark dataset, an automated QA generation pipeline, and a training-free agent. Although the improvements are incremental, they hold practical significance within the field of multimodal research.
2. The proposed Daily-Omni benchmark covers diverse daily-life scenarios from multiple sources, including music, speech, and various environmental sound events. It also provides a complementary QA generation pipeline, offering a useful tool fo
1. The paper presents a multi-stage data annotation and QA construction process based on Gemini 2.0 Flash and DeepSeek-R1, which, while complete, functions more as a systematic workflow connected primarily through prompt engineering. (1) Although each 30-second video is divided into three 10-second segments and each 60-second video into three 20-second segments, the entire process still relies heavily on Gemini 2.0 Flash’s interpretation of these clips. The method for aligning visual and audio
