Towards Event-oriented Long Video Understanding
Yifan Du, Kun Zhou, Yuqi Huo, Yifan Li, Wayne Xin Zhao, Haoyu Lu, Zijia Zhao, Bingning Wang, Weipeng Chen, Ji-Rong Wen

TL;DR
This paper introduces Event-Bench, a comprehensive long video understanding benchmark focused on events, and proposes Video Instruction Merging (VIM), a method that significantly improves multimodal large language models' performance on event comprehension tasks.
Contribution
The paper presents Event-Bench, a new event-oriented long video benchmark, and VIM, a cost-effective method to enhance video MLLMs with event-rich instructions, outperforming existing models.
Findings
GPT-4o achieves 53.33% accuracy on Event-Bench.
VIM surpasses both state-of-the-art open-source models and GPT-4V on Event-Bench.
Event-Bench enables comprehensive evaluation of video event understanding.
Abstract
With the rapid development of video Multimodal Large Language Models (MLLMs), numerous benchmarks have been proposed to assess their video understanding capability. However, due to the lack of rich events in the videos, these datasets may suffer from the short-cut bias that the answers can be deduced from a few frames, without the need to watch the entire video. To address this issue, we introduce Event-Bench, an event-oriented long video understanding benchmark built on existing datasets and human annotations. Event-Bench includes six event-related tasks and 2,190 test instances to comprehensively evaluate video event understanding ability. Additionally, we propose Video Instruction Merging (VIM), a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions, addressing the scarcity of human-annotated, event-intensive data. Extensive experiments…
Taxonomy
Topics: Video Analysis and Summarization · Medical Imaging Techniques and Applications · Advanced Vision and Imaging
