Towards Event-oriented Long Video Understanding

Yifan Du; Kun Zhou; Yuqi Huo; Yifan Li; Wayne Xin Zhao; Haoyu Lu; Zijia Zhao; Bingning Wang; Weipeng Chen; Ji-Rong Wen

arXiv:2406.14129·cs.CV·June 21, 2024

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper introduces Event-Bench, a comprehensive long video understanding benchmark focused on events, and proposes Video Instruction Merging (VIM), a method that significantly improves multimodal large language models' performance on event comprehension tasks.

Contribution

The paper presents Event-Bench, a new event-oriented long video benchmark, and VIM, a cost-effective method that enhances video MLLMs with event-rich instructions and outperforms state-of-the-art open-source models and GPT-4V.

Findings

01

GPT-4o achieves 53.33% accuracy on Event-Bench.

02

VIM surpasses state-of-the-art open-source models and GPT-4V.

03

Event-Bench enables comprehensive evaluation of video event understanding.

Abstract

With the rapid development of video Multimodal Large Language Models (MLLMs), numerous benchmarks have been proposed to assess their video understanding capability. However, due to the lack of rich events in the videos, these datasets may suffer from a shortcut bias: the answers can be deduced from a few frames, without the need to watch the entire video. To address this issue, we introduce Event-Bench, an event-oriented long video understanding benchmark built on existing datasets and human annotations. Event-Bench includes six event-related tasks and 2,190 test instances to comprehensively evaluate video event understanding ability. Additionally, we propose Video Instruction Merging (VIM), a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions, addressing the scarcity of human-annotated, event-intensive data. Extensive experiments…
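As a rough illustration of how results on a benchmark with several tasks, such as the six in Event-Bench, might be summarized, the sketch below computes per-task and overall accuracy. It is not the authors' evaluation code. It assumes each test instance is a multiple-choice question with a single gold option letter, and it assumes the overall score is pooled across all instances; the function name `accuracy_by_task` and the task labels are hypothetical.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return (per-task accuracy, overall accuracy) as percentages.

    `records` is an iterable of (task, predicted_option, gold_option)
    tuples. This record layout is an assumption made for illustration,
    not the benchmark's actual file format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        # Compare option letters, ignoring case and surrounding whitespace.
        if predicted.strip().upper() == gold.strip().upper():
            correct[task] += 1

    per_task = {task: 100.0 * correct[task] / total[task] for task in total}
    n_instances = sum(total.values())
    # Pooled accuracy over all instances (assumed; a mean over tasks
    # would weight small tasks more heavily).
    overall = 100.0 * sum(correct.values()) / n_instances if n_instances else 0.0
    return per_task, overall


if __name__ == "__main__":
    # Hypothetical task labels and answers, for demonstration only.
    demo = [
        ("task_a", "A", "A"),
        ("task_a", "B", "C"),
        ("task_b", " d", "D"),
    ]
    per_task, overall = accuracy_by_task(demo)
    print(per_task, round(overall, 2))
```

Pooling over instances and averaging over tasks can give different overall numbers when tasks differ in size, so a reported figure should state which one it uses.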

Code & Models

Repositories

rucaibox/event-bench
Official

Datasets

RUCAIBox/Event-Bench
dataset · 113 dl

Taxonomy

Topics: Video Analysis and Summarization · Medical Imaging Techniques and Applications · Advanced Vision and Imaging