E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen

arXiv:2409.18111 · cs.CV · September 27, 2024

TL;DR

E.T. Bench is a comprehensive, large-scale benchmark designed to evaluate open-ended, event-level video understanding, revealing current model limitations and proposing a new baseline with improved fine-grained comprehension.

Contribution

The paper introduces E.T. Bench, a novel benchmark for detailed event-level video understanding, and proposes E.T. Chat, a baseline model with an instruction-tuning dataset for enhanced performance.

Findings

01. State-of-the-art models struggle with fine-grained event grounding.
02. Short video context length hampers detailed understanding.
03. The instruction-tuned baseline, E.T. Chat, outperforms existing approaches.

Abstract

Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their great potential in general-purpose video understanding. To verify the significance of these models, a number of benchmarks have been proposed to diagnose their capabilities in different scenarios. However, existing benchmarks merely evaluate models through video-level question-answering, lacking fine-grained event-level assessment and task diversity. To fill this gap, we introduce E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark), a large-scale and high-quality benchmark for open-ended event-level video understanding. Categorized within a 3-level task taxonomy, E.T. Bench encompasses 7.3K samples under 12 tasks with 7K videos (251.4h total length) under 8 domains, providing comprehensive evaluations. We extensively evaluated 8 Image-LLMs and 12 Video-LLMs on our benchmark, and the…
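The abstract's dataset figures imply some rough per-unit averages. The sketch below derives them. It assumes exactly 7,300 samples and 7,000 videos, because the abstract only gives the rounded values "7.3K" and "7K". The results are therefore approximate means. They say nothing about how samples or videos are actually distributed across tasks or domains.

```python
# Rough averages derived from the rounded figures quoted in the abstract.
# Assumption: "7.3K samples" -> 7,300 and "7K videos" -> 7,000 (both rounded).
NUM_SAMPLES = 7_300
NUM_TASKS = 12
NUM_VIDEOS = 7_000
TOTAL_HOURS = 251.4
NUM_DOMAINS = 8


def mean_video_seconds(total_hours: float, num_videos: int) -> float:
    """Average video duration in seconds, assuming durations sum to total_hours."""
    return total_hours * 3600 / num_videos


avg_sec = mean_video_seconds(TOTAL_HOURS, NUM_VIDEOS)
samples_per_task = NUM_SAMPLES / NUM_TASKS    # uniform split is an assumption
videos_per_domain = NUM_VIDEOS / NUM_DOMAINS  # uniform split is an assumption

print(f"mean video length ~ {avg_sec:.1f} s ({avg_sec / 60:.2f} min)")
print(f"mean samples per task ~ {samples_per_task:.0f}")
print(f"mean videos per domain ~ {videos_per_domain:.0f}")
```

On these assumptions, the average video runs a little over two minutes, at roughly 129 s.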

Code & Models

Repositories

PolyU-ChenLab/ETBench
PyTorch · Official

Videos

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding · SlidesLive

Taxonomy

Topics: Video Analysis and Summarization · Advanced Data Compression Techniques · Digital Filter Design and Implementation