LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos

Tiantian Geng; Jinrui Zhang; Qingni Wang; Teng Wang; Jinming Duan; Feng Zheng

arXiv:2411.19772 · cs.CV · March 21, 2025

TL;DR

LongVALE introduces a comprehensive benchmark for time-aware, multi-modal understanding of long videos, integrating vision, audio, and language with fine-grained event annotations to advance omni-modal perception.

Contribution

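LongVALE is presented as the first vision-audio-language event understanding benchmark. It comprises 105K omni-modal events with precise temporal boundaries and relation-aware captions, and it targets fine-grained omni-modal event understanding and LLM-based temporal video analysis.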

Findings

01 · Effective multi-modal event detection and captioning

02 · Enhanced video understanding with fine-grained temporal annotations

03 · Potential for advancing omni-modal video perception

Abstract

Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality…
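As a rough illustration of what an event with temporal boundaries and a caption might look like in code, here is a minimal Python sketch. The class name, field names, and example captions are hypothetical and are not the LongVALE annotation schema. The temporal IoU helper is a common way to compare a predicted event interval against a reference one, included here as an assumption rather than as the paper's stated metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OmniModalEvent:
    """One annotated event. Field names are hypothetical, not the LongVALE schema."""
    start: float  # event start time, in seconds
    end: float    # event end time, in seconds
    caption: str  # text describing the visual, audio, and speech content

    def __post_init__(self):
        # Reject empty or inverted intervals so the IoU below never divides by zero.
        if self.end <= self.start:
            raise ValueError("end must be greater than start")


def temporal_iou(a: OmniModalEvent, b: OmniModalEvent) -> float:
    """Intersection over union of two time intervals, a value in [0, 1]."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union


# Made-up example events, for illustration only.
reference = OmniModalEvent(10.0, 20.0, "A man speaks while a guitar plays.")
predicted = OmniModalEvent(15.0, 25.0, "Someone talks over guitar music.")
print(temporal_iou(reference, predicted))
```

Keeping the boundaries as plain seconds makes interval comparisons independent of frame rate; a real loader would map whatever format the released annotations use onto a structure like this.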

Code & Models

Repositories

ttgeng233/LongVALE
PyTorch · Official

Datasets

ttgeng233/LongVALE
dataset · 197 dl

Taxonomy

Topics: Advanced Data Compression Techniques · Image and Video Quality Assessment · Video Coding and Compression Technologies