SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context

Jungang Li; Sicheng Tao; Yibo Yan; Xiaojie Gu; Haodong Xu; Xu Zheng; Yuanhuiyi Lyu; Linfeng Zhang; Xuming Hu

arXiv:2411.16213·cs.CV·December 12, 2024

TL;DR

This paper introduces SAVEn-Vid, a large audio-visual dataset, and SAVEnVideo, a time-aware AV-LLM fine-tuned on it, to improve long-video understanding by integrating audio-visual information. SAVEnVideo outperforms existing models.

Contribution

The paper presents SAVEn-Vid, the first long audio-visual video dataset; SAVEnVideo, a novel AV-LLM; and AVBench, a benchmark for evaluating audio-visual comprehension in long videos.

Findings

01

SAVEnVideo outperforms existing Video-LLMs by 3.61% on Video-MME.

02

SAVEnVideo surpasses the leading audio-visual LLM by 1.29% on Music-AVQA.

03

SAVEn-Vid enables better integration of audio-visual information in long video understanding.

Abstract

Endeavors have been made to explore Large Language Models for video analysis (Video-LLMs), particularly in understanding and interpreting long videos. However, existing Video-LLMs still face challenges in effectively integrating the rich and diverse audio-visual information inherent in long videos, which is crucial for comprehensive understanding. This raises the question: how can we leverage embedded audio-visual information to enhance long video understanding? Therefore, (i) we introduce SAVEn-Vid, the first-ever long audio-visual video dataset comprising over 58k audio-visual instructions. (ii) From the model perspective, we propose a time-aware Audio-Visual Large Language Model (AV-LLM), SAVEnVideo, fine-tuned on SAVEn-Vid. (iii) Besides, we present AVBench, a benchmark containing 2,500 QAs designed to evaluate models on enhanced audio-visual comprehension tasks within long video,…

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing