SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context
Jungang Li, Sicheng Tao, Yibo Yan, Xiaojie Gu, Haodong Xu, Xu Zheng, Yuanhuiyi Lyu, Linfeng Zhang, Xuming Hu

TL;DR
This paper introduces SAVEn-Vid, a large audio-visual dataset, and SAVEnVideo, a time-aware AV-LLM, to improve long-video understanding by integrating audio-visual information effectively. SAVEnVideo outperforms existing models.
Contribution
The paper presents SAVEn-Vid, the first long audio-visual video dataset; SAVEnVideo, a novel AV-LLM; and AVBench, a benchmark for evaluating audio-visual comprehension in long videos.
Findings
SAVEnVideo outperforms existing Video-LLMs by 3.61% on Video-MME.
SAVEnVideo surpasses the leading audio-visual LLM by 1.29% on Music-AVQA.
SAVEn-Vid enables better integration of audio-visual information in long video understanding.
Abstract
Endeavors have been made to explore Large Language Models for video analysis (Video-LLMs), particularly in understanding and interpreting long videos. However, existing Video-LLMs still face challenges in effectively integrating the rich and diverse audio-visual information inherent in long videos, which is crucial for comprehensive understanding. This raises the question: how can we leverage embedded audio-visual information to enhance long video understanding? Therefore, (i) we introduce SAVEn-Vid, the first-ever long audio-visual video dataset comprising over 58k audio-visual instructions. (ii) From the model perspective, we propose a time-aware Audio-Visual Large Language Model (AV-LLM), SAVEnVideo, fine-tuned on SAVEn-Vid. (iii) Besides, we present AVBench, a benchmark containing 2,500 QAs designed to evaluate models on enhanced audio-visual comprehension tasks within long video,…
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
