video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Guangzhi Sun; Wenyi Yu; Changli Tang; Xianzhao Chen; Tian Tan; Wei Li; Lu Lu; Zejun Ma; Yuxuan Wang; Chao Zhang

arXiv:2406.15704 · cs.CV · June 25, 2024

TL;DR

video-SALMONN is an end-to-end audio-visual large language model that effectively understands speech, visual, and audio elements in videos, significantly improving video question-answering accuracy and reasoning capabilities.

Contribution

The paper introduces a novel multi-resolution causal Q-Former and training schemes for integrated speech, audio, and visual understanding in av-LLMs, advancing video comprehension.
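
The summary names the multi-resolution causal Q-Former but does not describe its internals. As a rough sketch only, assuming "multi-resolution" means the same frame sequence is grouped into windows of several sizes and "causal" means a frame can see only itself and earlier frames, the bookkeeping might look like the following. The function names and the window sizes are hypothetical and are not taken from the paper.

```python
def split_into_windows(num_frames, window_size):
    """Return (start, end) index pairs of consecutive windows covering the frames.

    The end index is exclusive; the last window may be shorter than window_size.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [
        (start, min(start + window_size, num_frames))
        for start in range(0, num_frames, window_size)
    ]


def multi_resolution_windows(num_frames, window_sizes):
    """Group the same frame sequence at several window sizes (assumed resolutions)."""
    return {size: split_into_windows(num_frames, size) for size in window_sizes}


def causal_mask(num_frames):
    """Boolean mask where entry [i][j] is True if frame i may attend to frame j."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


if __name__ == "__main__":
    # Hypothetical setting: 10 frames, a fine resolution (2) and a coarse one (5).
    for size, windows in multi_resolution_windows(10, [2, 5]).items():
        print(size, windows)
```

Under this assumption, small windows would preserve the fine temporal detail that speech needs, while large windows would summarise slower-changing content with fewer outputs.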

Findings

01. Achieves more than 25% absolute accuracy improvement on the video-QA task.

02. Achieves more than 30% absolute accuracy improvement on audio-visual QA tasks with human speech.

03. Demonstrates advanced reasoning abilities in video understanding.

Abstract

Speech understanding as an element of the more generic video understanding using audio-visual large language models (av-LLMs) is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing, which can understand not only visual frame sequences, audio events and music, but speech as well. To obtain the fine-grained temporal information required by speech understanding while remaining efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect pre-trained audio-visual encoders and the backbone large language model. Moreover, dedicated training approaches including the diversity loss and the unpaired audio-visual mixed training scheme are proposed to avoid frame or modality dominance. On the introduced speech-audio-visual evaluation benchmark, video-SALMONN achieves…
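
The abstract names a diversity loss but does not give its form here. One common way to discourage a set of output vectors from collapsing onto each other is to penalise their average pairwise cosine similarity; the sketch below illustrates that general idea under this assumption and should not be read as the paper's definition. The function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors, guarded against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, 1e-12)


def diversity_penalty(vectors):
    """Mean cosine similarity over all distinct pairs of vectors.

    Lower values mean the vectors point in more different directions.
    Returns 0.0 when there are fewer than two vectors.
    """
    n = len(vectors)
    if n < 2:
        return 0.0
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine_similarity(vectors[i], vectors[j])
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    print(diversity_penalty([[1.0, 0.0], [1.0, 0.0]]))  # identical directions
    print(diversity_penalty([[1.0, 0.0], [0.0, 1.0]]))  # orthogonal directions
```

A penalty of this kind would be added to the main training objective with a small weight, so that minimising it pushes the outputs apart without overriding the task loss.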

Code & Models

Models

🤗 tsinghua-ee/SALMONN
model · ♡ 51

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing