video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang

TL;DR
video-SALMONN is an end-to-end audio-visual large language model that effectively understands speech, visual, and audio elements in videos, significantly improving video question-answering accuracy and reasoning capabilities.
Contribution
The paper introduces a novel multi-resolution causal Q-Former and training schemes for integrated speech, audio, and visual understanding in av-LLMs, advancing video comprehension.
Findings
Achieves more than 25% absolute accuracy improvement on the video-QA task.
Achieves over 30% absolute accuracy improvement on audio-visual QA tasks with human speech.
Demonstrates advanced reasoning abilities in video understanding.
Abstract
Speech understanding as an element of the more generic video understanding using audio-visual large language models (av-LLMs) is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing, which can understand not only visual frame sequences, audio events and music, but speech as well. To obtain the fine-grained temporal information required by speech understanding while remaining efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect pre-trained audio-visual encoders and the backbone large language model. Moreover, dedicated training approaches, including the diversity loss and the unpaired audio-visual mixed training scheme, are proposed to avoid frame or modality dominance. On the introduced speech-audio-visual evaluation benchmark, video-SALMONN achieves…
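The multi-resolution idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It only shows how one frame sequence might be split into windows at several temporal resolutions, so that small windows keep fine-grained timing (useful for speech) while large windows summarize longer spans. The function name `multi_resolution_windows` and the window sizes `[1, 5, 10]` are hypothetical choices made for illustration; the abstract does not specify them.

```python
from typing import Dict, List, Tuple


def multi_resolution_windows(
    num_frames: int, window_sizes: List[int]
) -> Dict[int, List[Tuple[int, int]]]:
    """Split frame indices [0, num_frames) into non-overlapping windows.

    Returns a mapping from each window size to its list of (start, end)
    spans, where end is exclusive. The last window may be shorter when
    num_frames is not a multiple of the window size.
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    spans: Dict[int, List[Tuple[int, int]]] = {}
    for size in window_sizes:
        if size <= 0:
            raise ValueError("window sizes must be positive")
        spans[size] = [
            (start, min(start + size, num_frames))
            for start in range(0, num_frames, size)
        ]
    return spans


# Hypothetical example: 10 frames viewed at three resolutions.
windows = multi_resolution_windows(10, [1, 5, 10])
```

In this sketch, each window would be summarized by a Q-Former-style module into a few output tokens; how many tokens each resolution produces, and how causality across windows is enforced, are details the abstract leaves to the full paper.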
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing
