Audio-Visual LLM for Video Understanding
Fangxun Shu, Lei Zhang, Hao Jiang, Cihang Xie

TL;DR
This paper introduces Audio-Visual LLM, a multimodal large language model that integrates visual and auditory data for comprehensive video understanding, demonstrating strong zero-shot performance on various tasks.
Contribution
The paper presents a novel modality-augmented training approach and a high-quality video instruction dataset, enabling end-to-end joint training for multimodal video understanding.
Findings
Achieves 53.7% accuracy on MSRVTT-QA, surpassing previous models.
Demonstrates strong zero-shot performance across multiple video understanding tasks.
Performs competitively on audio-specific tasks like AudioCaps.
Abstract
This paper presents Audio-Visual LLM, a Multimodal Large Language Model that takes both visual and auditory inputs for holistic video understanding. A key design is the modality-augmented training, which involves the integration of modality-specific tokens engineered to activate the appropriate visual and/or auditory encoder selectively. This mechanism is pivotal in enabling end-to-end joint training with video data at different modalities, including visual-only, audio-only, and audio-visual formats. Moreover, we introduce a high-quality video instruction dataset, derived from GPT-4. This dataset allows Audio-Visual LLM to adeptly process a variety of task-oriented video instructions, ranging from multi-turn conversations and audio-visual narratives to complex reasoning tasks. Extensive experiments demonstrate that Audio-Visual LLM impressively achieves strong zero-shot results across a…
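The abstract's modality-specific tokens can be pictured as a small routing step: a token in the instruction decides which encoder runs for a given sample. The following is a minimal sketch of that idea and not the authors' implementation. The token strings `<visual>` and `<audio>`, the stub encoders, and the function names `active_modalities` and `encode_sample` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical token strings; the source does not state the actual tokens.
VISUAL_TOKEN = "<visual>"
AUDIO_TOKEN = "<audio>"


def visual_encoder(frames: List[str]) -> List[str]:
    """Stand-in for a visual encoder: one placeholder feature per frame."""
    return [f"v({frame})" for frame in frames]


def audio_encoder(clips: List[str]) -> List[str]:
    """Stand-in for an audio encoder: one placeholder feature per clip."""
    return [f"a({clip})" for clip in clips]


# Map each modality token to the encoder it activates (assumed design).
ENCODERS: Dict[str, Callable[[List[str]], List[str]]] = {
    VISUAL_TOKEN: visual_encoder,
    AUDIO_TOKEN: audio_encoder,
}


def active_modalities(prompt: str) -> List[str]:
    """Return the modality tokens present in the prompt, visual first."""
    return [token for token in ENCODERS if token in prompt]


def encode_sample(
    prompt: str,
    frames: Optional[List[str]] = None,
    clips: Optional[List[str]] = None,
) -> List[str]:
    """Run only the encoders whose tokens appear in the prompt."""
    inputs = {VISUAL_TOKEN: frames, AUDIO_TOKEN: clips}
    features: List[str] = []
    for token in active_modalities(prompt):
        data = inputs[token]
        if data is None:
            raise ValueError(f"prompt requests {token} but no input was given")
        features.extend(ENCODERS[token](data))
    return features


if __name__ == "__main__":
    # Visual-only, audio-only, and audio-visual samples share one code path.
    print(encode_sample("<visual> What happens?", frames=["f0", "f1"]))
    print(encode_sample("<audio> What is heard?", clips=["c0"]))
    print(encode_sample("<visual><audio> Describe.", frames=["f0"], clips=["c0"]))
```

Because the routing depends only on the tokens in the prompt, visual-only, audio-only, and audio-visual samples can be mixed in one training batch loop without separate pipelines, which is the property the abstract attributes to modality-augmented training.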
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Dropout · Layer Normalization · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing
