Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models
Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

TL;DR
This paper introduces FAVOR, a framework for fine-grained audio-visual joint representations in multimodal large language models, enhancing video understanding and reasoning capabilities.
Contribution
It proposes a novel causal Q-Former structure for aligning audio-visual features at the frame level and introduces AVEB, a benchmark for evaluating multimodal reasoning.
Findings
FAVOR improves accuracy by over 20% on video question-answering tasks.
Achieves competitive performance on single-modal audio, speech, and image tasks.
Demonstrates advanced video comprehension and reasoning abilities.
Abstract
Audio-visual large language models (LLMs) have drawn significant attention, yet the fine-grained combination of both input streams is rather under-explored, which is challenging but necessary for LLMs to understand general video inputs. To this end, a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal LLMs is proposed in this paper, which extends a text-based LLM to simultaneously perceive speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level. To fuse the audio and visual feature streams into joint representations and to align the joint space with the LLM input embedding space, we propose a causal Q-Former structure with a causal attention module to enhance the capture of causal relations of the audio-visual frames across time. An audio-visual evaluation benchmark (AVEB) is also proposed…
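The abstract mentions a causal attention module over audio-visual frames. As a rough illustration of the general idea only (not the authors' implementation, which this page does not describe), the sketch below shows causal masking in plain Python: frame t may attend to frames up to and including t, never to later ones. The function names, the toy frame vectors, and the use of raw dot products as scores are all assumptions made for the example.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a T x T score matrix with future frames masked.

    scores[t][s] is a (hypothetical) similarity between frame t and frame s.
    Frame t may only attend to frames s <= t.
    """
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                       # drop future frames s > t
        peak = max(visible)                          # subtract max for numerical stability
        exps = [math.exp(v - peak) for v in visible]
        total = sum(exps)
        # Future positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights

def attend(weights, frames):
    """Mix frame vectors using the attention weights."""
    dim = len(frames[0])
    return [
        [sum(w * f[i] for w, f in zip(w_row, frames)) for i in range(dim)]
        for w_row in weights
    ]

# Toy example (assumed data): 3 frames with 2-dimensional joint features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [[sum(a * b for a, b in zip(q, k)) for k in frames] for q in frames]
weights = causal_attention_weights(scores)
mixed = attend(weights, frames)
```

In this toy setup the first frame can only see itself, so its output equals its input, while later frames blend in information from earlier ones. A real model would use learned query, key, and value projections rather than raw dot products.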
Taxonomy
Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Speech and Audio Processing
Methods: ALIGN
