Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

Guangzhi Sun; Wenyi Yu; Changli Tang; Xianzhao Chen; Tian Tan; Wei Li; Lu Lu; Zejun Ma; Chao Zhang

arXiv:2310.05863·eess.AS·October 11, 2023·5 cites

TL;DR

This paper introduces FAVOR, a framework for fine-grained audio-visual joint representations in multimodal large language models, enhancing video understanding and reasoning capabilities.

Contribution

It proposes a novel causal Q-Former structure for aligning audio-visual features at the frame level and introduces AVEB, a benchmark for evaluating multimodal reasoning.

Findings

1. FAVOR improves accuracy by more than 20% on video question-answering tasks.
2. It achieves competitive performance on single-modal audio, speech, and image tasks.
3. It demonstrates advanced video comprehension and reasoning abilities.

Abstract

Audio-visual large language models (LLMs) have drawn significant attention, yet the fine-grained combination of both input streams remains under-explored, although it is challenging but necessary for LLMs to understand general video inputs. To this end, a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal LLMs is proposed in this paper, which extends a text-based LLM to simultaneously perceive speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level. To fuse the audio and visual feature streams into joint representations and to align the joint space with the LLM input embedding space, we propose a causal Q-Former structure with a causal attention module to enhance the capture of causal relations of the audio-visual frames across time. An audio-visual evaluation benchmark (AVEB) is also proposed…
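To make the idea of frame-level fusion with causal attention concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's causal Q-Former: the function names (`fuse_frames`, `causal_mask`, `causal_self_attention`) are hypothetical, fusion is assumed to be simple per-frame concatenation, and the attention has no learned projections or query tokens. It only shows the masking pattern in which frame t may attend to frames at or before t.

```python
import math


def fuse_frames(audio, visual):
    """Concatenate per-frame audio and visual vectors (assumed fusion scheme)."""
    if len(audio) != len(visual):
        raise ValueError("audio and visual streams must have the same number of frames")
    return [a + v for a, v in zip(audio, visual)]


def causal_mask(num_frames):
    """mask[t][s] is True when frame t is allowed to attend to frame s (s <= t)."""
    return [[s <= t for s in range(num_frames)] for t in range(num_frames)]


def causal_self_attention(frames):
    """Scaled dot-product self-attention over frames, restricted by a causal mask.

    No learned weights are used: queries, keys, and values are the frames themselves.
    """
    dim = len(frames[0])
    mask = causal_mask(len(frames))
    outputs = []
    for t, query in enumerate(frames):
        # Indices of the frames that frame t may look at (itself and earlier ones).
        visible = [s for s in range(len(frames)) if mask[t][s]]
        scores = [
            sum(q * k for q, k in zip(query, frames[s])) / math.sqrt(dim)
            for s in visible
        ]
        # Numerically stable softmax over the visible frames only.
        peak = max(scores)
        weights = [math.exp(x - peak) for x in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible frames gives the output for frame t.
        outputs.append(
            [sum(w * frames[s][d] for s, w in zip(visible, weights)) for d in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    # Toy inputs: three frames, 2-dim audio features and 1-dim visual features.
    audio = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    visual = [[0.2], [0.4], [0.6]]
    joint = fuse_frames(audio, visual)
    for t, vec in enumerate(causal_self_attention(joint)):
        print(t, [round(x, 3) for x in vec])
```

Because of the mask, changing a later frame leaves the outputs for earlier frames unchanged, which is the property a causal attention module relies on when modelling how events unfold across time.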

Taxonomy

Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Speech and Audio Processing

Methods: ALIGN