CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

Qilang Ye; Zitong Yu; Rui Shao; Xinyu Xie; Philip Torr; Xiaochun Cao

arXiv:2403.04640 · cs.CV · March 8, 2024 · 1 citation

TL;DR

This paper introduces CAT, an enhancement for Multimodal Large Language Models (MLLMs) that improves question answering in complex, dynamic audio-visual scenarios by aggregating question-related clues, training on a specialized audio-visual instruction dataset, and optimizing for unambiguous responses.

Contribution

We propose CAT, which enhances MLLMs with a clue aggregator, a new audio-visual dataset, and a preference optimization strategy to better handle complex audio-visual questions.

Findings

01. CAT outperforms existing methods on AVQA tasks.
02. Enhanced ability to localize specific audio-visual objects.
03. Improved response clarity and reduced ambiguity.

Abstract

This paper focuses on the challenge of answering questions in scenarios that are composed of rich and complex dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, these responses are sometimes ambiguous and fail to describe specific audio-visual events. To overcome this limitation, we introduce CAT, which enhances MLLMs in three ways: 1) Besides directly bridging audio and video, we design a clue aggregator that aggregates question-related clues in dynamic audio-visual scenarios to enrich the detailed knowledge required by large language models. 2) CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios. Notably, we collect an audio-visual joint instruction dataset named AVinstruct, to further enhance the capacity of CAT to model cross-semantic…
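To make the idea of aggregating question-related clues concrete, here is a minimal sketch in plain Python. It is not the authors' clue aggregator: it assumes the aggregation can be approximated by a single scaled dot-product attention step in which a question embedding scores each audio-visual feature. The names `aggregate_clues`, `question_vec`, and `av_features` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_clues(question_vec, av_features):
    """Hypothetical aggregator: attention-weighted sum of audio-visual features.

    question_vec: list[float], a question embedding (assumed given).
    av_features:  list[list[float]], one vector per audio or video segment.
    Returns the aggregated clue vector and the attention weights.
    """
    dim = len(question_vec)
    scale = math.sqrt(dim)
    # Score each segment by its scaled dot product with the question.
    scores = [
        sum(q * f for q, f in zip(question_vec, feat)) / scale
        for feat in av_features
    ]
    weights = softmax(scores)
    # Weighted sum of segment features: segments relevant to the question dominate.
    clue = [
        sum(w * feat[i] for w, feat in zip(weights, av_features))
        for i in range(dim)
    ]
    return clue, weights


# Toy example: the first segment aligns with the question, the others do not.
question = [1.0, 0.0]
av_features = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
clue, weights = aggregate_clues(question, av_features)
```

In this toy case the first segment receives most of the attention weight, so the aggregated vector leans toward it. A real system would use learned projections and many more segments; the sketch only shows the question-conditioned weighting.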

Code & Models

Repositories

rikeilong/bay-cat (PyTorch, official)

Taxonomy

Topics: Speech and dialogue systems