Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha

TL;DR
Meerkat is a novel audio-visual large language model that achieves fine-grained spatial and temporal understanding of images and audio, enabling it to perform complex grounding and localization tasks with state-of-the-art accuracy.
Contribution
The paper introduces Meerkat, a new multi-modal LLM with a novel modality alignment module and a large curated dataset, advancing fine-grained audio-visual understanding.
Findings
State-of-the-art performance on multiple audio-visual tasks
Up to 37.12% relative improvement over previous methods
Effective spatial and temporal grounding in audio-visual data
Abstract
Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been mostly focused on tasks that only require a coarse-grained understanding of the audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset AVFIT that comprises 3M instruction tuning samples collected from open-source datasets, and introduce…
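The abstract names optimal transport as the basis of the modality alignment module but does not give its formulation. As a rough illustration of the general idea only, the sketch below computes an entropy-regularized transport plan (Sinkhorn iterations) between a handful of audio and image token embeddings. The cosine cost, the uniform marginals, the function names (`cosine_cost`, `sinkhorn_plan`), and the parameters `eps` and `n_iters` are all assumptions made for this example and are not taken from the paper.

```python
import math


def cosine_cost(a, b):
    """Cost between two vectors, defined here as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def sinkhorn_plan(audio_tokens, image_tokens, eps=0.1, n_iters=200):
    """Entropic optimal transport between two token sets (illustrative only).

    Assumes uniform mass on every token. Returns (plan, cost), where
    plan[i][j] is the mass moved from audio token i to image token j.
    """
    n, m = len(audio_tokens), len(image_tokens)
    row_mass = [1.0 / n] * n  # assumed uniform marginal over audio tokens
    col_mass = [1.0 / m] * m  # assumed uniform marginal over image tokens

    cost = [[cosine_cost(a, v) for v in image_tokens] for a in audio_tokens]
    # Gibbs kernel: low-cost pairs get large kernel values.
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]

    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]

    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    return plan, cost


# Hypothetical 2-D embeddings, chosen by hand for the example.
audio = [[1.0, 0.0], [0.0, 1.0]]
image = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]

plan, cost = sinkhorn_plan(audio, image)
# Scalar alignment score: total transport cost under the plan.
ot_distance = sum(plan[i][j] * cost[i][j]
                  for i in range(len(audio)) for j in range(len(image)))
```

In this toy setup, each audio token sends most of its mass to the image token pointing in the same direction, and a quantity like `ot_distance` could in principle serve as an alignment loss. Whether Meerkat uses this particular cost, regularization, or solver is not stated in the abstract.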
Taxonomy
Topics: Computational and Text Analysis Methods
Methods: Softmax · Concatenated Skip Connection
