An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment
Hugo Malard, Michel Olvera, Stéphane Lathuilière, Slim Essid

TL;DR
This paper presents an unsupervised method that aligns audio and image token distributions in multimodal models, so that an image captioner can be repurposed for zero-shot audio captioning without audio-caption supervision.
Contribution
The work introduces a new distribution alignment technique that bridges the modality gap, allowing existing image captioning models to perform zero-shot audio captioning in an unsupervised manner.
Findings
Significantly improved zero-shot audio captioning performance.
Effective distribution alignment between audio and image tokens.
Unaltered image captioning models can be repurposed for audio description.
Abstract
Multimodal large language models have fueled progress in image captioning. These models, fine-tuned on vast image datasets, exhibit a deep understanding of semantic concepts. In this work, we show that this ability can be re-purposed for audio captioning, where the joint image-language decoder can be leveraged to describe auditory content associated with image sequences within videos featuring audiovisual content. This can be achieved via multimodal alignment. Yet, this multimodal alignment task is non-trivial due to the inherent disparity between audible and visible elements in real-world videos. Moreover, multimodal representation learning often relies on contrastive learning, facing the challenge of the so-called modality gap which hinders smooth integration between modalities. In this work, we introduce a novel methodology for bridging the audiovisual modality gap by matching the…
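As a rough illustration of what aligning two token distributions can mean, the sketch below measures the discrepancy between two sets of embedding vectors with a maximum mean discrepancy (MMD) under an RBF kernel. Everything in it is an assumption made for illustration and is not taken from the paper's implementation: the choice of MMD as the discrepancy, the helper names (rbf_kernel, mean_kernel, mmd2, sample_tokens), the bandwidth, the dimensions, and the synthetic Gaussian "tokens" standing in for real audio and image embeddings.

```python
import math
import random


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs, y in ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples xs and ys."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


def sample_tokens(n, dim, shift, rng):
    """Hypothetical stand-in for token embeddings: Gaussian vectors centred at `shift`."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
image_tokens = sample_tokens(64, 8, 0.0, rng)  # reference distribution
audio_far = sample_tokens(64, 8, 2.0, rng)     # large gap to the image tokens
audio_near = sample_tokens(64, 8, 0.1, rng)    # almost aligned with them

gap_far = mmd2(audio_far, image_tokens, bandwidth=4.0)
gap_near = mmd2(audio_near, image_tokens, bandwidth=4.0)
print(f"MMD^2 with shifted audio tokens:        {gap_far:.4f}")
print(f"MMD^2 with nearly aligned audio tokens: {gap_near:.4f}")
```

In a training setting, a quantity of this kind could serve as a loss that an audio encoder minimises so that its outputs fall in the region of embedding space the image captioner already understands. Whether the paper uses this particular discrepancy is not stated in the text above.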
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Cancer-related molecular mechanisms research
