An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment

Hugo Malard; Michel Olvera; Stéphane Lathuilière; Slim Essid

arXiv:2410.05997·eess.AS·October 10, 2024

TL;DR

This paper presents an unsupervised method that aligns the distribution of audio tokens with that of image tokens in a multimodal model, enabling zero-shot audio captioning by repurposing an image captioner for auditory content while leaving the image captioning component unaltered.

Contribution

The work introduces a new distribution alignment technique that bridges the modality gap, allowing existing image captioning models to perform zero-shot audio captioning in an unsupervised manner.
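
As a rough illustration of what "distribution alignment" between two sets of tokens can mean, the sketch below measures the discrepancy between a set of audio token embeddings and a set of image token embeddings. The choice of maximum mean discrepancy (MMD) with a Gaussian kernel is an assumption made for illustration, since this page does not state which discrepancy the authors use; the function names, the bandwidth, and the synthetic "tokens" are likewise hypothetical.

```python
import math
import random


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sets of vectors.

    Assumption: MMD stands in for whatever discrepancy the paper
    actually minimizes; it is used here only as a familiar example.
    """
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def sample_tokens(n, dim, mean, rng):
    """Draw n synthetic 'token embeddings' from an isotropic Gaussian."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
image_tokens = sample_tokens(50, 4, 0.0, rng)   # stand-in for image tokens
audio_far = sample_tokens(50, 4, 3.0, rng)      # audio tokens before alignment
audio_near = sample_tokens(50, 4, 0.1, rng)     # audio tokens after alignment

gap_far = mmd_squared(audio_far, image_tokens)
gap_near = mmd_squared(audio_near, image_tokens)
print(f"discrepancy before alignment: {gap_far:.4f}")
print(f"discrepancy after alignment:  {gap_near:.4f}")
```

In a training setup, a quantity of this kind would serve as a loss for the audio backbone only, so that its tokens move toward the region of token space the frozen image captioner already understands. No caption labels are involved, which is what makes such an objective unsupervised.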

Findings

01 Significantly improved zero-shot audio captioning performance.

02 Effective distribution alignment between audio and image tokens.

03 Unaltered image captioning models can be repurposed for audio description.

Abstract

Multimodal large language models have fueled progress in image captioning. These models, fine-tuned on vast image datasets, exhibit a deep understanding of semantic concepts. In this work, we show that this ability can be re-purposed for audio captioning, where the joint image-language decoder can be leveraged to describe auditory content associated with image sequences within videos featuring audiovisual content. This can be achieved via multimodal alignment. Yet, this multimodal alignment task is non-trivial due to the inherent disparity between audible and visible elements in real-world videos. Moreover, multimodal representation learning often relies on contrastive learning, facing the challenge of the so-called modality gap which hinders smooth integration between modalities. In this work, we introduce a novel methodology for bridging the audiovisual modality gap by matching the…

Code & Models

Repositories

hugomalard/aneyeforanear
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Cancer-related molecular mechanisms research