Zero-Shot Audio Captioning via Audibility Guidance
Tal Shaharabany, Ariel Shaulov, Lior Wolf

TL;DR
This paper introduces a zero-shot audio captioning approach that combines three networks at inference time, one of which provides audibility guidance, and improves caption quality without training on caption data.
Contribution
It proposes a novel zero-shot method for audio captioning using three networks to ensure fluency, faithfulness, and audibility, without requiring training on caption datasets.
Findings
Audibility guidance significantly improves captioning performance.
The method outperforms baseline models lacking audibility considerations.
Using GPT-4 to create the dataset enables effective training of the audibility text classifier.
Abstract
The task of audio captioning is similar in essence to tasks such as image and video captioning. However, it has received much less attention. We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and the somewhat related (iii) audibility, which is the quality of being able to be perceived based only on audio. Our method is a zero-shot method, i.e., we do not learn to perform captioning. Instead, captioning occurs as an inference process that involves three networks that correspond to the three desired qualities: (i) A Large Language Model, in our case, for reasons of convenience, GPT-2, (ii) A model that provides a matching score between an audio file and a text, for which we use a multimodal matching network called ImageBind, and (iii) A text classifier, trained using a dataset we collected…
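The abstract describes captioning as an inference process that draws on three networks, one per desired quality. The sketch below illustrates one simple way such signals could be combined: re-ranking candidate captions by a weighted sum of three scores. It is not the authors' procedure, which the truncated abstract does not specify. The function `rank_captions`, the weights `w_faith` and `w_aud`, and the toy scorers are all hypothetical stand-ins; a real system would replace them with GPT-2, ImageBind, and the trained audibility classifier.

```python
from typing import Callable, List

# A scorer maps a caption string to a real-valued score (higher is better).
Scorer = Callable[[str], float]


def rank_captions(
    candidates: List[str],
    fluency: Scorer,
    faithfulness: Scorer,
    audibility: Scorer,
    w_faith: float = 1.0,  # hypothetical weight on audio-text matching
    w_aud: float = 1.0,    # hypothetical weight on audibility
) -> List[str]:
    """Sort candidate captions by a weighted sum of the three scores."""
    def total(caption: str) -> float:
        return (
            fluency(caption)
            + w_faith * faithfulness(caption)
            + w_aud * audibility(caption)
        )
    return sorted(candidates, key=total, reverse=True)


# Toy scorers (assumptions for illustration only).
def toy_fluency(caption: str) -> float:
    # Stand-in for a language-model score: mildly prefers shorter text.
    return -0.1 * len(caption.split())


def toy_faithfulness(caption: str) -> float:
    # Stand-in for an audio-text matching score: counts words that match
    # what is (pretend) present in the audio clip.
    heard = {"dog", "barking"}
    return float(sum(word in heard for word in caption.split()))


def toy_audibility(caption: str) -> float:
    # Stand-in for an audibility classifier: penalises words that describe
    # properties which cannot be perceived from sound alone.
    visual_only = {"brown", "red"}
    return -float(sum(word in visual_only for word in caption.split()))


candidates = ["a brown dog", "a dog barking", "a red car"]
ranked = rank_captions(candidates, toy_fluency, toy_faithfulness, toy_audibility)
print(ranked[0])
```

In this toy setup, the audibility term lowers the score of captions that mention colour, which cannot be inferred from sound, while the faithfulness term favours captions that match the audio content.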
Taxonomy
Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Subtitles and Audiovisual Media
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Residual Connection · Discriminative Fine-Tuning · Adam · Weight Decay · Cosine Annealing
