Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation

Arvind Krishna Sridhar; Yinyi Guo; Erik Visser; Rehana Mahfuz

arXiv:2309.03340·cs.CL·September 8, 2023

TL;DR

This paper introduces a parameter-efficient audio captioning method that uses shared latent representations and faithful guidance to produce accurate captions with less model complexity, suitable for edge deployment.

Contribution

The paper proposes a faithful decoding algorithm that leverages an audio-text shared latent space and similarity metrics, reducing model size while maintaining performance.

Findings

1. Achieves comparable performance to larger models on benchmark datasets.

2. Effectively detects hallucinated captions using shared latent space similarity.

3. Reduces model complexity without sacrificing caption quality.
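
The second finding, detecting hallucinated captions through similarity in a shared latent space, can be illustrated with a minimal sketch. It assumes that some audio encoder and text encoder (not specified here) already map a clip and a caption into the same vector space; the embeddings, the `is_hallucinated` helper, and the 0.5 threshold are hypothetical placeholders chosen for illustration, not values from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def is_hallucinated(audio_emb, caption_emb, threshold=0.5):
    """Flag a caption whose embedding is far from the audio embedding.

    The threshold is an illustrative assumption, not a published value.
    """
    return cosine_similarity(audio_emb, caption_emb) < threshold


# Toy embeddings standing in for the outputs of real encoders.
audio_emb = [1.0, 0.0, 0.0]
faithful_caption_emb = [0.9, 0.1, 0.0]
hallucinated_caption_emb = [0.0, 1.0, 0.0]

print(is_hallucinated(audio_emb, faithful_caption_emb))
print(is_hallucinated(audio_emb, hallucinated_caption_emb))
```

In practice the threshold would have to be tuned on held-out data, for example on pairs of reference captions and deliberately corrupted ones.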

Abstract

There has been significant research on developing pretrained transformer architectures for multimodal-to-text generation tasks. Despite performance improvements, such models are frequently overparameterized and hence suffer from hallucination and a large memory footprint, making them challenging to deploy on edge devices. In this paper, we address both of these issues for the application of automated audio captioning. First, we propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination. Then, we propose a parameter-efficient, inference-time faithful decoding algorithm that enables smaller audio captioning models to reach performance equivalent to larger models trained with more data. During the beam decoding step, the smaller model utilizes an audio-text shared latent…
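
The abstract is cut off before it states how the shared latent space enters beam decoding, so the following is only a rough sketch of one plausible form of similarity-guided reranking. The additive score (decoder log-probability plus a weighted cosine similarity), the `rerank_beams` helper, the `weight` parameter, and the toy embeddings are all assumptions made for illustration and should not be read as the authors' algorithm.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank_beams(audio_emb, candidates, weight=1.0):
    """Reorder beam candidates by log-probability plus audio-text similarity.

    candidates: list of (caption, log_prob, text_emb) tuples, where text_emb
    is the caption embedding in the shared space (hypothetical encoder output).
    The additive combination is an assumed scoring rule.
    """
    scored = []
    for caption, log_prob, text_emb in candidates:
        score = log_prob + weight * cosine_similarity(audio_emb, text_emb)
        scored.append((score, caption))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored]


# Toy example: the decoder slightly prefers the first caption, but the
# second caption's embedding matches the audio embedding much better.
audio_emb = [1.0, 0.0]
candidates = [
    ("a dog barks", -1.0, [0.0, 1.0]),
    ("rain falls on a roof", -1.2, [1.0, 0.0]),
]

print(rerank_beams(audio_emb, candidates))
print(rerank_beams(audio_emb, candidates, weight=0.0))
```

Setting `weight` to zero recovers the decoder's own ordering, which makes it easy to see how much the similarity term changes the ranking.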

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Video Analysis and Summarization

Methods: ALIGN