Automated Audio Captioning: An Overview of Recent Progress and New Challenges

Xinhao Mei; Xubo Liu; Mark D. Plumbley; Wenwu Wang

arXiv:2205.05949·eess.AS·September 28, 2022


TL;DR

This paper reviews recent advances in automated audio captioning, a cross-modal translation task, covering existing approaches, datasets, and evaluation metrics, and outlining open challenges.

Contribution

It offers a detailed review of existing approaches, datasets, and evaluation metrics, and discusses open challenges and future research directions in automated audio captioning.

Findings

1. Deep learning dominates current approaches.
2. Auxiliary information improves caption quality.
3. Evaluation metrics vary widely and need standardization.

Abstract

Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. This task has received increasing attention with the release of freely available datasets in recent years. The problem has been addressed predominantly with deep learning techniques. Numerous approaches have been proposed, such as investigating different neural network architectures, exploiting auxiliary information such as keywords or sentence information to guide caption generation, and employing different training strategies, which have greatly facilitated the development of this field. In this paper, we present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets. We also discuss open challenges and envisage possible future research directions.
