An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning

Xinhao Mei; Qiushi Huang; Xubo Liu; Gengyun Chen; Jingqian Wu; Yusong Wu; Jinzheng Zhao; Shengchen Li; Tom Ko; H Lilian Tang; Xi Shao; Mark D. Plumbley; Wenwu Wang

arXiv:2108.02752 · eess.AS · August 6, 2021 · 20 cites

TL;DR

This paper introduces an encoder-decoder audio captioning system enhanced with transfer learning and reinforcement learning, improving evaluation metrics but with mixed effects on caption quality.

Contribution

The study proposes a novel encoder-decoder architecture for audio captioning that incorporates transfer learning and reinforcement learning to address data scarcity and metric optimization.

Findings

1. System ranked 3rd in DCASE 2021 Task 6

2. Transfer learning significantly improves performance

3. Reinforcement learning enhances metric scores but may affect caption quality

Abstract

Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features extracted by the encoder. To improve the proposed system, transfer learning from either an upstream audio-related task or a large in-domain dataset is introduced to mitigate the problem induced by data scarcity. In addition, evaluation metrics are incorporated into the optimization of the model with reinforcement learning, which helps address both the problem of "exposure bias" induced by the "teacher forcing" training strategy and the mismatch between the evaluation metrics and the loss function. The resulting system was ranked 3rd in DCASE 2021 Task 6. Ablation studies are carried out to investigate how much each element in the proposed system can contribute to…
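The abstract does not name the reinforcement learning algorithm it uses. As an illustration only, the sketch below shows one common way to fold an evaluation metric into training: a self-critical policy-gradient loss, where a sampled caption is rewarded relative to the model's own greedy caption. Everything in it is an assumption made for illustration: the function names, the example captions, and in particular `toy_reward`, a unigram F1 score that stands in for a real captioning metric.

```python
def toy_reward(candidate, reference):
    """Hypothetical stand-in for a captioning metric: unigram F1 overlap."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    # Clipped count of candidate words that also appear in the reference.
    overlap = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def self_critical_loss(sampled_log_probs, sampled_caption, greedy_caption, reference):
    """Policy-gradient loss with the greedy caption's reward as the baseline.

    sampled_log_probs: per-word log-probabilities the model assigned to the
    sampled caption (assumed to come from the decoder).
    """
    advantage = toy_reward(sampled_caption, reference) - toy_reward(greedy_caption, reference)
    # Minimizing this raises the likelihood of samples that beat the greedy
    # caption and lowers the likelihood of samples that do worse.
    return -advantage * sum(sampled_log_probs)
```

Because both captions are generated by the model itself rather than fed from ground truth, training of this kind sees the same conditions as inference, which is how such methods are usually argued to reduce exposure bias. The reward is computed by the metric directly, so it does not need to be differentiable.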

Code & Models

Repositories

XinhaoMei/DCASE2021_task6_v2
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis