A Whisper transformer for audio captioning trained with synthetic captions and transfer learning

Marek Kadlčík; Adam Hájek; Jürgen Kieslich; Radosław Winiecki

arXiv:2305.09690·cs.SD·May 18, 2023·6 cites

Open Access · 1 Repo · 3 Models

TL;DR

This paper introduces an audio captioning approach using a pretrained Whisper transformer model trained with synthetic captions and transfer learning, showing how various training strategies affect performance.

Contribution

It presents a novel application of the Whisper transformer for audio captioning, leveraging synthetic data and transfer learning to improve results.

Findings

01 Different training strategies significantly impact model performance

02 Pretraining on synthetic captions enhances captioning accuracy

03 Model size and dataset mixture influence results

Abstract

The field of audio captioning has seen significant advancements in recent years, driven by the availability of large-scale audio datasets and advancements in deep learning techniques. In this technical report, we present our approach to audio captioning, focusing on the use of a pretrained speech-to-text Whisper model and pretraining on synthetic captions. We discuss our training procedures and present our experiments' results, which include model size variations, dataset mixtures, and other hyperparameters. Our findings demonstrate the impact of different training strategies on the performance of the audio captioning model. Our code and trained models are publicly available on GitHub and Hugging Face Hub.
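
The abstract mentions dataset mixtures and pretraining on synthetic captions without giving details here. As a minimal sketch of what a weighted dataset mixture could look like, the example below draws training examples from several sources in proportion to given weights. The function name `mixture_sampler`, the source names `synthetic` and `human`, and the weights are hypothetical placeholders for illustration, not the authors' actual configuration or code.

```python
import random
from itertools import islice
from typing import Dict, Iterator, List, Tuple


def mixture_sampler(
    datasets: Dict[str, List[str]],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, example) pairs, choosing the source by weight.

    Hypothetical helper: the source names and weights are assumptions,
    not values taken from the report.
    """
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[name] for name in names]
    while True:
        # Pick a source in proportion to its weight, then one of its examples.
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, rng.choice(datasets[name])


# Placeholder data standing in for (audio, caption) pairs.
data = {
    "synthetic": ["synthetic caption A", "synthetic caption B"],
    "human": ["human caption A"],
}

# Assumed pretraining stage: mostly synthetic captions, some human-written ones.
pretrain_batch = list(
    islice(mixture_sampler(data, {"synthetic": 0.8, "human": 0.2}), 8)
)

# Assumed fine-tuning stage: human-written captions only.
finetune_batch = list(
    islice(mixture_sampler(data, {"synthetic": 0.0, "human": 1.0}), 8)
)
```

Setting a source's weight to zero removes it from the mixture, so the same sampler can express both a mixed pretraining stage and a fine-tuning stage on a single source.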

Code & Models

Repositories

prompteus/audio-captioning · PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis