Multilingual Audio Captioning using machine translated data

Matéo Cousin; Étienne Labbé; Thomas Pellegrini

arXiv:2309.07615 · cs.SD · September 15, 2023

Open Access

TL;DR

This paper explores multilingual automated audio captioning (AAC) by machine-translating the AudioCaps and Clotho datasets from English into French, German and Spanish, training monolingual and multilingual models, and showing, for French, that a system built in the target language outperforms translating the outputs of an English system.

Contribution

It introduces a multilingual AAC approach based on machine-translated datasets and develops a multilingual model that performs comparably to the monolingual systems while using fewer parameters.

Findings

01

Monolingual systems in all four languages (English, French, German and Spanish) achieved similar performance: about 75% CIDEr on AudioCaps and 43% on Clotho.

02

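On manually written French captions for the AudioCaps eval subset, the French system trained on machine-translated data scored significantly better than the English system with translated outputs, supporting native-language system development.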

03

The multilingual model performed comparably to the monolingual models while using fewer parameters.

Abstract

Automated Audio Captioning (AAC) systems attempt to generate a natural language sentence, a caption, that describes the content of an audio recording, in terms of sound events. Existing datasets provide audio-caption pairs, with captions written in English only. In this work, we explore multilingual AAC, using machine translated captions. We translated automatically two prominent AAC datasets, AudioCaps and Clotho, from English to French, German and Spanish. We trained and evaluated monolingual systems in the four languages, on AudioCaps and Clotho. In all cases, the models achieved similar performance, about 75% CIDEr on AudioCaps and 43% on Clotho. In French, we acquired manual captions of the AudioCaps eval subset. The French system, trained on the machine translated version of AudioCaps, achieved significantly better results on the manual eval subset, compared to the English system…

Taxonomy

Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media