Multilingual Audio Captioning using machine translated data
Matéo Cousin, Étienne Labbé, Thomas Pellegrini

TL;DR
This paper explores multilingual automated audio captioning (AAC) by machine-translating English datasets into French, German and Spanish, training monolingual and multilingual models on them, and showing that a system built for the target language outperforms translating the outputs of an English system.
Contribution
The paper introduces a multilingual AAC approach based on machine-translated datasets, and a single multilingual model that performs comparably to the monolingual systems while using fewer parameters.
Findings
Monolingual systems achieved about 75% CIDEr on AudioCaps and 43% on Clotho.
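Evaluated on manually written French captions for the AudioCaps eval subset, the French system trained on machine-translated captions scored significantly better than translated outputs of the English system, supporting native-language system development.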
The multilingual model performed comparably to the monolingual models while using fewer parameters.
Abstract
Automated Audio Captioning (AAC) systems attempt to generate a natural language sentence, a caption, that describes the content of an audio recording, in terms of sound events. Existing datasets provide audio-caption pairs, with captions written in English only. In this work, we explore multilingual AAC, using machine translated captions. We translated automatically two prominent AAC datasets, AudioCaps and Clotho, from English to French, German and Spanish. We trained and evaluated monolingual systems in the four languages, on AudioCaps and Clotho. In all cases, the models achieved similar performance, about 75% CIDEr on AudioCaps and 43% on Clotho. In French, we acquired manual captions of the AudioCaps eval subset. The French system, trained on the machine translated version of AudioCaps, achieved significantly better results on the manual eval subset, compared to the English system…
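The dataset-construction step described in the abstract (English audio-caption pairs expanded into French, German and Spanish) can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the record layout, the `translate_captions` helper, the file name, and the `dummy_translate` placeholder are all assumptions made for the example, and a real machine translation system would take the placeholder's place.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: a translator maps (text, target language code)
# to translated text. Any real MT system could be wrapped this way.
Translator = Callable[[str, str], str]

# Target languages named in the abstract: French, German, Spanish.
TARGET_LANGS = ("fr", "de", "es")


def translate_captions(
    pairs: List[Dict[str, str]],
    translate: Translator,
    langs: Sequence[str] = TARGET_LANGS,
) -> List[Dict[str, str]]:
    """Return one record per (audio, language), keeping the English original."""
    records = []
    for pair in pairs:
        # Keep the original English caption alongside the translations.
        records.append(
            {"audio": pair["audio"], "lang": "en", "caption": pair["caption"]}
        )
        for lang in langs:
            records.append(
                {
                    "audio": pair["audio"],
                    "lang": lang,
                    "caption": translate(pair["caption"], lang),
                }
            )
    return records


def dummy_translate(text: str, lang: str) -> str:
    # Placeholder standing in for a real machine translation system;
    # it only tags the text with the target language code.
    return f"[{lang}] {text}"


# Illustrative input record (file name and caption are made up).
pairs = [{"audio": "clip_0001.wav", "caption": "A dog barks while rain falls"}]
records = translate_captions(pairs, dummy_translate)
```

Filtering `records` by `lang` would give the per-language training sets for monolingual systems, while the full list corresponds to the pooled data a multilingual model could be trained on.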
Taxonomy
Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media
