Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, Shinji Watanabe

TL;DR
This paper improves automated audio captioning (AAC) by combining fine-grained BEATs audio features, text embedding supervision through an auxiliary InfoNCE loss, and ChatGPT-based caption mix-up augmentation, reaching state-of-the-art performance.
Contribution
It introduces a novel combination of pretrained models, LLM-based caption mix-up augmentation, and hybrid inference techniques to improve AAC models.
Findings
Achieved a new state-of-the-art 32.6 SPIDEr-FL score on Clotho.
Won the 2023 DCASE AAC challenge.
LLM-generated caption mix-ups increased training data diversity and improved model performance.
Abstract
Automated audio captioning (AAC) aims to generate informative descriptions for various sounds from nature and/or human activities. In recent years, AAC has quickly attracted research interest, with state-of-the-art systems now relying on a sequence-to-sequence (seq2seq) backbone powered by strong models such as Transformers. Following the macro-trend of applied machine learning research, in this work, we strive to improve the performance of seq2seq AAC models by extensively leveraging pretrained models and large language models (LLMs). Specifically, we utilize BEATs to extract fine-grained audio features. Then, we employ Instructor LLM to fetch text embeddings of captions, and infuse their language-modality knowledge into BEATs audio features via an auxiliary InfoNCE loss function. Moreover, we propose a novel data augmentation method that uses ChatGPT to produce caption mix-ups (i.e.,…
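The auxiliary InfoNCE objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `l2_normalize` and `info_nce`, the temperature value, the one-directional (audio-to-text) form, and the assumption that each clip is already pooled into a single vector of the same dimension as its caption embedding are all illustrative choices, since the abstract does not specify them. The sketch only shows the general idea: within a batch, each audio embedding should score higher against its own caption embedding than against the other captions.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length (eps guards against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def info_nce(audio_embs, text_embs, temperature=0.07):
    """Audio-to-text InfoNCE loss averaged over a batch.

    Assumptions (not taken from the paper): audio_embs[i] and text_embs[i]
    form the positive pair, every other caption in the batch is a negative,
    and similarity is cosine similarity divided by a fixed temperature.
    """
    audio = [l2_normalize(v) for v in audio_embs]
    text = [l2_normalize(v) for v in text_embs]

    losses = []
    for i, a in enumerate(audio):
        # Similarity of audio clip i to every caption in the batch.
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in text]
        # Numerically stable log-sum-exp over all captions.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Cross-entropy with the matching caption as the target class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy 2-D embeddings: matched pairs point in the same direction.
    audio_batch = [[1.0, 0.0], [0.0, 1.0]]
    aligned_text = [[2.0, 0.0], [0.0, 3.0]]
    swapped_text = [[0.0, 3.0], [2.0, 0.0]]
    print("aligned:", info_nce(audio_batch, aligned_text))
    print("swapped:", info_nce(audio_batch, swapped_text))
```

In this toy setup the loss is close to zero when each audio vector aligns with its own caption and grows large when the captions are swapped. In a real system such a term would be added to the captioning loss with some weight, which is again a design choice not covered by this sketch.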
Taxonomy
Topics: Music and Audio Processing · Subtitles and Audiovisual Media · Speech Recognition and Synthesis
