Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation

Shih-Lun Wu; Xuankai Chang; Gordon Wichern; Jee-weon Jung; François Germain; Jonathan Le Roux; Shinji Watanabe

arXiv:2309.17352·cs.SD·January 11, 2024

TL;DR

This paper advances automated audio captioning by integrating fine-grained audio features, text embedding supervision, and innovative data augmentation using large language models, resulting in state-of-the-art performance.

Contribution

It introduces a novel combination of pretrained models, LLM-based caption mix-up augmentation, and hybrid inference techniques to improve AAC models.

Findings

1. Achieved a new state-of-the-art 32.6 SPIDEr-FL score on Clotho.

2. Won the 2023 DCASE AAC challenge.

3. LLM-based caption mix-up augmentation increased training data diversity and improved model performance.

Abstract

Automated audio captioning (AAC) aims to generate informative descriptions for various sounds from nature and/or human activities. In recent years, AAC has quickly attracted research interest, with state-of-the-art systems now relying on a sequence-to-sequence (seq2seq) backbone powered by strong models such as Transformers. Following the macro-trend of applied machine learning research, in this work, we strive to improve the performance of seq2seq AAC models by extensively leveraging pretrained models and large language models (LLMs). Specifically, we utilize BEATs to extract fine-grained audio features. Then, we employ Instructor LLM to fetch text embeddings of captions, and infuse their language-modality knowledge into BEATs audio features via an auxiliary InfoNCE loss function. Moreover, we propose a novel data augmentation method that uses ChatGPT to produce caption mix-ups (i.e.,…
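The auxiliary InfoNCE objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each audio clip and its caption have already been reduced to one fixed-size vector each (in the paper these would come from BEATs and Instructor; here they are toy lists), and that the i-th audio vector and the i-th text vector form the positive pair. The function names, the temperature value of 0.07, and the audio-to-text-only direction are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(audio_embs, text_embs, temperature=0.07):
    """Mean InfoNCE loss over a batch (hypothetical helper).

    The i-th audio embedding and the i-th text embedding are treated as the
    positive pair; every other text embedding in the batch is a negative.
    The temperature is an assumed value, not one reported by the paper.
    """
    audio = [l2_normalize(a) for a in audio_embs]
    text = [l2_normalize(t) for t in text_embs]
    losses = []
    for i, a in enumerate(audio):
        # Temperature-scaled cosine similarity of this audio clip to every caption.
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in text]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching caption as the target class.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two clips whose embeddings line up with their own captions.
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce(audio_batch, aligned_text)
shuffled_loss = info_nce(audio_batch, shuffled_text)
```

In this toy batch the aligned pairing yields a much smaller loss than the shuffled one, which is the behavior the objective rewards: audio features are pulled toward the text embedding of their own caption and pushed away from the others in the batch.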

Code & Models

Repositories

slseanwu/beats-conformer-bart-audio-captioner
PyTorch · Official

Models

slseanwu/beats-conformer-bart-audio-captioner
model · 6 downloads · 6 likes

Datasets

slseanwu/clotho-chatgpt-mixup-50K
dataset · 33 downloads

Taxonomy

Topics: Music and Audio Processing · Subtitles and Audiovisual Media · Speech Recognition and Synthesis