Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9

Do Hyun Lee; Yoonah Song; Hong Kook Kim

arXiv:2406.11248 · eess.AS · November 28, 2024

Open Access

TL;DR

This paper introduces a prompt-engineering-based caption augmentation method that uses large language models (LLMs) to improve language-queried audio source separation (LASS), evaluated on the DCASE 2024 Task 9 validation set.

Contribution

It presents a novel LLM-based caption augmentation technique that enhances LASS performance, with optimized prompts for effective caption generation.

Findings

01. Caption augmentation improves LASS separation performance

02. Optimized prompts yield better caption quality

03. Performance improves on the DCASE 2024 Task 9 validation set

Abstract

We present a prompt-engineering-based text-augmentation approach applied to a language-queried audio source separation (LASS) task. To enhance the performance of LASS, the proposed approach utilizes large language models (LLMs) to generate multiple captions corresponding to each sentence of the training dataset. To this end, we first perform experiments to identify the most effective prompts for caption augmentation with a smaller number of captions. A LASS model trained with these augmented captions demonstrates improved performance on the DCASE 2024 Task 9 validation set compared to that trained without augmentation. This study highlights the effectiveness of LLM-based caption augmentation in advancing language-queried audio source separation.
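
The augmentation idea described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the prompt wording, the helper names (build_prompt, augment_captions, sample_query), and the fake_llm stand-in are hypothetical and are not the authors' prompts or code. Drawing one caption at random from the pool per training example is also an assumed usage, not something the abstract states.

```python
import random
from typing import Callable, Dict, List

# Hypothetical prompt wording; the paper's optimized prompts are not reproduced here.
PROMPT_TEMPLATE = (
    "Rewrite the following audio caption in {n} different ways, "
    "keeping the described sound events unchanged. One caption per line.\n"
    "Caption: {caption}"
)


def build_prompt(caption: str, n: int) -> str:
    """Fill the (assumed) prompt template for one training caption."""
    return PROMPT_TEMPLATE.format(n=n, caption=caption)


def augment_captions(
    captions: List[str],
    generate: Callable[[str], str],
    n: int = 3,
) -> Dict[str, List[str]]:
    """Map each original caption to a pool: the original plus up to n LLM variants.

    `generate` is any callable that takes a prompt string and returns the
    model's text reply; plugging in a real LLM client is left to the reader.
    """
    augmented: Dict[str, List[str]] = {}
    for caption in captions:
        reply = generate(build_prompt(caption, n))
        variants = [line.strip() for line in reply.splitlines() if line.strip()]
        pool = [caption]  # always keep the original caption
        for variant in variants[:n]:
            if variant not in pool:  # drop exact duplicates
                pool.append(variant)
        augmented[caption] = pool
    return augmented


def sample_query(
    caption: str, augmented: Dict[str, List[str]], rng: random.Random
) -> str:
    """Pick one text query for a training example (assumed uniform sampling)."""
    return rng.choice(augmented.get(caption, [caption]))


def fake_llm(prompt: str) -> str:
    """Stand-in for an LLM so the sketch runs offline; returns trivial rewordings."""
    caption = prompt.rsplit("Caption: ", 1)[1]
    return (
        f"The sound of {caption}\n"
        f"An audio clip of {caption}\n"
        f"A recording of {caption}"
    )


if __name__ == "__main__":
    pool = augment_captions(["a dog barking"], fake_llm, n=3)
    print(pool["a dog barking"])
    print(sample_query("a dog barking", pool, random.Random(0)))
```

Passing the text generator in as a callable keeps the sketch independent of any particular LLM service, and keeping the original caption in each pool means the augmented set is a superset of the unaugmented training text.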


Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training