Audio Difference Learning for Audio Captioning

Tatsuya Komatsu; Yusuke Fujita; Kazuya Takeda; Tomoki Toda

arXiv:2309.08141 · eess.AS · September 18, 2023

TL;DR

This paper presents a new audio difference learning approach for audio captioning that leverages differential features between audio samples to generate more detailed captions, improving performance on standard datasets.

Contribution

The study introduces audio difference learning, a novel training paradigm that enhances audio captioning by focusing on differences between audio samples without requiring extra annotations.

Findings

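01. Achieved a 7% improvement in SPIDEr score over conventional methods in experiments on the Clotho and ESC50 datasets.

02. Demonstrated the effectiveness of differential features in capturing intricate audio details.

03. Proposed a mixing technique in which additional audio is mixed into the input and then used as the reference, so that the difference between the two reverts to the original input audio.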

Abstract

This study introduces a novel training paradigm, audio difference learning, for improving audio captioning. The fundamental concept of the proposed learning method is to create a feature representation space that preserves the relationship between audio, enabling the generation of captions that detail intricate audio information. This method employs a reference audio along with the input audio, both of which are transformed into feature representations via a shared encoder. Captions are then generated from these differential features to describe their differences. Furthermore, a unique technique is proposed that involves mixing the input audio with additional audio, and using the additional audio as a reference. This results in the difference between the mixed audio and the reference audio reverting back to the original input audio. This allows the original input's caption to be used as…
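
The mixing idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the encoder here is a hypothetical fixed linear projection (the names `encode`, `difference_feature`, `FRAME`, `DIM`, and `W` are all assumptions made for illustration), whereas the paper's shared encoder would be a trained network. With a linear encoder the difference between the mixed-audio feature and the reference feature equals the original input's feature exactly. With a real nonlinear encoder this would hold only approximately, which is presumably why it is treated as a training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME = 400  # samples per frame (hypothetical value)
DIM = 16     # feature dimension (hypothetical value)

# Hypothetical stand-in for a shared encoder: one fixed linear projection.
W = rng.standard_normal((FRAME, DIM)) / np.sqrt(FRAME)


def encode(wave):
    """Toy shared encoder: frame the waveform, project each frame, mean-pool."""
    n_frames = len(wave) // FRAME
    frames = wave[: n_frames * FRAME].reshape(n_frames, FRAME)
    return (frames @ W).mean(axis=0)


def difference_feature(input_wave, extra_wave):
    """Mix the input with extra audio and use the extra audio as the reference.

    Both signals go through the same encoder; the returned vector is the
    difference between the mixed-audio feature and the reference feature.
    """
    mixed = input_wave + extra_wave
    return encode(mixed) - encode(extra_wave)


# Random noise stands in for two one-second clips at 16 kHz.
input_wave = rng.standard_normal(16000)
extra_wave = rng.standard_normal(16000)

diff = difference_feature(input_wave, extra_wave)

# Under the linear-encoder assumption, the difference matches the feature of
# the original input, so the input's existing caption could supervise it.
print(np.allclose(diff, encode(input_wave)))
```

In this toy setting the difference feature can be paired with the caption already attached to the original input, which is consistent with the claim that no extra annotation is needed for the differences.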

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis