EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation

Mithun Manivannan (1); Vignesh Nethrapalli (1); Mark Cartwright (1); ((1) New Jersey Institute of Technology)

arXiv:2410.12028·cs.SD·October 17, 2024

Open Access

TL;DR

EmotionCaps introduces an emotion-augmented dataset for audio captioning, leveraging emotional context to generate higher-quality descriptions and improve model performance in environmental sound understanding.

Contribution

The paper presents a novel dataset with emotion-augmented synthetic captions and demonstrates how emotional information enhances audio captioning models.

Findings

01. Emotion-augmented data improves caption quality.

02. Models trained with EmotionCaps outperform baselines.

03. Emotional context aligns captions better with audio content.

Abstract

Recent progress in audio-language modeling, such as automated audio captioning, has benefited from training on synthetic data generated with the aid of large language models. However, such approaches for environmental sound captioning have primarily focused on audio event tags and have not explored leveraging emotional information that may be present in recordings. In this work, we explore the benefit of generating emotion-augmented synthetic audio caption data by instructing ChatGPT with additional acoustic information in the form of estimated soundscape emotion. To do so, we introduce EmotionCaps, an audio captioning dataset comprising approximately 120,000 audio clips with paired synthetic descriptions enriched with soundscape emotion recognition (SER) information. We hypothesize that this additional information will result in higher-quality captions that match the emotional tone…
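
The abstract describes instructing a language model with audio event tags plus an estimated soundscape emotion. As a rough illustration of that idea only, the sketch below assembles such an instruction as a plain string. The names `ClipAnnotation` and `build_caption_prompt`, the prompt wording, and the example tags and emotion label are all hypothetical; the paper's actual prompt, tagger, and emotion model are not given in this summary, and no language-model API is called here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClipAnnotation:
    """Hypothetical per-clip metadata; field names are illustrative only."""
    event_tags: List[str]  # assumed output of some audio event tagger
    emotion: str           # assumed label from some soundscape emotion estimator


def build_caption_prompt(clip: ClipAnnotation) -> str:
    """Assemble an emotion-augmented instruction for a caption-writing model.

    The wording is an assumption made for illustration, not the prompt
    used by the paper's authors.
    """
    tags = ", ".join(clip.event_tags)
    return (
        "Write a one-sentence caption for an audio clip.\n"
        f"Sound events: {tags}\n"
        f"Estimated soundscape emotion: {clip.emotion}\n"
        "The caption should reflect the emotional tone of the soundscape."
    )


if __name__ == "__main__":
    # Made-up example values, not drawn from the EmotionCaps dataset.
    example = ClipAnnotation(event_tags=["rain", "thunder"], emotion="calm")
    print(build_caption_prompt(example))
```

Keeping the tags and the emotion label as separate fields makes it easy to drop the emotion line and produce a tags-only prompt, which is the kind of baseline the abstract contrasts against.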

Taxonomy

Topics: Video Analysis and Summarization · Human Motion and Animation · Speech and dialogue systems