Caption Feature Space Regularization for Audio Captioning

Yiming Zhang; Hong Yu; Ruoyi Du; Zhanyu Ma; Yuan Dong

arXiv:2204.08409 · cs.SD · April 19, 2022

TL;DR

This paper introduces a two-stage framework for audio captioning that uses contrastive learning to create a proxy feature space, reducing caption disparities and improving model stability across different architectures.

Contribution

The proposed method employs contrastive learning to construct a proxy feature space, enhancing audio captioning stability and performance by aligning correlated captions.

Findings

01 Effective in reducing caption disparities
02 Improves model stability across architectures
03 Demonstrates superior performance on two datasets

Abstract

Audio captioning aims to describe the content of audio clips in natural language. Because audio is ambiguous, different people may perceive the same clip differently, which results in caption disparities (i.e., one audio clip may correspond to several captions with diverse semantics). To handle this, typical audio captioning models perform one-to-many training by randomly selecting one of the correlated captions as the ground truth for each audio clip. However, this leads to significant variation in the optimization directions and weakens model stability. To eliminate this negative effect, we propose a two-stage framework for audio captioning: (i) in the first stage, we use contrastive learning to construct a proxy feature space that reduces the distances between captions correlated to the same audio, and (ii) in the second stage, the proxy feature space is utilized as additional…
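
The first stage described above can be illustrated with a generic supervised contrastive loss over caption embeddings. This is only a minimal sketch: the abstract does not state the exact loss the authors use, so the formulation, the function names `cosine` and `caption_contrastive_loss`, and the `temperature` value are all assumptions made for illustration. Captions that share an audio id are treated as positives and all other captions in the batch as negatives, so minimizing the loss pulls captions of the same audio closer together.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (plain lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def caption_contrastive_loss(embeddings, audio_ids, temperature=0.1):
    """Hypothetical supervised-contrastive-style loss (not the paper's exact loss).

    embeddings: list of caption feature vectors.
    audio_ids:  audio clip id for each caption; equal ids mark positives.
    Returns the mean loss over anchors that have at least one positive.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and audio_ids[j] == audio_ids[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Temperature-scaled similarities to every other caption in the batch.
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in range(n) if j != i}
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability assigned to the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two audio clips (ids 0 and 1), two captions each.
ids = [0, 0, 1, 1]
aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scattered = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
loss_aligned = caption_contrastive_loss(aligned, ids)
loss_scattered = caption_contrastive_loss(scattered, ids)
```

In the toy batch, `aligned` places captions of the same clip close together and `scattered` places them far apart, so the loss is lower for `aligned`. In a real pipeline the embeddings would come from a trainable caption encoder and the loss would be computed with an autodiff framework rather than plain lists.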

Code & Models

Repositories

pris-cv/caption-feature-space-regularization (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization