Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement

Daiki Takeuchi; Yasunori Ohishi; Daisuke Niizumi; Noboru Harada; Kunio Kashino

arXiv:2308.11923·eess.AS·August 24, 2023·1 cites

TL;DR

This paper introduces Audio Difference Captioning (ADC), a novel task that describes differences between similar audio clips, and proposes a transformer-based model with disentanglement to improve difference extraction, validated on a new dataset.

Contribution

It presents the ADC task, a new dataset, and a transformer model with similarity-discrepancy disentanglement for effective difference captioning in audio.

Findings

01

The proposed model effectively describes audio differences.

02

Attention visualization confirms improved focus on differences.

03

The AudioDiffCaps dataset supports ADC research.

Abstract

We propose Audio Difference Captioning (ADC), a new extension of audio captioning that describes the semantic differences between a pair of similar but slightly different audio clips. ADC addresses the problem that conventional audio captioning sometimes generates similar captions for similar audio clips and so fails to describe how their content differs. We also propose a cross-attention-concentrated transformer encoder, which extracts differences by comparing the two audio clips, and a similarity-discrepancy disentanglement, which emphasizes the difference in the latent space. To evaluate the proposed methods, we built the AudioDiffCaps dataset, which consists of pairs of similar but slightly different audio clips with human-annotated descriptions of their differences. The experiment with the AudioDiffCaps dataset showed that the proposed methods solve the ADC task effectively and improve…
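
The idea of an encoder that compares a pair of clips through cross-attention can be illustrated with a small attention mask. The sketch below is not the authors' implementation: it assumes, purely for illustration, that the two clips are concatenated into one token sequence and that each token may attend only to tokens of the other clip. The function names `cross_only_mask` and `masked_softmax` are hypothetical, and the sketch uses only the Python standard library.

```python
import math

def cross_only_mask(len_a, len_b):
    """Build allowed[i][j]: True when query i may attend to key j.

    Assumption for illustration: tokens 0..len_a-1 belong to clip A,
    the remaining len_b tokens to clip B, and attention is permitted
    only across clips (A -> B and B -> A), never within a clip.
    """
    total = len_a + len_b
    in_a = [i < len_a for i in range(total)]
    return [[in_a[i] != in_a[j] for j in range(total)] for i in range(total)]

def masked_softmax(scores, allowed):
    """Row-wise softmax where disallowed positions get weight 0.

    Requires at least one allowed key per row, i.e. both clips non-empty.
    """
    out = []
    for row, ok in zip(scores, allowed):
        kept = [s for s, a in zip(row, ok) if a]
        peak = max(kept)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(row, ok)]
        norm = sum(exps)
        out.append([e / norm for e in exps])
    return out

# Toy example: clip A has 2 tokens, clip B has 3, all raw scores equal.
mask = cross_only_mask(2, 3)
scores = [[0.0] * 5 for _ in range(5)]
weights = masked_softmax(scores, mask)
```

With equal scores, each clip-A token spreads its attention uniformly over the three clip-B tokens, and each clip-B token over the two clip-A tokens. In a real model the scores would come from learned query-key products; the mask alone only shows how attention can be restricted to the comparison between clips.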

Code & Models

Repositories

nttcslab/audio-diff-caps
Official

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Natural Language Processing Techniques