DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval

Yifei Xin; Xuxin Cheng; Zhihong Zhu; Xusheng Yang; Yuexian Zou

arXiv:2409.10025·cs.SD·October 18, 2024

Open Access

TL;DR

DiffATR introduces a diffusion-based generative framework for audio-text retrieval, modeling joint distributions to improve out-of-domain performance and combining generative and discriminative training strategies.

Contribution

The paper proposes a novel diffusion-based generative model for ATR that captures joint distributions and enhances out-of-domain retrieval capabilities.

Findings

01. Outperforms existing methods on the AudioCaps and Clotho datasets.

02. Demonstrates robustness in out-of-domain retrieval scenarios.

03. Combines generative and discriminative training for improved performance.

Abstract

Existing audio-text retrieval (ATR) methods are essentially discriminative models that aim to maximize the conditional likelihood, represented as p(candidates|query). Nevertheless, this methodology fails to consider the intrinsic data distribution p(query), leading to difficulties in discerning out-of-distribution data. In this work, we attempt to tackle this constraint through a generative perspective and model the relationship between audio and text as their joint probability p(candidates,query). To this end, we present a diffusion-based ATR framework (DiffATR), which models ATR as an iterative procedure that progressively generates the joint distribution from noise. Throughout its training phase, DiffATR is optimized from both generative and discriminative viewpoints: the generator is refined through a generation loss, while the feature extractor benefits from a contrastive loss, thus…
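The two training signals named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact loss forms or noise schedule, so the code assumes a symmetric InfoNCE-style contrastive loss, the standard DDPM forward noising step, and a mean-squared-error generation loss. All function names (`contrastive_loss`, `forward_noise`, `generation_loss`) and the toy similarity values are hypothetical.

```python
import math
import random


def log_softmax(row):
    """Numerically stable log-softmax of one list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def contrastive_loss(sim):
    """Symmetric InfoNCE over a square similarity matrix (assumed form).

    sim[i][j] scores audio clip i against caption j; matched pairs
    are assumed to sit on the diagonal.
    """
    n = len(sim)
    cols = [list(c) for c in zip(*sim)]  # text-to-audio view
    a2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    t2a = -sum(log_softmax(cols[i])[i] for i in range(n)) / n
    return 0.5 * (a2t + t2a)


def forward_noise(x0, alpha_bar, rng):
    """Standard DDPM forward step: sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps


def generation_loss(pred, target):
    """Mean squared error between a denoiser's output and its target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Toy batch: 3 audio clips x 3 captions, matched pairs on the diagonal.
aligned = [[5.0, 0.1, 0.2], [0.0, 4.0, 0.3], [0.2, 0.1, 6.0]]
# Same scores with the first two captions swapped (mismatched pairs).
shuffled = [[0.1, 5.0, 0.2], [4.0, 0.0, 0.3], [0.2, 0.1, 6.0]]

# Assumed target row for query 0: all probability mass on candidate 0.
x0 = [1.0, 0.0, 0.0]
xt, eps = forward_noise(x0, alpha_bar=0.5, rng=random.Random(0))
```

In this sketch, a generator would be trained to recover `x0` (or `eps`) from the noised `xt`, scored by `generation_loss`, while `contrastive_loss` would update the feature extractor that produces the similarity scores. How DiffATR actually weights or schedules the two losses is not stated in the truncated abstract.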

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies