Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation

Manh Luong; Khai Nguyen; Nhat Ho; Reza Haf; Dinh Phung; Lizhen Qu

arXiv:2405.10084 · eess.AS · May 17, 2024

TL;DR

This paper introduces a scalable mini-batch learning-to-match framework with partial optimal transport for improved deep audio-text retrieval, achieving state-of-the-art results and noise robustness across multiple datasets.

Contribution

It proposes a mini-batch LTM framework with Mahalanobis metrics and partial optimal transport to address scalability and data misalignment issues in deep audio-text retrieval.
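As a minimal sketch of what a Mahalanobis-style ground cost between audio and text embeddings could look like, the snippet below computes (x - y)^T M (x - y) with M = L^T L, which keeps M positive semidefinite by construction. The squared form, the factor `L`, and the function names are assumptions made for illustration; this summary does not state the exact parameterization used in the paper.

```python
def mahalanobis_cost(x, y, L):
    """Squared Mahalanobis cost (x - y)^T M (x - y), with M = L^T L (assumed form)."""
    diff = [a - b for a, b in zip(x, y)]
    # z = L @ diff, so the cost equals ||z||^2 and is always non-negative.
    z = [sum(l_ij * d_j for l_ij, d_j in zip(row, diff)) for row in L]
    return sum(v * v for v in z)


def cost_matrix(audio_embs, text_embs, L):
    """Pairwise ground-cost matrix between a mini-batch of audio and text embeddings."""
    return [[mahalanobis_cost(a, t, L) for t in text_embs] for a in audio_embs]
```

With `L` set to the identity matrix the cost reduces to the squared Euclidean distance; a learned `L` would instead reweight and mix embedding dimensions.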

Findings

01

Achieves state-of-the-art performance on AudioCaps, Clotho, and ESC-50 datasets.

02

Surpasses triplet and contrastive loss in zero-shot sound event detection.

03

Demonstrates greater noise tolerance with partial optimal transport.
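To illustrate why partial optimal transport can tolerate misaligned pairs, here is a rough sketch that transports only a fraction `s` of the total mass. It uses the well-known dummy-point reduction (an extra zero-cost row and column that absorb the untransported mass) combined with Sinkhorn iterations. This is a stand-in technique chosen for illustration, not necessarily the solver used in the paper; `partial_plan`, `s`, and `eps` are hypothetical names and parameters.

```python
import math


def sinkhorn(C, a, b, eps=0.5, n_iters=500):
    """Entropic OT plan for cost matrix C with marginals a (rows) and b (columns)."""
    n, m = len(C), len(C[0])
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def partial_plan(C, s, eps=0.5, n_iters=500):
    """Approximate partial OT plan moving total mass s (0 < s < 1), uniform marginals."""
    n, m = len(C), len(C[0])
    # Dummy row/column cost nothing; the dummy-dummy cell is forbidden
    # (infinite cost gives a zero kernel entry), so real pairs carry mass s.
    C_ext = [list(row) + [0.0] for row in C] + [[0.0] * m + [float("inf")]]
    a_ext = [1.0 / n] * n + [1.0 - s]
    b_ext = [1.0 / m] * m + [1.0 - s]
    plan = sinkhorn(C_ext, a_ext, b_ext, eps, n_iters)
    # Drop the dummy row and column to recover the partial plan.
    return [row[:m] for row in plan[:n]]
```

In this sketch, pairs with high cost (for example, a mismatched caption) tend to send their mass to the dummy point rather than being forced into a match.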

Abstract

The Learning-to-match (LTM) framework proves to be an effective inverse optimal transport approach for learning the underlying ground metric between two sources of data, facilitating subsequent matching. However, the conventional LTM framework faces scalability challenges, necessitating the use of the entire dataset each time the parameters of the ground metric are updated. In adapting LTM to the deep learning context, we introduce the mini-batch Learning-to-match (m-LTM) framework for audio-text retrieval problems. This framework leverages mini-batch subsampling and Mahalanobis-enhanced family of ground metrics. Moreover, to cope with misaligned training data in practice, we propose a variant using partial optimal transport to mitigate the harm of misaligned data pairs in training data. We conduct extensive experiments on audio-text matching problems using three datasets: AudioCaps,…
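The mini-batch idea described in the abstract can be sketched as follows: compute an entropic optimal transport plan on a single batch with Sinkhorn iterations, then penalize its divergence from the observed pairing. The sketch assumes the i-th audio clip is paired with the i-th caption and uses a KL divergence between the empirical (diagonal) matching and the plan. The exact loss, regularization strength `eps`, and function names are assumptions, not taken from the paper.

```python
import math


def sinkhorn_plan(C, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals for a mini-batch cost matrix C."""
    n, m = len(C), len(C[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def matching_loss(plan):
    """KL(empirical || plan), assuming pair i matches pair i with mass 1/n."""
    n = len(plan)
    return sum((1.0 / n) * math.log((1.0 / n) / plan[i][i]) for i in range(n))
```

In a deep learning setting, the cost matrix would come from the encoders and ground metric, and the gradient of this loss would update their parameters one batch at a time instead of requiring the full dataset per update.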

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- Well motivated, inverse optimal transport seems worth exploring in the context of deep networks and minibatch training.
- Strong results. The approach consistently outperforms existing SOTA text-audio retrieval results on the most popular datasets, and the most widely used contrastive objectives.
- Generally well presented.

Weaknesses

- As their results are much better than previous approaches and standard contrastive training methods, I feel that this warrants further investigation. The training sets for AudioCaps and Clotho are rather small at 46K and 5K audio examples, respectively, and so regularization may be a very important factor. Their m-LTM approach is entropy regularized, while their Triplet and Contrastive baselines are not. An entropy-regularized contrastive loss baseline is the most natural analog here, and wo

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

The algorithmic design of this approach is well motivated and (for the most part) well-described. (The application of Projection Gradient Descent is effective.) The performance is particularly good compared to triplet and contrastive loss.

Weaknesses

Retrieval applications have an expectation of scaling. Ideally a single query would be used to retrieve one or more corresponding examples from an extremely large source. However, in this paper the datasets (particularly the test sets) have a fairly small source to retrieve from (a few thousand examples typically). It would strengthen the work substantially to demonstrate the capabilities of the algorithm to scale to instances where there are orders of magnitude more examples to retrieve from

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 4

Strengths

1. The study itself is well-motivated, novel and practical.
2. The mathematical foundation is good.
3. The experiments have been done on commonly-known datasets and the improvements are clearly-observed.

Weaknesses

1. Apart from the performance, it would be good to also show the acquired network architecture and the run-time efficiency of cross-modal inference.
2. The reference to "noise" is not clear and potentially confusing, even with clear references. "Noise" can mean many things. Especially for speech researchers, who are very likely to refer to this paper, the definition of "noise" here seems to differ from real-world interruption. That is totally fine, but please spend some text on cla

Code & Models

Repositories

v-manhlt3/m-ltm-audio-text-retrieval
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Diverse Musicological Studies · Music Technology and Sound Studies