Audio-text Retrieval in Context

Siyu Lou; Xuenan Xu; Mengyue Wu; Kai Yu

arXiv:2203.13645·cs.SD·March 30, 2022

TL;DR

This paper presents an audio-text retrieval system built on pre-trained PANNs audio features and descriptor-based NetRVLAD aggregation, which significantly improves retrieval performance on AudioCaps and CLOTHO. A qualitative analysis indicates that semantic mapping matters more than temporal relations in contextual retrieval.

Contribution

The work introduces a novel combination of pre-trained audio features and descriptor-based aggregation for improved contextual audio-text retrieval.

Findings

01

Significant improvement in retrieval metrics on AudioCaps and CLOTHO datasets.

02

Semantic mapping is more crucial than temporal relations in contextual retrieval.

03

Using PANNs features together with NetRVLAD pooling improves system performance.
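To make the descriptor-based aggregation in finding 03 concrete, here is a minimal NumPy sketch of NetRVLAD-style pooling as it is commonly formulated: each frame is softly assigned to K clusters, and the frame descriptors themselves (not residuals to cluster centres) are summed per cluster. This is an illustration under stated assumptions, not the authors' implementation. The function name `netrvlad_pool`, the dimensions, and the random weights standing in for learned parameters and for PANNs frame features are all hypothetical.

```python
import numpy as np

def netrvlad_pool(frames, assign_w, assign_b):
    """Pool a (T, D) frame sequence into one (K*D,) clip vector.

    frames:   (T, D) frame-level descriptors (stand-in for PANNs features)
    assign_w: (D, K) soft-assignment weights (learned in a real model)
    assign_b: (K,)   soft-assignment biases
    """
    # Soft-assign every frame to K clusters with a numerically stable softmax.
    logits = frames @ assign_w + assign_b               # (T, K)
    logits = logits - logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign = assign / assign.sum(axis=1, keepdims=True)

    # Aggregate the descriptors directly, weighted by assignment (no centres).
    pooled = assign.T @ frames                           # (K, D)

    # Normalise each cluster vector, then the flattened clip vector.
    pooled = pooled / (np.linalg.norm(pooled, axis=1, keepdims=True) + 1e-12)
    flat = pooled.reshape(-1)
    return flat / (np.linalg.norm(flat) + 1e-12)

# Toy usage with random placeholders for features and parameters.
rng = np.random.default_rng(0)
T, D, K = 50, 16, 4
frames = rng.standard_normal((T, D))
w = rng.standard_normal((D, K))
b = np.zeros(K)
clip_vec = netrvlad_pool(frames, w, b)
```

Because the pooling is a weighted sum over frames, reversing or shuffling the frame order leaves `clip_vec` unchanged up to floating-point error. That order-invariance fits the observation in finding 02 that semantic mapping matters more than temporal relations.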

Abstract

Audio-text retrieval based on natural language descriptions is a challenging task. It involves learning cross-modality alignments between long sequences under inadequate data conditions. In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment. Moreover, through a qualitative analysis we observe that semantic mapping is more important than temporal relations in contextual retrieval. Using pre-trained audio features and a descriptor-based aggregation method, we build our contextual audio-text retrieval system. Specifically, we utilize PANNs features pre-trained on a large sound event dataset and NetRVLAD pooling, which directly works with averaged descriptors. Experiments are conducted on the AudioCaps and CLOTHO datasets, and results are compared with the previous state-of-the-art system. With our proposed system, a…
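The retrieval step that such a system is evaluated on can be sketched as ranking audio clips by cosine similarity to a text query embedding. The sketch below scores the ranking with Recall@K, a common retrieval metric; the truncated abstract does not name the metrics used, so that choice is an assumption. The function name `recall_at_k` and the convention that caption i describes clip i are also hypothetical.

```python
import numpy as np

def recall_at_k(audio_emb, text_emb, k):
    """Text-to-audio Recall@K, assuming caption i describes clip i.

    audio_emb: (N, D) clip embeddings
    text_emb:  (N, D) caption embeddings in the same space
    """
    # L2-normalise so the dot product equals cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = t @ a.T                                   # (N_text, N_audio)

    # Indices of the k most similar clips for every caption.
    topk = np.argsort(-sim, axis=1)[:, :k]
    targets = np.arange(len(t))[:, None]
    # A query is a hit if its paired clip appears among its top k.
    return float((topk == targets).any(axis=1).mean())

# Toy check: one-hot "embeddings" where captions 0 and 1 are swapped.
audio = np.eye(4)
text = np.eye(4)[[1, 0, 2, 3]]
r1 = recall_at_k(audio, text, 1)
r4 = recall_at_k(audio, text, 4)
```

In this toy setup only captions 2 and 3 retrieve their paired clip at rank 1, while every caption finds its clip once all four candidates are allowed.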

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis