Audio-text Retrieval in Context
Siyu Lou, Xuenan Xu, Mengyue Wu, Kai Yu

TL;DR
This paper presents an audio-text retrieval system that combines pre-trained audio features with descriptor-based aggregation. The system significantly improves retrieval performance, and a qualitative analysis indicates that semantic mapping matters more than temporal relations in contextual retrieval.
Contribution
The work introduces a novel combination of pre-trained audio features and descriptor-based aggregation for improved contextual audio-text retrieval.
Findings
Significant improvement in retrieval metrics on AudioCaps and CLOTHO datasets.
Semantic mapping is more crucial than temporal relations in contextual retrieval.
Utilization of PANNs features and NetRVLAD pooling enhances system performance.
Abstract
Audio-text retrieval based on natural language descriptions is a challenging task. It involves learning cross-modality alignments between long sequences under inadequate data conditions. In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment. Moreover, through a qualitative analysis we observe that semantic mapping is more important than temporal relations in contextual retrieval. Using pre-trained audio features and a descriptor-based aggregation method, we build our contextual audio-text retrieval system. Specifically, we utilize PANNs features pre-trained on a large sound event dataset and NetRVLAD pooling, which directly works with averaged descriptors. Experiments are conducted on the AudioCaps and CLOTHO datasets, and results are compared with the previous state-of-the-art system. With our proposed system, a…
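The abstract names NetRVLAD pooling as the descriptor-based aggregation step. The following is a minimal NumPy sketch of that style of pooling, not the authors' implementation. It assumes the commonly described NetRVLAD form: each frame is softly assigned to K clusters, and the assignment-weighted frame features are summed per cluster without subtracting cluster centers. The function name `netrvlad_pool`, the parameter names, and all shapes are illustrative assumptions. In a real system the assignment weights would be learned, and the inputs would be frame-level features such as PANNs embeddings.

```python
import numpy as np


def netrvlad_pool(frames, assign_w, assign_b, eps=1e-12):
    """Aggregate a (T, D) sequence of frame features into one (K * D,) vector.

    frames:   (T, D) frame-level features (hypothetical stand-in for PANNs output)
    assign_w: (D, K) soft-assignment weights (learned in practice, random here)
    assign_b: (K,)   soft-assignment biases
    """
    # Soft-assign every frame to K clusters with a numerically stable softmax.
    logits = frames @ assign_w + assign_b            # (T, K)
    logits = logits - logits.max(axis=1, keepdims=True)
    soft = np.exp(logits)
    soft = soft / soft.sum(axis=1, keepdims=True)    # rows sum to 1

    # Assumed NetRVLAD step: weighted sum of raw features per cluster,
    # with no cluster-center residual.
    desc = soft.T @ frames                           # (K, D)

    # Normalize each cluster descriptor, then the flattened vector.
    desc = desc / (np.linalg.norm(desc, axis=1, keepdims=True) + eps)
    flat = desc.reshape(-1)                          # (K * D,)
    return flat / (np.linalg.norm(flat) + eps)
```

Because the pooled vector is a sum over frames, it does not depend on frame order. That property is in line with the observation that semantic mapping matters more than temporal relations, although the sketch itself does not demonstrate that claim.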
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
