Crowdsourcing and Evaluating Text-Based Audio Retrieval Relevances
Huang Xie, Khazar Khorrami, Okko Räsänen, Tuomas Virtanen

TL;DR
This study investigates the use of crowdsourced numeric relevance scores for text-based audio retrieval and finds that binary relevances from captioning are sufficient for effective contrastive learning, with limited benefit from crowdsourced scores.
Contribution
It introduces a method for grading audio-text relevance via crowdsourcing and evaluates its impact on retrieval system training and evaluation.
Findings
Adding binarized crowdsourced relevance scores to caption-based binary relevances yields no clear improvement in retrieval.
Binary relevances from captioning are sufficient for contrastive learning.
Using only caption-based binary labels is effective for training audio retrieval systems.
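To make concrete how binary relevances enter contrastive learning, here is a minimal sketch of a text-to-audio contrastive loss driven by a 0/1 relevance matrix. It is an illustration under stated assumptions, not the authors' implementation: the function name `contrastive_loss`, the multi-positive softmax form, and the default temperature of 0.07 are all hypothetical choices that do not come from the paper.

```python
import math
from typing import List


def contrastive_loss(
    sim: List[List[float]],
    rel: List[List[int]],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Mean text-to-audio contrastive loss over all captions.

    sim[i][j] is the similarity between caption i and audio clip j.
    rel[i][j] is 1 if the pair is marked relevant, otherwise 0.
    """
    if not sim:
        raise ValueError("sim must contain at least one caption row")
    total = 0.0
    for sim_row, rel_row in zip(sim, rel):
        if not any(rel_row):
            raise ValueError("each caption needs at least one relevant clip")
        logits = [s / temperature for s in sim_row]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        # Probability mass that the softmax assigns to the relevant clips.
        positive_mass = sum(e for e, r in zip(exps, rel_row) if r == 1)
        total += -math.log(positive_mass / sum(exps))
    return total / len(sim)


if __name__ == "__main__":
    similarities = [[0.9, 0.1], [0.2, 0.8]]
    caption_based = [[1, 0], [0, 1]]  # each caption paired with its own clip
    print(contrastive_loss(similarities, caption_based))
```

With caption-based labels alone, `rel` is an identity-like matrix. Extra positives, for example from binarized crowdsourced scores, would only add more 1 entries to it. A symmetric variant would average this loss with the same computation in the audio-to-text direction.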
Abstract
This paper explores grading text-based audio retrieval relevances with crowdsourcing assessments. Given a free-form text (e.g., a caption) as a query, crowdworkers are asked to grade audio clips using numeric scores (between 0 and 100) to indicate their judgements of how much the sound content of an audio clip matches the text, where 0 indicates no content match at all and 100 indicates perfect content match. We integrate the crowdsourced relevances into training and evaluating text-based audio retrieval systems, and evaluate the effect of using them together with binary relevances from audio captioning. Conventionally, these binary relevances are defined by captioning-based audio-caption pairs, where being positive indicates that the caption describes the paired audio, and being negative applies to all other pairs. Experimental results indicate that there is no clear benefit from…
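The relevance definitions in the abstract can be sketched in a few lines. This is a hedged illustration, not the paper's procedure: the helper names (`binarize_scores`, `merge_relevances`, `is_relevant`), the identifier format, and the threshold of 50 are assumptions introduced here. The text above does not say which threshold the authors use.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (caption_id, audio_id); hypothetical identifier format


def binarize_scores(scores: Dict[Pair, float], threshold: float = 50.0) -> Set[Pair]:
    """Return the pairs whose crowdsourced score (0 to 100) meets the threshold.

    The default threshold of 50 is an assumption made for illustration only.
    """
    for pair, score in scores.items():
        if not 0.0 <= score <= 100.0:
            raise ValueError(f"score for {pair} is outside the 0-100 range")
    return {pair for pair, score in scores.items() if score >= threshold}


def merge_relevances(caption_pairs: Set[Pair], crowd_positives: Set[Pair]) -> Set[Pair]:
    """Combine caption-based positives with binarized crowdsourced positives."""
    return caption_pairs | crowd_positives


def is_relevant(positives: Set[Pair], caption_id: str, audio_id: str) -> int:
    """Binary relevance: 1 for a positive pair, 0 for every other pair."""
    return 1 if (caption_id, audio_id) in positives else 0


if __name__ == "__main__":
    caption_pairs = {("c1", "a1")}  # caption c1 was written for clip a1
    crowd_scores = {("c1", "a2"): 80.0, ("c1", "a3"): 10.0}
    positives = merge_relevances(caption_pairs, binarize_scores(crowd_scores))
    print(sorted(positives))
```

Keeping the positives as a set of pairs means every pair that is absent is negative, which mirrors the convention that the negative label applies to all other pairs.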
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing
Methods: Contrastive Language-Image Pre-training
