Crowdsourcing and Evaluating Text-Based Audio Retrieval Relevances

Huang Xie; Khazar Khorrami; Okko Räsänen; Tuomas Virtanen

arXiv:2306.09820 · eess.AS · August 16, 2023 · 1 cites

TL;DR

This study investigates the use of crowdsourced numeric relevance scores for text-based audio retrieval and finds that binary relevances from captioning are sufficient for effective contrastive learning, with limited benefit from crowdsourced scores.

Contribution

It introduces a method for grading audio-text relevance via crowdsourcing and evaluates its impact on retrieval system training and evaluation.

Findings

01

Adding binarized crowdsourced relevances alongside caption-based binary relevances gives no clear benefit for contrastive learning.

02

Binary relevances from captioning are sufficient for contrastive learning.

03

Using only caption-based binary labels is effective for training audio retrieval systems.

Abstract

This paper explores grading text-based audio retrieval relevances with crowdsourcing assessments. Given a free-form text (e.g., a caption) as a query, crowdworkers are asked to grade audio clips using numeric scores (between 0 and 100) to indicate their judgements of how much the sound content of an audio clip matches the text, where 0 indicates no content match at all and 100 indicates perfect content match. We integrate the crowdsourced relevances into training and evaluating text-based audio retrieval systems, and evaluate the effect of using them together with binary relevances from audio captioning. Conventionally, these binary relevances are defined by captioning-based audio-caption pairs, where being positive indicates that the caption describes the paired audio, and being negative applies to all other pairs. Experimental results indicate that there is no clear benefit from…
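The two kinds of relevance described in the abstract can be illustrated with a minimal sketch. It assumes rows are text queries and columns are audio clips, that ungraded pairs are marked `None`, and that crowd scores are binarized at a hypothetical threshold of 50 and merged with an element-wise maximum. The function names, the threshold, and the merge rule are illustrative assumptions, not settings confirmed by the paper.

```python
def caption_binary_relevance(num_pairs):
    """Binary relevances from captioning: caption i is positive only
    for its own paired audio clip i; every other pair is negative."""
    return [[1 if i == j else 0 for j in range(num_pairs)]
            for i in range(num_pairs)]


def binarize_crowd_scores(scores, threshold=50.0):
    """Map crowdsourced 0-100 scores to {0, 1}.

    Pairs nobody graded (None) are treated as negative.
    The threshold value is a hypothetical choice for illustration.
    """
    return [[1 if s is not None and s >= threshold else 0 for s in row]
            for row in scores]


def combine_relevances(binary, crowd_binary):
    """A pair counts as positive if either source marks it positive."""
    return [[max(b, c) for b, c in zip(b_row, c_row)]
            for b_row, c_row in zip(binary, crowd_binary)]


# Toy example: 3 captions x 3 audio clips, with a few graded pairs.
crowd_scores = [
    [None, 80.0, 10.0],
    [None, None, None],
    [55.0, None, None],
]

combined = combine_relevances(
    caption_binary_relevance(3),
    binarize_crowd_scores(crowd_scores),
)
for row in combined:
    print(row)
```

In this toy case the merged matrix gains two extra positives (query 0 with clip 1, query 2 with clip 0) beyond the caption-defined diagonal; a contrastive loss could then treat those pairs as additional positives.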

Code & Models

Repositories

xieh97/retrieval-relevance-crowdsourcing
none · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing

Methods: Contrastive Language-Image Pre-training