Data leakage in cross-modal retrieval training: A case study

Benno Weck; Xavier Serra

arXiv:2302.12258 · cs.SD · August 29, 2023 · 1 citation

Open Access

TL;DR

This paper investigates data leakage issues in the SoundDesc cross-modal audio retrieval dataset, revealing duplicates that inflate performance metrics, and proposes new splits to provide a more accurate and challenging benchmark.

Contribution

The study identifies data leakage in SoundDesc and introduces revised dataset splits to improve evaluation integrity in cross-modal retrieval.

Findings

01 Original splits contained duplicates, causing data leakage.

02 New splits reduce leakage and increase task difficulty.

03 Revised dataset splits lead to more realistic performance assessments.
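
To make the first finding concrete, here is a minimal sketch of how duplicates shared between a training and a test split could be flagged. It is not the procedure used in the paper, which this summary does not describe. It assumes each item is available as raw bytes and only catches exact byte-level copies; the function names and the dictionary layout are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a SHA-256 digest identifying the exact byte content."""
    return hashlib.sha256(data).hexdigest()


def find_cross_split_duplicates(train, test):
    """Return (test_id, train_id) pairs whose content is byte-identical.

    `train` and `test` are hypothetical dicts mapping an item id to its
    raw bytes (for example, the contents of an audio file).
    """
    seen = {}
    for item_id, data in train.items():
        # Remember the first training item seen for each digest.
        seen.setdefault(content_hash(data), item_id)

    leaks = []
    for item_id, data in test.items():
        digest = content_hash(data)
        if digest in seen:
            leaks.append((item_id, seen[digest]))
    return leaks
```

Re-encoded or trimmed copies of the same recording would not match under this check; catching those would need a similarity measure on the audio or on the descriptions, which is outside the scope of this sketch.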

Abstract

The recent progress in text-based audio retrieval was largely propelled by the release of suitable datasets. Since the manual creation of such datasets is a laborious task, obtaining data from online resources can be a cheap solution to create large-scale datasets. We study the recently proposed SoundDesc benchmark dataset, which was automatically sourced from the BBC Sound Effects web page. In our analysis, we find that SoundDesc contains several duplicates that cause leakage of training data to the evaluation data. This data leakage ultimately leads to overly optimistic retrieval performance estimates in previous benchmarks. We propose new training, validation, and testing splits for the dataset that we make available online. To avoid weak contamination of the test data, we pool audio files that share similar recording setups. In our experiments, we find that the new splits serve as a…
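
The idea of pooling related recordings so that they cannot end up on both sides of a split can be sketched as a group-aware split. The code below is an illustration under stated assumptions, not the authors' implementation: the `group_of` callback, the split fractions, and the greedy filling strategy are all hypothetical choices, and how recordings are assigned to a shared "recording setup" group is left to the caller.

```python
import random
from collections import defaultdict


def group_split(items, group_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split items into (train, val, test) without dividing any group.

    `group_of` is a hypothetical callback that maps an item to a sortable
    group key (e.g. a recording-session label). All items that share a key
    are placed in the same split.
    """
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    # Shuffle group keys deterministically so the split is reproducible.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    targets = [fraction * len(items) for fraction in fractions]
    splits = ([], [], [])
    idx = 0
    for key in keys:
        # Move to the next split once the current one has reached its target.
        while idx < len(splits) - 1 and len(splits[idx]) >= targets[idx]:
            idx += 1
        splits[idx].extend(groups[key])
    return splits
```

Because whole groups are assigned at once, the resulting split sizes only approximate the requested fractions, and with few or very large groups a later split can end up empty. That trade-off is the price of keeping related recordings together.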

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Diverse Musicological Studies

Methods: Test