A quest through interconnected datasets: lessons from highly-cited ICASSP papers

Cynthia C. S. Liem; Doğa Taşcılar; Andrew M. Demetriou

arXiv:2410.03676 · cs.SD · October 8, 2024

TL;DR

This study analyzes highly-cited ICASSP papers to understand dataset origins and emphasizes the importance of transparency and accountability in data provenance for audio machine learning applications.

Contribution

It provides a detailed investigation into dataset usage in top ICASSP papers and advocates for greater transparency in data origins in the community.

Findings

01 Many datasets have unclear or entangled origins.

02 Current reporting often lacks detailed data provenance.

03 The community should incentivize explicit data documentation.

Abstract

As audio machine learning outcomes are deployed in societally impactful applications, it is important to have a sense of the quality and origins of the data used. Noting that being explicit about this is neither readily rewarded in academic publishing in applied machine learning domains nor included in typical applied machine learning curricula, we present a study into dataset usage connected to the top-5 cited papers at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP). In this study, we conduct thorough depth-first analyses into the origins of the datasets used, which often led to searches that had to go beyond what was reported in official papers and ended in unclear or entangled origins. Especially given the current pull towards larger, and possibly generative, AI models, awareness of the need for accountability on data provenance is increasing.…
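
The depth-first tracing of dataset origins mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure, which was a manual literature search. The graph `DERIVED_FROM`, the function `trace_origins`, and every dataset name below are hypothetical placeholders. The sketch assumes each dataset lists the sources it was derived from, treats a dataset with no entry as having an undocumented origin, and flags a dataset reached along more than one path as entangled.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical provenance graph: dataset -> datasets it was derived from.
# All names are illustrative placeholders, not datasets from the paper.
DERIVED_FROM: Dict[str, List[str]] = {
    "benchmark_a": ["corpus_b", "corpus_c"],
    "corpus_b": ["recordings_d"],
    "corpus_c": ["recordings_d", "web_scrape_e"],
    "recordings_d": [],  # documented as an original collection
    # "web_scrape_e" has no entry: its origin is undocumented
}


def trace_origins(
    dataset: str, graph: Dict[str, List[str]]
) -> Tuple[List[str], Set[str], Set[str], Set[str]]:
    """Depth-first walk from `dataset` towards its sources.

    Returns (visit order, documented roots, undocumented origins,
    datasets reached along more than one path).
    """
    order: List[str] = []
    roots: Set[str] = set()
    unclear: Set[str] = set()
    entangled: Set[str] = set()
    seen: Set[str] = set()

    def visit(name: str) -> None:
        if name in seen:
            # Reached again via another path (or a cycle): entangled.
            entangled.add(name)
            return
        seen.add(name)
        order.append(name)
        if name not in graph:
            unclear.add(name)  # no provenance record at all
            return
        sources = graph[name]
        if not sources:
            roots.add(name)  # recorded as having no upstream source
        for source in sources:
            visit(source)  # go deep before trying sibling sources

    visit(dataset)
    return order, roots, unclear, entangled


order, roots, unclear, entangled = trace_origins("benchmark_a", DERIVED_FROM)
print("visit order:", order)
print("documented roots:", sorted(roots))
print("undocumented origins:", sorted(unclear))
print("entangled:", sorted(entangled))
```

In this toy graph, the walk follows `corpus_b` down to `recordings_d` before returning to `corpus_c`, so the shared source `recordings_d` is flagged as entangled and `web_scrape_e` as undocumented.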

Taxonomy

Topics: Data Mining Algorithms and Applications · Data Quality and Management