TEDI: Trustworthy and Ethical Dataset Indicators to Analyze and Compare Dataset Documentation

Wiebke Hutiri; Mircea Cimpoi; Morgan Scheuerman; Victoria Matthews; Alice Xiang

arXiv:2505.17841 · cs.CY · May 26, 2025

TL;DR

This paper introduces TEDI, a comprehensive set of 143 indicators for systematically analyzing and comparing the trustworthy and ethical attributes of multimodal dataset documentation, aiming to improve transparency in AI datasets.

Contribution

The paper presents TEDI, a novel framework with detailed indicators for empirical assessment of dataset documentation's ethical and trustworthy aspects, supported by analysis of over 100 multimodal datasets.

Findings

1. Few datasets document consent, privacy, and harmful content indicators.

2. Documentation quality varies with data collection methods.

3. Scraping is a common collection method, but documentation of scraped datasets tends to cover fewer ethical indicators, while documentation of directly collected datasets often covers more.

Abstract

Dataset transparency is a key enabler of responsible AI, but insights into multimodal dataset attributes that impact trustworthy and ethical aspects of AI applications remain scarce and are difficult to compare across datasets. To address this challenge, we introduce Trustworthy and Ethical Dataset Indicators (TEDI) that facilitate the systematic, empirical analysis of dataset documentation. TEDI encompasses 143 fine-grained indicators that characterize trustworthy and ethical attributes of multimodal datasets and their collection processes. The indicators are framed to extract verifiable information from dataset documentation. Using TEDI, we manually annotated and analyzed over 100 multimodal datasets that include human voices. We further annotated data sourcing, size, and modality details to gain insights into the factors that shape trustworthy and ethical dimensions across datasets.…
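
As a rough illustration of how indicator-based annotation could support comparison across datasets, the sketch below stores each dataset's annotations as boolean indicators and computes, per sourcing method, the share of datasets that document each one. The indicator names, the record layout, the `coverage_by_sourcing` helper, and the toy records are all assumptions made for this example; they are not the paper's actual 143 indicators, its annotation schema, or its results.

```python
from collections import defaultdict

# Hypothetical indicator names, chosen only for illustration.
INDICATORS = [
    "consent_documented",
    "privacy_measures_documented",
    "harmful_content_addressed",
]

# Made-up toy annotations (not data from the paper).
annotations = [
    {"dataset": "toy_scraped_a", "sourcing": "scraped",
     "indicators": {"consent_documented": False,
                    "privacy_measures_documented": False,
                    "harmful_content_addressed": True}},
    {"dataset": "toy_scraped_b", "sourcing": "scraped",
     "indicators": {"consent_documented": False,
                    "privacy_measures_documented": True,
                    "harmful_content_addressed": False}},
    {"dataset": "toy_direct_a", "sourcing": "direct",
     "indicators": {"consent_documented": True,
                    "privacy_measures_documented": True,
                    "harmful_content_addressed": False}},
    {"dataset": "toy_direct_b", "sourcing": "direct",
     "indicators": {"consent_documented": True,
                    "privacy_measures_documented": False,
                    "harmful_content_addressed": False}},
]


def coverage_by_sourcing(records, indicator_names):
    """Return, per sourcing method, the fraction of datasets documenting each indicator."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for rec in records:
        totals[rec["sourcing"]] += 1
        for name in indicator_names:
            # A missing indicator is treated as "not documented".
            if rec["indicators"].get(name, False):
                counts[rec["sourcing"]][name] += 1
    return {
        src: {name: counts[src][name] / totals[src] for name in indicator_names}
        for src in totals
    }


coverage = coverage_by_sourcing(annotations, INDICATORS)
for sourcing, shares in coverage.items():
    print(sourcing, shares)
```

Treating each indicator as a yes/no fact that can be checked against the documentation keeps the records comparable across datasets, which is the property the abstract emphasizes; a real schema would likely also need to record where in the documentation each answer was found.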
