Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning

Joan Giner-Miguelez; Abel Gómez; Jordi Cabot

arXiv:2404.15320·cs.DL·May 27, 2024·2 cites

Open Access · 2 Repos

TL;DR

This paper presents a method that uses large language models to automatically extract key documentation dimensions of datasets from unstructured text, producing machine-readable descriptions that support compliance checks, discoverability, and quality assessment of ML datasets.

Contribution

The work introduces a prompt-based approach leveraging LLMs to extract dataset documentation dimensions, with an open-source tool and validation on scientific dataset papers.

Findings

01. GPT-3.5 achieves 81.21% accuracy in extraction

02. Prompt strategies improve extraction accuracy

03. Open-source tool available for replication

Abstract

Recent regulatory initiatives like the European AI Act and relevant voices in the Machine Learning (ML) community stress the need to describe datasets along several key dimensions for trustworthy AI, such as the provenance processes and social concerns. However, this information is typically presented as unstructured text in accompanying documentation, hampering its automated analysis and processing. In this work, we explore using large language models (LLMs) and a set of prompting strategies to automatically extract these dimensions from documents and enrich the dataset description with them. Our approach could aid data publishers and practitioners in creating machine-readable documentation to improve the discoverability of their datasets, assess their compliance with current AI regulations, and improve the overall quality of ML models trained on them. In this paper, we evaluate the…
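
The extraction idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' tool or their actual prompts: the dimension names, the question wording, and the helpers `build_prompt`, `extract_dimensions`, `enrich`, and `fake_llm` are all assumptions made for illustration. The model is passed in as a plain callable, so any LLM client could be substituted for the stub.

```python
from typing import Callable, Dict

# Hypothetical dimensions and questions, loosely based on the examples
# named in the abstract (provenance, social concerns). Not the paper's prompts.
DIMENSIONS: Dict[str, str] = {
    "provenance": "How, by whom, and from what sources was the data collected?",
    "social_concerns": "What biases, privacy issues, or social impacts are reported?",
}


def build_prompt(question: str, document: str) -> str:
    """Build one extraction prompt that restricts the answer to the document."""
    return (
        "Answer using only the documentation below. "
        'If the answer is not present, reply "not found".\n\n'
        f"Documentation:\n{document}\n\n"
        f"Question: {question}\nAnswer:"
    )


def extract_dimensions(document: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Ask one question per dimension and keep only the answers that were found."""
    result: Dict[str, str] = {}
    for name, question in DIMENSIONS.items():
        answer = llm(build_prompt(question, document)).strip()
        if answer and answer.lower() != "not found":
            result[name] = answer
    return result


def enrich(description: dict, extracted: Dict[str, str]) -> dict:
    """Return a copy of the dataset description with the extracted dimensions attached."""
    return {**description, "extracted_dimensions": extracted}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption): answers only the provenance question."""
    question = prompt.split("Question:")[-1]
    if "collected" in question:
        return "Collected from public hospital records by the authors."
    return "not found"


if __name__ == "__main__":
    doc_text = "We gathered the records from public hospitals in 2020."
    extracted = extract_dimensions(doc_text, fake_llm)
    print(enrich({"name": "example-dataset"}, extracted))
```

Asking one question per dimension and allowing an explicit "not found" reply keeps missing information out of the enriched description instead of forcing the model to guess.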

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training