Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning
Joan Giner-Miguelez, Abel Gómez, Jordi Cabot

TL;DR
This paper presents a method using large language models to automatically extract key dataset documentation dimensions, enhancing machine-readable descriptions for better compliance, discoverability, and quality assessment of datasets in ML.
Contribution
The work introduces a prompt-based approach leveraging LLMs to extract dataset documentation dimensions, with an open-source tool and validation on scientific dataset papers.
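To make the prompt-based idea concrete, here is a minimal sketch of how one question per documentation dimension could be posed to a model over passages taken from a dataset paper. It is an illustration under stated assumptions, not the authors' tool: the dimension names, the question wording, the `NOT FOUND` convention, and the `llm` callable (any function mapping a prompt string to an answer string) are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical dimensions and questions, chosen only for illustration.
DIMENSION_QUESTIONS: Dict[str, str] = {
    "provenance": "How was the dataset collected, and by whom?",
    "social_concerns": "Does the text report biases or privacy issues in the dataset?",
}

# Assumed sentinel the model is asked to return when the context has no answer.
NOT_FOUND = "NOT FOUND"


def build_prompt(question: str, passages: List[str]) -> str:
    """Build a prompt that restricts the model to the supplied passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below. "
        f"If the context does not contain the answer, reply exactly '{NOT_FOUND}'.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def extract_dimensions(
    passages: List[str], llm: Callable[[str], str]
) -> Dict[str, Optional[str]]:
    """Ask one question per dimension; map the sentinel answer to None."""
    results: Dict[str, Optional[str]] = {}
    for name, question in DIMENSION_QUESTIONS.items():
        answer = llm(build_prompt(question, passages)).strip()
        results[name] = None if answer.upper() == NOT_FOUND else answer
    return results
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider, and asking for an explicit sentinel is one simple way to let a model say that a dimension is not documented rather than invent an answer.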
Findings
GPT3.5 achieves 81.21% accuracy in extraction
Prompt strategies improve extraction accuracy
Open-source tool available for replication
Abstract
Recent regulatory initiatives like the European AI Act and relevant voices in the Machine Learning (ML) community stress the need to describe datasets along several key dimensions for trustworthy AI, such as the provenance processes and social concerns. However, this information is typically presented as unstructured text in accompanying documentation, hampering its automated analysis and processing. In this work, we explore using large language models (LLMs) and a set of prompting strategies to automatically extract these dimensions from documents and enrich the dataset description with them. Our approach could aid data publishers and practitioners in creating machine-readable documentation to improve the discoverability of their datasets, assess their compliance with current AI regulations, and improve the overall quality of ML models trained on them. In this paper, we evaluate the…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
