FAIR Enough: How Can We Develop and Assess a FAIR-Compliant Dataset for Large Language Models' Training?
Shaina Raza, Shardul Ghuge, Chen Ding, Elham Dolatabadi, Deval Pandya

TL;DR
This paper proposes a new framework and checklist to develop and assess datasets for training Large Language Models that adhere to FAIR principles, aiming to improve data ethics and integrity.
Contribution
It introduces a comprehensive framework and checklist for integrating FAIR data principles into LLM training, filling a gap in ethical data management for AI.
Findings
Validated the framework through a case study on bias detection datasets
Demonstrated improved data management for ethical LLM development
Provided practical tools for FAIR compliance in AI datasets
Abstract
The rapid evolution of Large Language Models (LLMs) highlights the necessity for ethical considerations and data integrity in AI development, particularly emphasizing the role of FAIR (Findable, Accessible, Interoperable, Reusable) data principles. While these principles are crucial for ethical data stewardship, their specific application in the context of LLM training data remains an under-explored area. This research gap is the focus of our study, which begins with an examination of existing literature to underline the importance of FAIR principles in managing data for LLM training. Building upon this, we propose a novel framework designed to integrate FAIR principles into the LLM development lifecycle. A contribution of our work is the development of a comprehensive checklist intended to guide researchers and developers in applying FAIR data principles consistently across the model…
Taxonomy
Topics: Research Data Management Practices · Natural Language Processing Techniques · Semantic Web and Ontologies
