Open Artificial Knowledge
Vadim Borisov, Richard H. Schreiber

TL;DR
The paper introduces the OAK dataset, a large-scale, high-quality text resource generated using multiple advanced LLMs, aimed at improving data diversity, coverage, and ethical sourcing for training future language models.
Contribution
It presents the creation of the OAK dataset, which uses an ensemble of LLMs guided by Wikipedia's main categories to address data scarcity and quality issues in LLM training.
Findings
OAK contains over 500 million tokens at the time of writing.
Generated data covers diverse domains with high coherence.
The dataset is publicly available for research.
Abstract
The tremendous success of chat-based AI systems like ChatGPT, Claude, and Gemini stems from Large Language Models (LLMs) trained on vast amounts of data. However, acquiring high-quality, diverse, and ethically sourced training data remains a significant challenge. We introduce the Open Artificial Knowledge (OAK) dataset, a large-scale resource of over 500 million tokens (at the time of writing) designed to address this issue. OAK leverages an ensemble of state-of-the-art LLMs, including GPT4o, LLaMa3-70B, LLaMa3-8B, Mixtral-8x7B, Gemma-7B, and Gemma-2-9B, to generate high-quality text across diverse domains, guided by Wikipedia's main categories. Our methodology ensures broad knowledge coverage while maintaining coherence and factual accuracy. The OAK dataset aims to foster the development of more capable and aligned language models while addressing critical issues of data…
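The abstract describes category-guided generation with an ensemble of models but does not say how prompts are distributed across them. As a rough sketch only, the following assumes a simple round-robin assignment; the category list, the prompt wording, and the `build_jobs` helper are all hypothetical and are not taken from the paper. Only the model names come from the abstract.

```python
from itertools import cycle

# Model names as listed in the abstract; used here only as labels.
MODELS = ["GPT4o", "LLaMa3-70B", "LLaMa3-8B", "Mixtral-8x7B", "Gemma-7B", "Gemma-2-9B"]

# Placeholder categories (assumption), not the paper's actual category list.
CATEGORIES = ["History", "Mathematics", "Technology"]


def build_jobs(categories, models, prompts_per_category=2):
    """Pair each category-derived prompt with a model, rotating through the ensemble.

    Round-robin assignment is an assumption made for illustration.
    """
    rotation = cycle(models)
    jobs = []
    for category in categories:
        for i in range(prompts_per_category):
            # Hypothetical prompt template.
            prompt = f"Write an informative passage about {category} (variant {i + 1})."
            jobs.append({"category": category, "model": next(rotation), "prompt": prompt})
    return jobs


jobs = build_jobs(CATEGORIES, MODELS)
for job in jobs:
    print(job["model"], "<-", job["prompt"])
```

Each job would then be sent to the named model through whatever client that model requires; that step is left out because the source gives no details about it.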
Taxonomy
Topics: Semantic Web and Ontologies
