Open Artificial Knowledge

Vadim Borisov; Richard H. Schreiber

arXiv:2407.14371·cs.CL·July 22, 2024

TL;DR

The paper introduces the OAK dataset, a large-scale, high-quality text resource generated using multiple advanced LLMs, aimed at improving data diversity, coverage, and ethical sourcing for training future language models.

Contribution

The paper presents the construction of the OAK dataset, which uses an ensemble of LLMs guided by Wikipedia's main categories to address data scarcity and quality issues in LLM training.

Findings

01. OAK contains over 500 million tokens.

02. Generated data covers diverse domains with high coherence.

03. The dataset is publicly available for research.

Abstract

The tremendous success of chat-based AI systems like ChatGPT, Claude, and Gemini stems from Large Language Models (LLMs) trained on vast amounts of data. However, acquiring high-quality, diverse, and ethically sourced training data remains a significant challenge. We introduce the Open Artificial Knowledge (OAK) dataset, a large-scale resource of over 500 million tokens (at the time of writing) designed to address this issue. OAK leverages an ensemble of state-of-the-art LLMs, including GPT4o, LLaMa3-70B, LLaMa3-8B, Mixtral-8x7B, Gemma-7B, and Gemma-2-9B, to generate high-quality text across diverse domains, guided by Wikipedia's main categories. Our methodology ensures broad knowledge coverage while maintaining coherence and factual accuracy. The OAK dataset aims to foster the development of more capable and aligned language models while addressing critical issues of data…
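The abstract describes category-guided generation across an ensemble of models only at a high level. The following is a minimal sketch of how such work could be distributed, not the authors' pipeline. The model names come from the abstract; the category list, the prompt wording, the round-robin assignment, and the `build_jobs` helper are all hypothetical placeholders chosen for illustration.

```python
from itertools import cycle

# Model names as listed in the abstract.
MODELS = ["GPT4o", "LLaMa3-70B", "LLaMa3-8B", "Mixtral-8x7B", "Gemma-7B", "Gemma-2-9B"]

# Hypothetical placeholder categories; the abstract does not enumerate them.
CATEGORIES = ["History", "Mathematics", "Technology"]


def build_jobs(categories, models, prompts_per_category=2):
    """Assign each (category, variant) prompt to a model in round-robin order.

    Round-robin is an assumption made for this sketch; the source does not
    say how prompts are distributed across the ensemble.
    """
    model_cycle = cycle(models)
    jobs = []
    for category in categories:
        for variant in range(1, prompts_per_category + 1):
            jobs.append(
                {
                    "model": next(model_cycle),
                    "category": category,
                    # Hypothetical prompt template, not taken from the paper.
                    "prompt": (
                        f"Write an informative passage about a topic in "
                        f"{category} (variant {variant})."
                    ),
                }
            )
    return jobs


jobs = build_jobs(CATEGORIES, MODELS)
for job in jobs:
    print(job["model"], "|", job["category"], "|", job["prompt"])
```

Each job would then be sent to the named model by whatever client that model requires; that step is omitted here because the abstract does not describe it.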

Code & Models

Datasets

tabularisai/oak
dataset · 109 downloads

Taxonomy

Topics: Semantic Web and Ontologies