# The ROOTS Search Tool: Data Transparency for LLMs

**Authors:** Aleksandra Piktus, Christopher Akiki, Paulo Villegas, Hugo Laurençon, Gérard Dupont, Alexandra Sasha Luccioni, Yacine Jernite, Anna Rogers

arXiv: 2302.14035 · 2023-02-28

## TL;DR

The paper introduces the ROOTS Search Tool, an open-source search engine over the entire 1.6TB multilingual ROOTS corpus. It is intended to support data transparency and governance for large language model training.

## Contribution

The paper describes the implementation of a search tool covering the whole ROOTS corpus, along with possible use cases, so that the data used to train BLOOM can be inspected directly.

## Key findings

- ROOTS is the largest corpus to date that can be investigated with both fuzzy and exact search
- Open-sourced tool available on Hugging Face Spaces
- Facilitates data transparency and governance in LLM training
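
To make the distinction between the two search modes concrete, here is a minimal sketch on a toy corpus. It is not the tool's code and makes no claim about how the ROOTS Search Tool is implemented (the full paper covers that). The documents, the `tokenize` helper, and the token-overlap scoring are all assumptions chosen for illustration: exact search is modeled as verbatim substring matching, and fuzzy search as ranking by shared query tokens.

```python
from collections import Counter

# Hypothetical toy corpus for illustration; these are not ROOTS documents.
DOCS = [
    "BLOOM is a multilingual language model.",
    "ROOTS is a multilingual text corpus.",
    "Search engines rank documents by relevance.",
]


def tokenize(text):
    """Lowercase whitespace tokens with trailing punctuation stripped (assumed scheme)."""
    return [t.strip(".,").lower() for t in text.split()]


def exact_search(query, docs):
    """Return indices of documents that contain the query as a verbatim substring."""
    return [i for i, d in enumerate(docs) if query in d]


def fuzzy_search(query, docs, k=2):
    """Rank documents by how many query tokens they share; return the top-k indices.

    A simple stand-in for a relevance-ranked search; real engines use
    stronger scoring functions.
    """
    q = Counter(tokenize(query))
    scored = []
    for i, d in enumerate(docs):
        d_tokens = Counter(tokenize(d))
        score = sum(min(q[t], d_tokens[t]) for t in q)
        if score > 0:
            scored.append((score, i))
    # Highest score first; ties broken by document order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


if __name__ == "__main__":
    print(exact_search("multilingual text corpus", DOCS))
    print(exact_search("corpus multilingual", DOCS))
    print(fuzzy_search("corpus multilingual", DOCS))
```

In this sketch, reordering the query words defeats the exact search but still returns ranked results from the fuzzy search. That is the practical difference between looking up a specific passage and exploring a topic.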

## Abstract

ROOTS is a 1.6TB multilingual text corpus developed for the training of BLOOM, currently the largest language model explicitly accompanied by commensurate data governance efforts. In continuation of these efforts, we present the ROOTS Search Tool: a search engine over the entire ROOTS corpus offering both fuzzy and exact search capabilities. ROOTS is the largest corpus to date that can be investigated this way. The ROOTS Search Tool is open-sourced and available on Hugging Face Spaces. We describe our implementation and the possible use cases of our tool.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.14035/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/2302.14035/full.md

## References

54 references — full list in the complete paper: https://tomesphere.com/paper/2302.14035/full.md

---
Source: https://tomesphere.com/paper/2302.14035