# Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval

**Authors:** Inés Altemir Marinas, Anastasiia Kucherenko, Andrei Kucharavy

arXiv: 2508.21788 · 2025-09-01

## TL;DR

This paper introduces an ElasticSearch-based framework for indexing and searching the large-scale web datasets used to train language models, so that problematic content can be located and analyzed in real time.

## Contribution

It presents an ElasticSearch-based pipeline for efficient, real-time analysis of multi-terabyte web datasets, improving data quality assessment for LLM training.
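As a rough illustration of what indexing and querying such a corpus can look like, the sketch below builds the two request bodies involved: an NDJSON payload for Elasticsearch's `_bulk` endpoint and a Query DSL phrase search with an optional language filter. It is not the authors' pipeline. The index name `fineweb2-demo`, the field names `text` and `language`, and both helper functions are assumptions made for this example, and the code only constructs the payloads without contacting a cluster.

```python
import json

# Hypothetical index name, chosen for this sketch only.
INDEX = "fineweb2-demo"


def to_bulk_ndjson(docs, index=INDEX):
    """Serialize documents into an NDJSON body for the _bulk endpoint.

    Each document becomes two lines: an action line naming the target
    index and document id, then the document source itself.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        # Field names "text" and "language" are assumptions, not the paper's schema.
        lines.append(
            json.dumps(
                {"text": doc["text"], "language": doc["language"]},
                ensure_ascii=False,
            )
        )
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


def phrase_query(phrase, language=None, size=10):
    """Build a Query DSL body that searches for an exact phrase.

    The optional language filter assumes "language" is mapped as a keyword
    field, so the term filter matches the stored value exactly.
    """
    query = {"bool": {"must": [{"match_phrase": {"text": phrase}}]}}
    if language is not None:
        query["bool"]["filter"] = [{"term": {"language": language}}]
    return {"query": query, "size": size}


if __name__ == "__main__":
    sample = [{"id": "doc-1", "text": "Ein Beispieltext.", "language": "de"}]
    print(to_bulk_ndjson(sample), end="")
    print(json.dumps(phrase_query("Beispieltext", language="de"), indent=2))
```

In practice these bodies would be sent to a running cluster, for example through the official Python client. Keeping the language constraint in `filter` rather than `must` leaves it out of relevance scoring and lets Elasticsearch cache it.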

## Key findings

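- Most searches over the 1.5TB, four-language FineWeb-2 corpus return in milliseconds, and all complete in under 2 seconds
- Real-time analysis of a full LLM training dataset is feasible, where prior work on harmful content was limited to small samples
- The resulting tools support safer, more accountable handling of AI training data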

## Abstract

Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance: most searches complete in milliseconds, and all finish in under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21788/full.md

## References

11 references — full list in the complete paper: https://tomesphere.com/paper/2508.21788/full.md

---
Source: https://tomesphere.com/paper/2508.21788