FineFreq: A Multilingual Character Frequency Dataset from Web-Scale Text
Binbin Xu

TL;DR
FineFreq is a multilingual character frequency dataset built from web-scale text. It covers over 1900 languages with aggregate and year-level counts for 2013-2025, supporting linguistic and computational research.
Contribution
This work introduces FineFreq, a large-scale multilingual character frequency dataset derived from web-scale text, with per-character, per-language, and year-level statistics, plus Unicode metadata (category, script, block) for each character.
Findings
Contains frequency counts for 96 trillion characters, processed from 57 TB of compressed text.
Supports fine-grained temporal and multilingual analysis.
Includes Unicode metadata for advanced filtering.
Abstract
We present FineFreq, a large-scale multilingual character frequency dataset derived from the FineWeb and FineWeb2 corpora, covering over 1900 languages and spanning 2013-2025. The dataset contains frequency counts for 96 trillion characters processed from 57 TB of compressed text. For each language, FineFreq provides per-character statistics with aggregate and year-level frequencies, allowing fine-grained temporal analysis. The dataset preserves naturally occurring multilingual features such as cross-script borrowings, emoji, and acronyms without applying artificial filtering. Each character entry includes Unicode metadata (category, script, block), enabling domain-specific or other downstream filtering and analysis. The full dataset is released in both CSV and Parquet formats, with associated metadata, available on GitHub and HuggingFace. https://github.com/Bin-2/FineFreq
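As a rough illustration of the kind of statistics described above, the sketch below counts characters per year over a toy corpus and attaches Unicode metadata to each character. It is not the author's pipeline: the function names `char_frequencies` and `describe` and the sample data are hypothetical, and Python's standard `unicodedata` module exposes only the general category and character name, so the script and block fields that FineFreq provides are omitted here.

```python
import unicodedata
from collections import Counter


def char_frequencies(docs):
    """Count characters per year. `docs` is an iterable of (year, text) pairs."""
    per_year = {}
    for year, text in docs:
        per_year.setdefault(year, Counter()).update(text)
    return per_year


def describe(ch):
    """Return basic Unicode metadata for one character (stdlib only)."""
    return {
        "char": ch,
        "codepoint": f"U+{ord(ch):04X}",
        "category": unicodedata.category(ch),  # e.g. "Ll", "So"
        "name": unicodedata.name(ch, ""),      # empty string if unnamed
    }


# Toy corpus (hypothetical): one accented letter and one emoji, kept unfiltered.
sample = [(2013, "h\u00e9llo"), (2014, "hello \U0001F642")]

per_year = char_frequencies(sample)
total = sum(per_year.values(), Counter())  # aggregate counts across years

# Downstream filtering by Unicode category, e.g. keep only letters.
letters = {ch: n for ch, n in total.items()
           if unicodedata.category(ch).startswith("L")}
```

Keeping the year-level counters separate from the aggregate mirrors the aggregate-plus-yearly layout the abstract describes, and filtering happens only at analysis time, so the raw counts stay intact.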
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Digital Humanities and Scholarship
