SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark
Khalid Yusuf Dahir

TL;DR
SomaliWeb v1 is a curated Somali language corpus with a dedicated tokenizer and benchmark, addressing gaps in existing resources for Somali NLP research.
Contribution
It introduces a high-quality Somali corpus, a matched tokenizer, and a public language-identification benchmark, filling a critical resource gap.
Findings
Existing Somali datasets contain significant duplicates and quality issues.
The new tokenizer reduces token count by 40.2% compared to GPT-4's tokenizer.
The corpus and tools enable better Somali NLP model development.
Abstract
Somali is a Cushitic language of the Horn of Africa with ~25 million speakers, yet no documented dedicated Somali pretraining corpus with a companion tokenizer and language-identification benchmark has been publicly released. Existing Somali text appears either inside multilingual distributions (HPLT v2, CC100, MADLAD-400, OSCAR, mC4) or in small, undocumented Somali-only uploads on Hugging Face. We introduce SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents (~303M tokens) built from three upstream sources (HPLT v2, CC100, Somali Wikipedia) through a six-stage reproducible pipeline. We release (i) the corpus, (ii) a matched BPE-16K tokenizer, and (iii) the first public side-by-side Somali benchmark of three production language identifiers. Our measurements reveal concrete quality defects in existing distributions: HPLT v2's "cleaned" Somali release retains 17.3%…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
