SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

Khalid Yusuf Dahir

arXiv:2605.18232 · cs.CL · May 19, 2026

TL;DR

SomaliWeb v1 is a quality-filtered Somali web corpus released with a matched tokenizer and a language-identification benchmark, addressing gaps in existing resources for Somali NLP research.

Contribution

The paper introduces a quality-filtered Somali corpus, a matched BPE-16K tokenizer, and a public side-by-side benchmark of three production language identifiers on Somali, filling a critical resource gap.

Findings

01. Existing Somali datasets contain significant duplicates and quality issues.

02. The new tokenizer reduces token count by 40.2% compared to GPT-4's tokenizer.

03. The corpus and tools enable better Somali NLP model development.
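
The token-count reduction in finding 02 is a relative difference against a baseline tokenizer. As a minimal sketch of that arithmetic, assuming both tokenizers are run over the same text: the function name `token_reduction` and the counts below are hypothetical, chosen only to make the calculation easy to follow, and are not figures from the paper.

```python
def token_reduction(baseline_tokens: int, new_tokens: int) -> float:
    """Relative reduction in token count, as a fraction of the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (baseline_tokens - new_tokens) / baseline_tokens


# Hypothetical counts for the same text under two tokenizers (not paper data).
reduction = token_reduction(baseline_tokens=1000, new_tokens=598)
print(f"{reduction:.1%}")  # prints 40.2%
```

A reduction of this kind means fewer tokens are needed to encode the same Somali text, so the same context window covers more text.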

Abstract

Somali is a Cushitic language of the Horn of Africa with ~25 million speakers, yet no documented dedicated Somali pretraining corpus with a companion tokenizer and language-identification benchmark has been publicly released. Existing Somali text appears either inside multilingual distributions (HPLT v2, CC100, MADLAD-400, OSCAR, mC4) or in small, undocumented Somali-only uploads on Hugging Face. We introduce SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents (~303M tokens) built from three upstream sources (HPLT v2, CC100, Somali Wikipedia) through a six-stage reproducible pipeline. We release (i) the corpus, (ii) a matched BPE-16K tokenizer, and (iii) the first public side-by-side Somali benchmark of three production language identifiers. Our measurements reveal concrete quality defects in existing distributions: HPLT v2's "cleaned" Somali release retains 17.3%…
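
The Findings above mention duplicates in existing distributions. One common way to estimate an exact-duplicate rate is to hash a normalized form of each document and count repeats. The sketch below illustrates that general technique only; the normalization (whitespace collapsing and lowercasing) and the name `exact_duplicate_rate` are assumptions, not the paper's pipeline, whose six stages are not described in this listing.

```python
import hashlib
from typing import Iterable


def exact_duplicate_rate(docs: Iterable[str]) -> float:
    """Fraction of documents that repeat an earlier one after normalization.

    Normalization here is an assumption for illustration: collapse runs of
    whitespace and lowercase the text before hashing.
    """
    seen = set()
    total = 0
    duplicates = 0
    for text in docs:
        total += 1
        normalized = " ".join(text.split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / total if total else 0.0


# Toy input: the second string duplicates the first after normalization.
sample = ["alpha beta", "Alpha   beta", "gamma"]
rate = exact_duplicate_rate(sample)
```

Exact hashing only catches copies that become identical after normalization; near-duplicates with small edits would need a fuzzy method such as MinHash, which this sketch does not attempt.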

Code & Models

Datasets

khaledyusuf44/somaliweb-v1
dataset · 229 downloads
