Wiki Dumps to Training Corpora: South Slavic Case

Mihailo Škorić; Cosimo Palma

arXiv:2604.25384·cs.CL·May 18, 2026

TL;DR

This paper introduces a pipeline for transforming Wikimedia dumps into high-quality, linguistically rich corpora for South Slavic languages, suitable for training language models and research.

Contribution

It presents a systematic extraction and filtering method to create reliable corpora from raw Wikimedia data, addressing low-quality content issues.
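
As a rough illustration of what the extraction step involves, the sketch below strips some common wikitext constructs (templates, references, links, bold/italic markers, headings) with regular expressions. It is not the authors' implementation: the function name `strip_wiki_markup` and the set of rules are assumptions made for this example, and real dumps contain many constructs (tables, nested file captions, parser functions) that this does not handle.

```python
import re

def strip_wiki_markup(text: str) -> str:
    """Rough wikitext-to-plain-text conversion (illustrative assumption,
    not the pipeline described in the paper)."""
    # Drop HTML comments and <ref> footnotes (self-closing and paired).
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    text = re.sub(r"<ref[^>/]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Remove templates such as {{Infobox ...}}; repeat to peel nested ones.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Drop simple file and category links entirely.
    text = re.sub(r"\[\[(?:File|Image|Category):[^\[\]]*\]\]", "", text)
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Remove bold/italic quote markers.
    text = re.sub(r"'{2,}", "", text)
    # Turn "== Heading ==" lines into bare heading text.
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1", text,
                  flags=re.MULTILINE)
    # Collapse runs of spaces and tabs left behind by the removals.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

sample = ("{{Infobox|name=X}}'''Beograd''' je [[glavni grad|prestonica]] "
          "[[Srbija|Srbije]].<ref>source</ref>")
print(strip_wiki_markup(sample))
```

In practice a dedicated wikitext parser is usually more robust than regular expressions; the regex version is shown only because it makes the individual cleaning rules visible.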

Findings

01. Extracted and cleaned corpora for seven South Slavic languages.

02. Implemented n-gram-based filtering to remove low-quality, repetitive articles.

03. Produced datasets suitable for language modeling and comparative linguistic research.

Abstract

This paper presents a pipeline designed to transform raw Wikimedia dumps into quality textual corpora for seven South Slavic languages. The work is divided into two major phases. The first involves extracting and cleaning text from raw dumps of Wikipedia, Wikisource, Wikibooks, Wikinews, and Wikiquote. This step requires careful handling of raw wiki markup to isolate first the textual articles and then the usable natural-language text within them. The second phase addresses the challenge of questionable or low-quality articles, which are often generated from databases or structured knowledge bases. These articles are characterised by repetitive patterns, generic phrasing, and little or no original content. To mitigate their impact, an n-gram-based filtering strategy was employed to detect high levels of textual redundancy between articles and then remove such articles from the corpora…
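
The abstract does not give the details of the filtering strategy, so the following is only a minimal sketch of one way n-gram redundancy between articles could be measured. Everything specific here is an assumption for illustration: word trigrams, whitespace tokenisation, the score (the share of an article's n-grams that also occur in at least one other article), the `threshold` of 0.5, and the function names `word_ngrams` and `filter_redundant`.

```python
from collections import Counter

def word_ngrams(text, n=3):
    """Set of lowercase word n-grams in a text (whitespace tokenisation)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def filter_redundant(articles, n=3, threshold=0.5):
    """Split {title: text} into (kept, removed) by cross-article redundancy.

    Hypothetical scoring: an article is removed when more than `threshold`
    of its distinct n-grams also appear in some other article.
    """
    grams = {title: word_ngrams(body, n) for title, body in articles.items()}
    # Document frequency: the number of articles containing each n-gram.
    doc_freq = Counter(g for gram_set in grams.values() for g in gram_set)
    kept, removed = {}, {}
    for title, gram_set in grams.items():
        shared = sum(1 for g in gram_set if doc_freq[g] > 1)
        ratio = shared / len(gram_set) if gram_set else 0.0
        target = removed if ratio > threshold else kept
        target[title] = articles[title]
    return kept, removed

articles = {
    "alfa": "Alfa je naseljeno mesto u opštini u centralnoj Srbiji",
    "beta": "Beta je naseljeno mesto u opštini u centralnoj Srbiji",
    "dunav": "Reka Dunav protiče kroz deset država i uliva se u Crno more",
}
kept, removed = filter_redundant(articles)
print(sorted(kept), sorted(removed))
```

The two templated stub articles differ only in their first word, so most of their trigrams are shared and both are flagged, while the article with original text is kept. A production filter would likely need language-aware tokenisation and a threshold tuned per corpus.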
