Wiki Dumps to Training Corpora: South Slavic Case
Mihailo Škorić, Cosimo Palma

TL;DR
This paper introduces a pipeline for transforming Wikimedia dumps into high-quality, linguistically rich corpora for South Slavic languages, suitable for training language models and research.
Contribution
It presents a systematic extraction and filtering method to create reliable corpora from raw Wikimedia data, addressing low-quality content issues.
Findings
Successfully extracted and cleaned corpora for seven South Slavic languages.
Implemented n-gram-based filtering to remove low-quality, repetitive articles.
Produced datasets suitable for language modeling and comparative linguistic research.
Abstract
This paper presents a pipeline designed to transform raw Wikimedia dumps into quality textual corpora for seven South Slavic languages. The work is divided into two major phases. The first involves extracting and cleaning text from raw dumps of Wikipedia, Wikisource, Wikibooks, Wikinews, and Wikiquote. This step requires careful handling of raw wiki markup to isolate first the textual articles and then the usable natural language text within them. The second phase addresses the challenge of questionable or low-quality articles, which are often generated from databases or structured knowledge bases. These articles are characterised by repetitive patterns, generic phrasing, and little to no original content. To mitigate their impact, an n-gram-based filtering strategy was employed to detect high levels of textual redundancy between articles and then remove such articles from the corpora…
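The redundancy check described in the second phase can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state the n-gram size, the similarity measure, or the cut-off, so the word trigrams, the Jaccard similarity, the 0.5 threshold, and the helper names (`word_ngrams`, `jaccard`, `flag_redundant`) are all assumptions chosen for illustration.

```python
from itertools import combinations


def word_ngrams(text, n=3):
    """Return the set of word n-grams (as tuples) of a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_redundant(articles, n=3, threshold=0.5):
    """Return titles of articles whose n-gram overlap with any other
    article reaches the threshold. `articles` maps title -> plain text.
    The values of n and threshold are assumed, not taken from the paper."""
    grams = {title: word_ngrams(text, n) for title, text in articles.items()}
    flagged = set()
    for (t1, g1), (t2, g2) in combinations(grams.items(), 2):
        if jaccard(g1, g2) >= threshold:
            flagged.update((t1, t2))
    return flagged


# Toy data: two template-like stubs and one original text.
articles = {
    "Village A": "Village A is a village in the municipality of X "
                 "with a population of 120 people",
    "Village B": "Village B is a village in the municipality of X "
                 "with a population of 340 people",
    "Essay": "The river bends twice before it reaches the old mill "
             "and the quiet orchard beyond",
}
redundant = flag_redundant(articles)
kept = {t: a for t, a in articles.items() if t not in redundant}
```

On this toy input the two template-like stubs share most of their trigrams and are both flagged, while the third text is kept. The pairwise comparison grows quadratically with the number of articles, so a corpus-scale version would need an index or hashing scheme. That choice is outside what the abstract specifies.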
