The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora
Taja Kuzman Pungeršek, Peter Rupnik, Vít Suchomel, Nikola Ljubešić

TL;DR
This paper presents the development and analysis of the CLASSLA-web 2.0 corpus, a large, iteratively crawled web corpus for South Slavic languages, highlighting both its growth and challenges such as content degradation and machine-generated sites.
Contribution
It introduces a new iterative crawling infrastructure and the resulting large, annotated corpus for South Slavic languages, expanding resources for linguistic research.
Findings
The corpus contains 17.0 billion words across seven languages.
There is only a 20% overlap with the previous corpus (CLASSLA-web 1.0), indicating substantial new content.
A degradation of web content and a rise in machine-generated texts were observed.
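To make the overlap figure concrete, here is a minimal sketch of one way such a share could be measured between two corpus versions. It is not the authors' method, which this summary does not describe: the function names `fingerprint` and `overlap_share`, the whitespace and case normalization, and the choice of exact-match hashing are all assumptions made for illustration.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash a text after lowercasing and collapsing whitespace (assumed normalization)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def overlap_share(old_texts, new_texts) -> float:
    """Return the share of texts in the new corpus that also occur in the old one."""
    old = {fingerprint(t) for t in old_texts}
    new = [fingerprint(t) for t in new_texts]
    if not new:
        return 0.0
    return sum(fp in old for fp in new) / len(new)
```

Exact-match hashing only counts identical texts after normalization; near-duplicate detection or URL-level comparison would give different numbers, so the result depends heavily on which definition of overlap is chosen.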
Abstract
Crawling national top-level domains has proven to be highly effective for collecting texts in less-resourced languages. This approach has recently been used for South Slavic languages and resulted in the largest general corpora for this language group: the CLASSLA-web 1.0 corpora. Building on this success, we established a continuous crawling infrastructure for iterative national top-level domain crawling across South Slavic and related webs. We present the first outcome of this crawling infrastructure: the CLASSLA-web 2.0 corpus collection, with substantially larger web corpora containing 17.0 billion words in 38.1 million texts in seven languages: Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian. In addition to genre categories, the new version is also automatically annotated with topic labels. Comparing CLASSLA-web 2.0 with its predecessor reveals that…
Taxonomy
Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Text Readability and Simplification
