Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German

Lucy Linder; Michael Jungo; Jean Hennebert; Claudiu Musat; Andreas Fischer

arXiv:1912.00159 · cs.CL · June 17, 2020 · 5 cites

TL;DR

This paper introduces SwissCrawl, a Swiss German text corpus of more than half a million sentences built by web scraping, and shows that training on it significantly improves Swiss German language modeling.

Contribution

It presents a scalable method for building large text corpora from web data for low-resource languages, exemplified by Swiss German.

Findings

01 Using SwissCrawl leads to significant improvements in language modeling.

02 The customized web scraping tool could be applied to other low-resource languages.

03 The approach is designed to run continuously, so the corpus keeps growing as new content appears.

Abstract

This paper presents SwissCrawl, the largest Swiss German text corpus to date. Composed of more than half a million sentences, it was generated using a customized web scraping tool that could be applied to other low-resource languages as well. The approach demonstrates how freely available web pages can be used to construct comprehensive text corpora, which are of fundamental importance for natural language processing. In an experimental evaluation, we show that using the new corpus leads to significant improvements for the task of language modeling. To capture new content, our approach will run continuously to keep increasing the corpus over time.
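
The abstract does not spell out how the scraping tool decides which sentences to keep. As a rough illustration only, the core idea of turning raw web pages into a corpus can be sketched as: split each page into sentences, score each sentence with a language identifier, and keep the unique sentences that score above a threshold. Everything named below (`SWISS_GERMAN_HINTS`, `hint_score`, `build_corpus`, the `0.3` threshold) is a hypothetical stand-in and not the authors' implementation; in particular, the word-list scorer is a toy replacement for a trained language identification model.

```python
import re
from typing import Callable, Iterable, List, Set

# Hypothetical marker words, for illustration only. A real pipeline would
# use a trained language identification model instead of a word list.
SWISS_GERMAN_HINTS = {"isch", "nöd", "öppis", "chli", "gsi"}


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def hint_score(sentence: str) -> float:
    """Fraction of tokens found in the hint list (toy language score)."""
    tokens = re.findall(r"\w+", sentence.lower())
    if not tokens:
        return 0.0
    return sum(t in SWISS_GERMAN_HINTS for t in tokens) / len(tokens)


def build_corpus(
    pages: Iterable[str],
    score: Callable[[str], float] = hint_score,
    threshold: float = 0.3,  # assumed cut-off, not taken from the paper
    min_tokens: int = 3,     # assumed minimum sentence length
) -> List[str]:
    """Collect unique sentences whose language score reaches the threshold."""
    seen: Set[str] = set()
    corpus: List[str] = []
    for page in pages:
        for sentence in split_sentences(page):
            key = sentence.lower()
            if key in seen:
                continue  # drop duplicates across pages
            if len(re.findall(r"\w+", sentence)) < min_tokens:
                continue  # drop fragments that are too short to judge
            if score(sentence) >= threshold:
                seen.add(key)
                corpus.append(sentence)
    return corpus


pages = [
    "Das isch nöd schlecht. This is an English sentence.",
    "Das isch nöd schlecht. Es isch chli chalt gsi.",
]
corpus = build_corpus(pages)
```

Passing a different `score` function is the intended extension point: swapping in a scorer for another language is what would let the same loop serve other low-resource languages, under the assumption that a usable language identifier exists for them.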

Taxonomy

Topics: Natural Language Processing Techniques