The Danish Gigaword Project

Leon Strømberg-Derczynski; Manuel R. Ciosici; Rebekah Baglini; Morten H. Christiansen; Jacob Aarup Dalsgaard; Riccardo Fusaroli; Peter Juel Henrichsen; Rasmus Hvingelby; Andreas Kirkedal; Alex Speed Kjeldsen; Claus Ladefoged; Finn Årup Nielsen; Malte Lau Petersen; Jonathan Hvithamar Rystrøm; Daniel Varab

arXiv:2005.03521·cs.CL·May 14, 2021·6 cites

TL;DR

The Danish Gigaword Project created a comprehensive, freely available one-billion-word Danish corpus covering diverse sources, dialects, and socio-economic backgrounds to advance NLP research in Danish.

Contribution

It introduces the Danish Gigaword Corpus, a large, diverse, and accessible dataset addressing previous resource limitations for Danish language technology.

Findings

01 Provides a broad-coverage Danish corpus of one billion words

02 Includes diverse sources, dialects, and socio-economic backgrounds

03 Enables improved NLP applications for Danish

Abstract

Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers' socio-economic status, and Danish dialects.

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Linguistics and Terminology Studies