The Danish Gigaword Project
Leon Strømberg-Derczynski, Manuel R. Ciosici, Rebekah Baglini, Morten H. Christiansen, Jacob Aarup Dalsgaard, Riccardo Fusaroli, Peter Juel Henrichsen, Rasmus Hvingelby, Andreas Kirkedal, Alex Speed Kjeldsen, Claus Ladefoged, Finn Årup Nielsen, Malte Lau Petersen

TL;DR
The Danish Gigaword Project created a comprehensive, freely available one-billion-word Danish corpus covering diverse sources, dialects, and socio-economic backgrounds to advance NLP research in Danish.
Contribution
It introduces the Danish Gigaword Corpus, a large, diverse, and accessible dataset addressing previous resource limitations for Danish language technology.
Findings
Provides a broad-coverage Danish corpus of one billion words
Includes diverse sources, dialects, and socio-economic backgrounds
Enables improved NLP applications for Danish
Abstract
Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers' socio-economic status, and Danish dialects.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Linguistics and Terminology Studies
