Dynaword: From One-shot to Continuously Developed Datasets

Kenneth Enevoldsen; Kristian Nørgaard Jensen; Jan Kostkan; Balázs Szabó; Márton Kardos; Kirten Vad; Johan Heinsen; Andrea Blasi Núñez; Gianluca Barmina; Jacob Nielsen; Rasmus Larsen; Peter Vahlstrup; Per Møldrup Dalum; Desmond Elliott; Lukas Galke; Peter Schneider-Kamp; Kristoffer Nielbo

arXiv:2508.02271·cs.CL·August 6, 2025

TL;DR

Dynaword introduces a framework and implementation for creating large, openly licensed NLP datasets that are continuously updated through community collaboration, enhancing dataset quality, size, and longevity.

Contribution

The paper presents the Dynaword framework and Danish Dynaword implementation, enabling ongoing community-driven dataset development in NLP.

Findings

01

Danish Dynaword contains over four times as many tokens as comparable datasets.

02

Danish Dynaword is exclusively openly licensed, facilitating sharing and derivative works.

03

The dataset includes quality assurance tests for data formatting and documentation.

Abstract

Large-scale datasets are foundational for research and development in natural language processing. However, current approaches face three key challenges: (1) reliance on ambiguously licensed sources restricting use, sharing, and derivative works; (2) static dataset releases that prevent community contributions and diminish longevity; and (3) quality assurance processes restricted to publishing teams rather than leveraging community expertise. To address these limitations, we introduce two contributions: the Dynaword approach and Danish Dynaword. The Dynaword approach is a framework for creating large-scale, open datasets that can be continuously updated through community collaboration. Danish Dynaword is a concrete implementation that validates this approach and demonstrates its potential. Danish Dynaword contains over four times as many tokens as comparable releases, is exclusively…

Taxonomy

Topics: Scientific Computing and Data Management · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies