Dynaword: From One-shot to Continuously Developed Datasets
Kenneth Enevoldsen, Kristian Nørgaard Jensen, Jan Kostkan, Balázs Szabó, Márton Kardos, Kirten Vad, Johan Heinsen, Andrea Blasi Núñez, Gianluca Barmina, Jacob Nielsen, Rasmus Larsen, Peter Vahlstrup, Per Møldrup Dalum, Desmond Elliott, Lukas Galke

TL;DR
Dynaword introduces a framework and implementation for creating large, openly licensed NLP datasets that are continuously updated through community collaboration, enhancing dataset quality, size, and longevity.
Contribution
The paper presents the Dynaword framework and Danish Dynaword implementation, enabling ongoing community-driven dataset development in NLP.
Findings
Danish Dynaword contains over four times as many tokens as comparable releases.
It is exclusively openly licensed, facilitating sharing and derivative works.
The dataset includes quality assurance tests for data formatting and documentation.
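As a rough illustration of what automated formatting tests on a community-maintained dataset might look like, here is a minimal sketch. It is not the project's actual test suite: the field names (`id`, `text`, `source`, `license`), the license allow-list, and both function names are assumptions made up for this example.

```python
from typing import Any, Dict, List

# Hypothetical schema: these field names are assumptions for illustration,
# not the actual Danish Dynaword columns.
REQUIRED_FIELDS = {"id": str, "text": str, "source": str, "license": str}

# Hypothetical allow-list of open license identifiers.
OPEN_LICENSES = {"cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0"}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record; empty means it passes."""
    problems = []
    # Every required field must be present and have the expected type.
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    # A document with only whitespace carries no usable text.
    text = record.get("text")
    if isinstance(text, str) and not text.strip():
        problems.append("empty text")
    # Only licenses on the allow-list are accepted.
    license_id = record.get("license")
    if isinstance(license_id, str) and license_id.lower() not in OPEN_LICENSES:
        problems.append(f"license not on allow-list: {license_id}")
    return problems


def validate_dataset(records: List[Dict[str, Any]]) -> Dict[int, List[str]]:
    """Validate all records and flag duplicate ids; maps record index to problems."""
    seen_ids = set()
    report = {}
    for index, record in enumerate(records):
        problems = validate_record(record)
        record_id = record.get("id")
        if isinstance(record_id, str):
            if record_id in seen_ids:
                problems.append(f"duplicate id: {record_id}")
            seen_ids.add(record_id)
        if problems:
            report[index] = problems
    return report
```

Checks of this kind could run on every proposed contribution, so that a malformed or non-openly-licensed record is rejected before it reaches a release. Returning a list of problems rather than raising on the first one lets a contributor see everything that needs fixing in a single pass.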
Abstract
Large-scale datasets are foundational for research and development in natural language processing. However, current approaches face three key challenges: (1) reliance on ambiguously licensed sources restricting use, sharing, and derivative works; (2) static dataset releases that prevent community contributions and diminish longevity; and (3) quality assurance processes restricted to publishing teams rather than leveraging community expertise. To address these limitations, we introduce two contributions: the Dynaword approach and Danish Dynaword. The Dynaword approach is a framework for creating large-scale, open datasets that can be continuously updated through community collaboration. Danish Dynaword is a concrete implementation that validates this approach and demonstrates its potential. Danish Dynaword contains over four times as many tokens as comparable releases, is exclusively…
Code & Models
- 🤗 danish-foundation-models/gemma-3-1b-cpt-dynaword-full-v1 (model, 1 download)
- 🤗 danish-foundation-models/gemma-3-1b-cpt-dynaword-matched-v1 (model, 1 download)
- 🤗 danish-foundation-models/gemma-3-1b-cpt-gigaword-v1 (model)
- 🤗 danish-foundation-models/gemma-3-1b-scratch-dynaword-full-v1 (model, 8 downloads)
- 🤗 danish-foundation-models/gemma-3-1b-scratch-dynaword-matched-v1 (model, 3 downloads)
- 🤗 danish-foundation-models/gemma-3-1b-scratch-gigaword-v1 (model)
Taxonomy
Topics: Scientific Computing and Data Management · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies
