DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling
Mariia Fedorova, Andrey Kutuzov, Khonzoda Umarova

TL;DR
DHPLT is an open collection of web-crawled diachronic corpora in 41 languages, covering three time periods (2011-2015, 2020-2021 and 2024-present) and released with pre-computed word embeddings and lexical substitutions for semantic change research.
Contribution
DHPLT introduces a large-scale multilingual collection of diachronic corpora, dated by web crawl timestamps and accompanied by pre-computed type and token embeddings, extending semantic change research beyond the dozen or so high-resource languages covered by existing resources.
Findings
Provides datasets covering 41 languages across three time periods.
Includes pre-computed word embeddings and lexical substitutions.
Fills a gap in resources for multilingual semantic change modelling beyond about a dozen high-resource languages.
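To illustrate how pre-computed token embeddings of this kind might be used, here is a minimal sketch of one common baseline for semantic change scoring: average a target word's token embeddings within each time period and take the cosine distance between the two averages. This is not confirmed to be the authors' method, and the function names (`mean_vector`, `cosine_distance`, `prototype_shift`) and the plain-list input format are assumptions made for the example, not part of the DHPLT release.

```python
import math
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty collection of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def prototype_shift(
    embeddings_t1: Sequence[Sequence[float]],
    embeddings_t2: Sequence[Sequence[float]],
) -> float:
    """Hypothetical change score: distance between per-period mean embeddings.

    Assumes both sets of token embeddings come from the same model, so the
    vectors live in one shared space and need no alignment.
    """
    return cosine_distance(mean_vector(embeddings_t1), mean_vector(embeddings_t2))
```

A higher score suggests the word's average usage differs more between the two periods. Comparing scores across words or languages would need further normalisation, which this sketch does not attempt.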
Abstract
In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as an approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while leaving it open for other researchers to come up with their own target words using the same datasets. DHPLT aims to fill the current lack of multilingual diachronic corpora for semantic change modelling (beyond a dozen high-resource languages). It opens the way for a variety of new experimental setups in this field. All the resources described in this paper are…
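The period assignment described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `assign_period`, the ISO 8601 input format, the treatment of each range as inclusive calendar years, and the handling of "2024-present" as open-ended are all assumptions made for the example.

```python
from datetime import datetime, timezone
from typing import Optional

# Period labels follow the abstract. Treating the ranges as inclusive calendar
# years (and the last one as open-ended) is an assumption of this sketch.
PERIODS = [
    ("2011-2015", 2011, 2015),
    ("2020-2021", 2020, 2021),
    ("2024-present", 2024, None),
]


def assign_period(crawl_timestamp: str) -> Optional[str]:
    """Map an ISO 8601 crawl timestamp to a period label.

    Returns None when the timestamp falls in a gap between periods
    (for example 2016-2019), so such documents can be skipped.
    """
    # Older Python versions do not accept a trailing "Z", so normalise it.
    ts = datetime.fromisoformat(crawl_timestamp.replace("Z", "+00:00"))
    # Compare in UTC when an offset is present; naive timestamps are used as-is.
    year = ts.astimezone(timezone.utc).year if ts.tzinfo else ts.year
    for label, start, end in PERIODS:
        if year >= start and (end is None or year <= end):
            return label
    return None
```

Because the crawl timestamp is only an approximate signal of when a document was written, a label produced this way is best read as an upper bound on the document's creation time.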
Taxonomy
Topics: Language and cultural evolution · Computational and Text Analysis Methods · Natural Language Processing Techniques
