A fully data-driven method to identify (correlated) changes in diachronic corpora
Alexander Koplenig

TL;DR
This paper introduces a fully data-driven, computationally efficient method for detecting and interpreting correlated linguistic changes over time in diachronic corpora, linking trends to historical events.
Contribution
It extends a corpus similarity measure to identify and interpret diachronic trends and correlated changes, supporting both the study of language evolution and NLP applications.
Findings
The method is computationally cheap and interpretable.
Effectively identifies correlated linguistic shifts linked to historical events.
Can help improve diachronic POS tagging and complement existing NLP methods.
Abstract
In this paper, a method for measuring synchronic corpus (dis-)similarity put forward by Kilgarriff (2001) is adapted and extended to identify trends and correlated changes in diachronic text data, using the Corpus of Historical American English (Davies 2010a) and the Google Ngram Corpora (Michel et al. 2010a). This paper shows that this fully data-driven method, which extracts word types that have undergone the most pronounced change in frequency in a given period of time, is computationally very cheap and that it allows interpretations of diachronic trends that are both intuitively plausible and motivated from the perspective of information theory. Furthermore, it demonstrates that the method is able to identify correlated linguistic changes and diachronic shifts that can be linked to historical events. Finally, it can help to improve diachronic POS tagging and complement existing NLP…
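To make the idea of extracting the word types with the most pronounced frequency change more concrete, here is a minimal Python sketch. It assumes a chi-square-style score: each word's contribution to a chi-square statistic comparing two time slices. The abstract does not specify the exact statistic, so this is an illustration of the general approach, not the paper's implementation. The function names and the toy word counts are hypothetical.

```python
from collections import Counter


def chi_square_contributions(counts_a, counts_b):
    """Per-word contribution to a chi-square statistic comparing two periods.

    Assumption: a chi-square-style score stands in for the paper's measure.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both periods need at least one token")
    grand_total = total_a + total_b

    scores = {}
    for word in set(counts_a) | set(counts_b):
        obs_a = counts_a.get(word, 0)
        obs_b = counts_b.get(word, 0)
        row_total = obs_a + obs_b
        # Expected counts if the word were equally likely in both periods.
        exp_a = row_total * total_a / grand_total
        exp_b = row_total * total_b / grand_total
        scores[word] = (obs_a - exp_a) ** 2 / exp_a + (obs_b - exp_b) ** 2 / exp_b
    return scores


def top_changes(counts_a, counts_b, n=10):
    """Return the n word types with the largest score, highest first."""
    scores = chi_square_contributions(counts_a, counts_b)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Toy data (hypothetical): word counts from two time slices of a corpus.
period_a = Counter("the horse pulled the carriage and the horse rested".split())
period_b = Counter("the car passed the truck and the car stopped".split())

scores = chi_square_contributions(period_a, period_b)
for word, score in top_changes(period_a, period_b, n=3):
    print(f"{word}: {score:.2f}")
```

A single pass over the vocabulary is enough, which is in line with the abstract's claim that the method is computationally very cheap. Words whose relative frequency is stable across the two periods, such as "the" in the toy data, score zero.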
Taxonomy
Topics: Language and cultural evolution · Natural Language Processing Techniques · Authorship Attribution and Profiling
