Statistically significant detection of semantic shifts using contextual word embeddings
Yang Liu, Alan Medlar, Dorota Glowacka

TL;DR
This paper introduces a statistically rigorous method for detecting semantic shifts in words over time using contextual embeddings and permutation tests, improving robustness especially in small datasets.
Contribution
It combines contextual word embeddings with permutation-based statistical tests and false discovery rate correction to reliably identify semantic change.
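The permutation step can be sketched for a single word as follows. This is a minimal illustration, not the authors' implementation: it assumes each usage of the word has already been encoded as a contextual embedding vector, and it assumes the test statistic is the cosine distance between the mean embeddings of the two time periods (the summary above does not state which statistic the paper uses). The function names and the toy data are hypothetical.

```python
import math
import random


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def permutation_p_value(emb_t1, emb_t2, n_permutations=1000, seed=0):
    """Permutation test for a shift between two sets of usage embeddings.

    Assumed statistic: cosine distance between the period means.
    Returns (observed_statistic, p_value).
    """
    rng = random.Random(seed)
    observed = cosine_distance(mean_vector(emb_t1), mean_vector(emb_t2))

    pooled = list(emb_t1) + list(emb_t2)
    n1 = len(emb_t1)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffling the pooled usages simulates the null hypothesis
        # that the period labels carry no information.
        rng.shuffle(pooled)
        stat = cosine_distance(mean_vector(pooled[:n1]), mean_vector(pooled[n1:]))
        if stat >= observed:
            at_least_as_extreme += 1

    # The +1 terms count the observed labelling itself, so p is never 0.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Toy 2-d "embeddings": usages point in clearly different directions per period.
period_1 = [[1.0, 0.01 * i] for i in range(10)]
period_2 = [[0.01 * i, 1.0] for i in range(10)]
observed, p_value = permutation_p_value(period_1, period_2)
```

Because the null distribution is built from the word's own usages, a word with few or highly variable usages yields a wide null distribution and a larger p-value, which is how sample variation enters the decision.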
Findings
Consistently high precision in simulation, achieved by suppressing false positives
Improved robustness of semantic shift estimates in real-world data
Effective detection of semantic change in small datasets
Abstract
Detecting lexical semantic change in smaller data sets, e.g. in historical linguistics and digital humanities, is challenging due to a lack of statistical power. This issue is exacerbated by non-contextual embedding models that produce one embedding per word and, therefore, mask the variability present in the data. In this article, we propose an approach to estimate semantic shift by combining contextual word embeddings with permutation-based statistical tests. We use the false discovery rate procedure to address the large number of hypothesis tests being conducted simultaneously. We demonstrate the performance of this approach in simulation where it achieves consistently high precision by suppressing false positives. We additionally analyze real-world data from SemEval-2020 Task 1 and the Liverpool FC subreddit corpus. We show that by taking sample variation into account, we can…
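Since one test is run per word, the p-values need a multiple-testing correction. The sketch below assumes the "false discovery rate procedure" is the Benjamini-Hochberg step-up procedure, which is the most common choice; the abstract does not name the variant, so treat this as an illustration only. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Benjamini-Hochberg step-up: sort the p-values, find the largest rank k
    with p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            max_rank = rank  # keep the largest rank that passes

    rejected = [False] * m
    for index in order[:max_rank]:
        rejected[index] = True
    return rejected


# One p-value per candidate word (made-up numbers for illustration).
word_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
shifted = benjamini_hochberg(word_p_values, alpha=0.05)
```

Note the step-up behaviour: a p-value that misses its own threshold is still rejected if a larger p-value passes a later threshold.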
Taxonomy
Topics: Language and cultural evolution · Complex Network Analysis Techniques · Advanced Text Analysis Techniques
