TL;DR
This paper introduces a dataset that tracks synonym sets (synsets) from 1800 AD to 2000 AD, and a supervised learning algorithm that predicts each synset's future leader: the word that will have the highest frequency.
Contribution
It presents a new dataset built from WordNet and the Google Books Ngram Corpus, together with a predictive model of change in synset leadership, to support the study of lexical evolution.
Findings
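The model predicts changes of synset leadership, including the identity of the new leader, 50 years ahead with an F-score considerably above random guessing.
Analysis of the learned models confirms trends that linguists have observed, such as the replacement of the -ise suffix with -ize and the tension between economy and clarity.
Combining WordNet with the Google Books Ngram Corpus makes it possible to track the frequencies of words within a synset over two centuries.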
Abstract
We introduce a dataset for studying the evolution of words, constructed from WordNet and the Google Books Ngram Corpus. The dataset tracks the evolution of 4,000 synonym sets (synsets), containing 9,000 English words, from 1800 AD to 2000 AD. We present a supervised learning algorithm that is able to predict the future leader of a synset: the word in the synset that will have the highest frequency. The algorithm uses features based on a word's length, the characters in the word, and the historical frequencies of the word. It can predict change of leadership (including the identity of the new leader) fifty years in the future, with an F-score considerably above random guessing. Analysis of the learned models provides insight into the causes of change in the leader of a synset. The algorithm confirms observations linguists have made, such as the trend to replace the -ise suffix with -ize,…
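As a minimal sketch of the definitions in the abstract, the Python below computes the leader of a synset (the word with the highest frequency) and builds a feature dictionary from a word's length, its characters, and its historical frequencies. The function names, the feature encoding, and the toy frequencies are all assumptions made for illustration; they are not the paper's implementation or data.

```python
from collections import Counter


def synset_leader(freqs):
    """Return the word with the highest frequency in a synset.

    `freqs` maps each word in the synset to its frequency in one period.
    """
    return max(freqs, key=freqs.get)


def word_features(word, history):
    """Build a feature dict from length, characters, and past frequencies.

    `history` maps a year to the word's frequency in that year.
    This encoding is hypothetical, chosen only to illustrate the idea.
    """
    features = {"length": len(word)}
    for ch, n in Counter(word).items():
        features[f"char_{ch}"] = n
    for year, freq in sorted(history.items()):
        features[f"freq_{year}"] = freq
    return features


# Toy numbers invented for illustration only; not taken from the dataset.
toy_synset = {
    "realise": {1900: 0.60, 1950: 0.45},
    "realize": {1900: 0.40, 1950: 0.55},
}

leader_1900 = synset_leader({w: h[1900] for w, h in toy_synset.items()})
leader_1950 = synset_leader({w: h[1950] for w, h in toy_synset.items()})
leader_changed = leader_1900 != leader_1950
```

In a supervised setup along these lines, the feature dictionaries for the words in a synset would be the inputs, and the label would be whether a word is the leader at the target date, here fifty years later.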
