The power of context: Random Forest classification of near synonyms. A case study in Modern Hindi
Jacek Bąkowski

TL;DR
This study demonstrates that Random Forest classifiers trained on word embeddings can distinguish Hindi synonyms by their etymological origins, revealing that usage patterns encode historical and cultural signals.
Contribution
It provides quantitative evidence that context captures subtle etymological distinctions in synonyms, supporting the idea that language reflects cultural and historical influences.
Findings
Random Forest classifies Hindi synonyms by origin with high accuracy.
Usage patterns preserve traces of etymology even when words are semantically unrelated.
Context encodes subtle distinctions linked to historical language contact.
Abstract
Synonymy is a widespread yet puzzling linguistic phenomenon. In theory, absolute synonyms should not exist, since they do not expand a language's expressive potential. However, it has been suggested that even when synonyms denote the same concept, they may reflect different perspectives or carry distinct cultural associations; these claims have rarely been tested quantitatively. In Hindi, prolonged contact with Persian produced many Perso-Arabic loanwords that coexist with their Sanskrit counterparts, forming numerous synonym pairs. This study investigates whether, centuries after these borrowings appeared in the Subcontinent, their origin can still be distinguished using distributional data alone, regardless of their semantic content. A Random Forest trained on word embeddings of Hindi synonyms successfully classified words by Sanskrit or Perso-Arabic origin, even when they were semantically…
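The setup described in the abstract, a Random Forest that predicts a word's origin from its embedding vector, can be sketched roughly as follows. This is a minimal illustration and not the author's pipeline: the embeddings are synthetic Gaussian vectors standing in for real Hindi word vectors, and the function names, the dimensionality (50), the class shift, the number of trees, and the 5-fold cross-validation are all assumptions chosen for the example. It uses scikit-learn's RandomForestClassifier and cross_val_score.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def make_synthetic_embeddings(n_per_class=100, dim=50, shift=0.5, seed=0):
    """Return placeholder 'embeddings' and origin labels.

    ASSUMPTION: two Gaussian clouds stand in for real word vectors.
    This is NOT Hindi data; it only makes the sketch runnable.
    Label 0 plays the role of Sanskrit origin, 1 of Perso-Arabic origin.
    """
    rng = np.random.default_rng(seed)
    sanskrit = rng.normal(0.0, 1.0, size=(n_per_class, dim))
    perso_arabic = rng.normal(shift, 1.0, size=(n_per_class, dim))
    X = np.vstack([sanskrit, perso_arabic])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


def origin_cv_accuracy(X, y, n_trees=200, folds=5, seed=0):
    """Mean cross-validated accuracy of a Random Forest on (X, y).

    ASSUMPTION: hyperparameters are illustrative defaults, not the
    settings used in the study.
    """
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    # With an integer cv and a classifier, scikit-learn uses stratified
    # folds, so each fold keeps both origin classes.
    scores = cross_val_score(clf, X, y, cv=folds, scoring="accuracy")
    return float(scores.mean())


X, y = make_synthetic_embeddings()
mean_acc = origin_cv_accuracy(X, y)
print(f"mean CV accuracy on synthetic data: {mean_acc:.2f}")
```

With real data, X would hold one embedding per word and y its etymological label. A useful sanity check in that setting is to rerun the same function on randomly permuted labels, which should give accuracy near chance if the classifier is picking up a genuine signal.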
