Synonym Discovery with Etymology-based Word Embeddings
Seunghyun Yoon, Pablo Estrada, Kyomin Jung

TL;DR
This paper introduces a method for learning word embeddings from the etymological roots of words rather than from the contexts in which they appear. It is well suited to languages with logographic writing systems and does not require a large text corpus.
Contribution
The paper presents a novel etymology-based embedding model built on an etymological graph, a bipartite network of words and their etymological roots. The model is suited to languages with logographic writing systems, and the paper demonstrates its effectiveness in synonym discovery.
Findings
Effective in Chinese and Sino-Korean vocabularies
Performs well in synonym discovery tasks
Requires only etymological data, not large corpora
Abstract
We propose a novel approach to learn word embeddings based on an extended version of the distributional hypothesis. Our model derives word embedding vectors using the etymological composition of words, rather than the context in which they appear. It has the strength of not requiring a large text corpus, but instead it requires reliable access to etymological roots of words, making it especially well suited to languages with logographic writing systems. The model consists of three steps: (1) building an etymological graph, which is a bipartite network of words and etymological roots, (2) obtaining the biadjacency matrix of the etymological graph and reducing its dimensionality, (3) using columns/rows of the resulting matrices as embedding vectors. We test our model on the Chinese and Sino-Korean vocabularies. Our graphs are formed by a set of 117,000 Chinese words, and a set of 135,000…
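The three steps in the abstract can be sketched on a toy vocabulary. This is a minimal illustration, not the authors' implementation: the four example words, the choice to treat each character as a root, the function names, and the use of truncated SVD for the dimensionality reduction are all assumptions made here (the abstract says only that the biadjacency matrix is reduced in dimensionality).

```python
import numpy as np

# Toy etymological data (illustrative assumption): each word is mapped to
# the characters it is written with, treated here as its roots.
word_roots = {
    "學生": ["學", "生"],  # student
    "學校": ["學", "校"],  # school
    "火山": ["火", "山"],  # volcano
    "火車": ["火", "車"],  # train
}

def build_biadjacency(word_roots):
    """Steps 1-2a: bipartite word/root graph as a words x roots 0/1 matrix."""
    words = sorted(word_roots)
    roots = sorted({r for rs in word_roots.values() for r in rs})
    root_index = {r: j for j, r in enumerate(roots)}
    B = np.zeros((len(words), len(roots)))
    for i, w in enumerate(words):
        for r in word_roots[w]:
            B[i, root_index[r]] = 1.0
    return words, roots, B

def embed(B, k):
    """Steps 2b-3: reduce B to rank k (truncated SVD is an assumption here)
    and read off one k-dimensional vector per word and per root."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    word_vecs = U[:, :k] * s[:k]    # one row per word
    root_vecs = Vt[:k].T * s[:k]    # one row per root
    return word_vecs, root_vecs

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

words, roots, B = build_biadjacency(word_roots)
word_vecs, root_vecs = embed(B, k=2)
index = {w: i for i, w in enumerate(words)}

# Words that share a root should end up closer than words that share none.
print(cosine(word_vecs[index["學生"]], word_vecs[index["學校"]]))
print(cosine(word_vecs[index["學生"]], word_vecs[index["火山"]]))
```

In this sketch, synonym candidates for a word would be its nearest neighbours by cosine similarity among the word vectors. For vocabularies of the size mentioned in the abstract, a sparse matrix and a sparse truncated decomposition would be the natural substitute for the dense `np.linalg.svd` call used here.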
