Synonym Discovery with Etymology-based Word Embeddings

Seunghyun Yoon; Pablo Estrada; Kyomin Jung

arXiv:1709.10445·cs.CL·December 13, 2017

TL;DR

This paper introduces a method for learning word embeddings from the etymological roots of words rather than from the contexts in which they appear. It does not require a large text corpus and is especially well suited to languages with logographic writing systems.

Contribution

The paper presents a novel etymology-based embedding model that leverages etymological graphs, suitable for languages with logographic writing systems, and demonstrates its effectiveness in synonym discovery.

Findings

01 Effective in Chinese and Sino-Korean vocabularies

02 Performs well in synonym discovery tasks

03 Requires only etymological data, not large corpora

Abstract

We propose a novel approach to learn word embeddings based on an extended version of the distributional hypothesis. Our model derives word embedding vectors using the etymological composition of words, rather than the context in which they appear. It has the strength of not requiring a large text corpus; instead, it requires reliable access to the etymological roots of words, making it especially well suited to languages with logographic writing systems. The model consists of three steps: (1) building an etymological graph, which is a bipartite network of words and etymological roots, (2) obtaining the biadjacency matrix of the etymological graph and reducing its dimensionality, (3) using columns/rows of the resulting matrices as embedding vectors. We test our model on the Chinese and Sino-Korean vocabularies. Our graphs are formed by a set of 117,000 Chinese words, and a set of 135,000…
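The three steps listed in the abstract can be sketched on a toy example. This is a minimal illustration, not the authors' implementation: the four-word vocabulary, the choice to treat each character as an etymological root, the helper names (`build_biadjacency`, `embed`, `cosine`), and the use of truncated SVD as the dimensionality reduction are all assumptions made here, since the abstract does not name a specific reduction method.

```python
import numpy as np

# Toy etymological graph (illustrative only): each word maps to its roots.
# Treating every character as a root is an assumption of this sketch.
word_roots = {
    "火山": ["火", "山"],  # volcano = fire + mountain
    "火车": ["火", "车"],  # train = fire + vehicle
    "汽车": ["汽", "车"],  # car = steam + vehicle
    "山水": ["山", "水"],  # landscape = mountain + water
}

def build_biadjacency(word_roots):
    """Steps 1-2a: build the bipartite graph's biadjacency matrix (words x roots)."""
    words = sorted(word_roots)
    roots = sorted({r for rs in word_roots.values() for r in rs})
    root_index = {r: j for j, r in enumerate(roots)}
    B = np.zeros((len(words), len(roots)))
    for i, w in enumerate(words):
        for r in word_roots[w]:
            B[i, root_index[r]] = 1.0  # edge between word i and root r
    return words, roots, B

def embed(B, k):
    """Steps 2b-3: reduce dimensionality with truncated SVD (an assumed choice)
    and read word vectors from the rows and root vectors from the columns."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    word_vecs = U[:, :k] * s[:k]     # one k-dimensional row per word
    root_vecs = Vt[:k, :].T * s[:k]  # one k-dimensional row per root
    return word_vecs, root_vecs

def cosine(a, b):
    """Cosine similarity, a common way to rank synonym candidates."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

words, roots, B = build_biadjacency(word_roots)
word_vecs, root_vecs = embed(B, k=2)
```

Under these assumptions, synonym candidates for a word would be the other words whose vectors have the highest `cosine` similarity to it. Words that share roots end up close together, which is the intuition the abstract describes; at the scale of the real vocabularies a sparse matrix and a sparse solver would likely be needed instead of a dense `np.linalg.svd`.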
