Word-Graph2vec: An efficient word embedding approach on word co-occurrence graph using random walk technique

Wenting Li; Jiahong Xue; Xi Zhang; Huacan Chen; Zeyu Chen; Feijuan Huang; Yuanzhe Cai

arXiv:2301.04312·cs.CL·December 29, 2023

TL;DR

Word-Graph2vec is a graph-based word embedding method that efficiently generates embeddings from large corpora by sampling from a word co-occurrence graph, outperforming traditional methods in speed and accuracy.

Contribution

The paper introduces Word-Graph2vec, a novel graph-based approach that improves efficiency and stability of word embedding training on large datasets using random walk sampling.

Findings

01 Outperforms Word2vec by 4-5 times in efficiency

02 Surpasses FastText by 2-3 times in efficiency

03 Maintains small error with random walk sampling

Abstract

Word embedding has become ubiquitous and is widely used in natural language processing (NLP) tasks such as web retrieval, web semantic analysis, and machine translation. Unfortunately, training word embeddings on a relatively large corpus is prohibitively expensive. We propose a graph-based word embedding algorithm, called Word-Graph2vec, which converts the large corpus into a word co-occurrence graph, samples word sequences from this graph by random walk, and finally trains the word embedding on the sampled corpus. We posit that, because English has a limited vocabulary and a large number of idioms and fixed expressions, the size and density of the word co-occurrence graph change only slightly as the training corpus grows. As a result, Word-Graph2vec has a stable runtime on large-scale data sets, and its performance advantage becomes more and more…
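The pipeline the abstract describes (corpus to co-occurrence graph, random walks over the graph, training on the sampled sequences) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the window size, the walk length, the number of walks per node, and the choice to weight each step by raw co-occurrence count are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_cooccurrence_graph(sentences, window=2):
    """Return {word: {neighbor: count}} for words within `window` tokens.

    Hypothetical helper: the graph is undirected and edge weights are raw
    co-occurrence counts (an assumption, not taken from the paper).
    """
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                other = tokens[j]
                if other == word:
                    continue  # skip self-loops
                graph[word][other] += 1
                graph[other][word] += 1
    return graph


def random_walk(graph, start, length, rng):
    """Walk up to `length` nodes, picking each next word in proportion to edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1])
        if not neighbors:
            break  # dead end: no outgoing edges
        words = list(neighbors)
        weights = [neighbors[w] for w in words]
        walk.append(rng.choices(words, weights=weights, k=1)[0])
    return walk


def sample_corpus(graph, walks_per_node=2, walk_length=5, seed=0):
    """Start `walks_per_node` walks from every node; return them as token lists."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        for node in list(graph):
            corpus.append(random_walk(graph, node, walk_length, rng))
    return corpus


if __name__ == "__main__":
    toy = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    g = build_cooccurrence_graph(toy, window=1)
    for walk in sample_corpus(g):
        print(" ".join(walk))
```

The sampled walks are plain lists of tokens, so they could be passed to any Word2vec-style trainer in place of the original sentences (for example gensim's `Word2Vec`, which accepts an iterable of token lists). The number of walks depends on the number of graph nodes rather than on the corpus length, which is the property the abstract relies on for stable runtime.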

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies