Word-Graph2vec: An efficient word embedding approach on word co-occurrence graph using random walk technique
Wenting Li, Jiahong Xue, Xi Zhang, Huacan Chen, Zeyu Chen, Feijuan Huang, and Yuanzhe Cai

TL;DR
Word-Graph2vec is a graph-based word embedding method that efficiently generates embeddings from large corpora by sampling from a word co-occurrence graph, outperforming traditional methods in speed and accuracy.
Contribution
The paper introduces Word-Graph2vec, a novel graph-based approach that improves efficiency and stability of word embedding training on large datasets using random walk sampling.
Findings
Is 4-5 times more efficient than Word2vec
Is 2-3 times more efficient than FastText
Maintains small error with random walk sampling
Abstract
Word embedding has become ubiquitous and is widely used in various natural language processing (NLP) tasks, such as web retrieval, web semantic analysis, and machine translation. Unfortunately, training word embeddings on a relatively large corpus is prohibitively expensive. We propose a graph-based word embedding algorithm, called Word-Graph2vec, which converts the large corpus into a word co-occurrence graph, samples word sequences from this graph by random walks, and finally trains the word embedding on the sampled corpus. We posit that, because English has a limited vocabulary and a large number of idioms and fixed expressions, the size and density of the word co-occurrence graph change only slightly as the training corpus grows. As a result, Word-Graph2vec has a stable runtime on large-scale data sets, and its performance advantage becomes more and more…
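The pipeline described in the abstract (corpus to co-occurrence graph, then random-walk sampling) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_cooccurrence_graph`, `random_walk`, and `sample_corpus`, the `window` parameter, and the choice of edge-weight-proportional transitions are all assumptions made for illustration, and the paper's actual sampling strategy may differ.

```python
import random
from collections import defaultdict


def build_cooccurrence_graph(sentences, window=2):
    """Build a directed weighted graph (hypothetical helper).

    graph[u][v] counts how often word v appears within `window`
    tokens after word u in the tokenized sentences.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for neighbor in tokens[i + 1 : i + 1 + window]:
                graph[word][neighbor] += 1
    return graph


def random_walk(graph, start, length, rng):
    """Walk up to `length` nodes, choosing each next word with
    probability proportional to its co-occurrence count (an assumption)."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1])
        if not neighbors:  # dead end: the word has no outgoing edges
            break
        words = list(neighbors)
        weights = [neighbors[w] for w in words]
        walk.append(rng.choices(words, weights=weights, k=1)[0])
    return walk


def sample_corpus(graph, walks_per_word, walk_length, seed=0):
    """Generate a sampled corpus: several walks from every word
    that has at least one outgoing edge."""
    rng = random.Random(seed)
    corpus = []
    for word in list(graph):
        for _ in range(walks_per_word):
            corpus.append(random_walk(graph, word, walk_length, rng))
    return corpus


if __name__ == "__main__":
    sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    graph = build_cooccurrence_graph(sentences, window=1)
    for walk in sample_corpus(graph, walks_per_word=2, walk_length=3):
        print(" ".join(walk))
```

The sampled walks play the role of the "sampling corpus" in the abstract and could then be passed to any Word2vec-style trainer. Because the number of walks depends on the graph rather than on the raw corpus length, the sampling cost in this sketch grows with vocabulary size and not with the number of input tokens, which is the intuition behind the stable-runtime claim.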
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
