PSDVec: a Toolbox for Incremental and Scalable Word Embedding

Shaohua Li; Jun Zhu; Chunyan Miao

arXiv:1606.03192·cs.CL·July 5, 2016

TL;DR

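PSDVec is a Python/Perl toolbox that learns word embeddings through a weighted low-rank positive semidefinite approximation. Its blockwise online learning algorithm scales to large vocabularies and adds new words without retraining, and its embeddings have the best average performance among popular word embedding tools on 9 word similarity/analogy benchmark sets and 2 NLP tasks.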

Contribution

The toolbox implements a blockwise online learning algorithm that trains word embeddings incrementally, which shortens training on large vocabularies and avoids re-learning the whole vocabulary when new words are added.

Findings

01

Achieves the best average performance among popular word embedding tools on 9 word similarity/analogy benchmark sets and 2 NLP tasks.

02

Greatly reduces the learning time of word embeddings on a large vocabulary.

03

Enables incremental learning of new words without full retraining.

Abstract

PSDVec is a Python/Perl toolbox that learns word embeddings, i.e., the mapping of words in a natural language to continuous vectors that encode the semantic/syntactic regularities between the words. PSDVec implements a word embedding learning method based on a weighted low-rank positive semidefinite approximation. To scale up the learning process, we implement a blockwise online learning algorithm to learn the embeddings incrementally. This strategy greatly reduces the learning time of word embeddings on a large vocabulary, and can learn the embeddings of new words without re-learning the whole vocabulary. On 9 word similarity/analogy benchmark sets and 2 Natural Language Processing (NLP) tasks, PSDVec produces embeddings that have the best average performance among popular word embedding tools. PSDVec provides a new option for NLP practitioners.
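The incremental step described in the abstract can be illustrated with a small sketch. This is not PSDVec's code or API. It assumes that the vectors of an already-trained "core" vocabulary are kept frozen, and that a new word's vector is fitted by weighted ridge least squares so that its dot product with each core vector approximates an association score between the two words. The function names (`solve_linear`, `embed_new_word`), the choice of scores and weights, and the `ridge` parameter are all hypothetical and introduced only for illustration.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [A | b] without modifying the inputs.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def embed_new_word(core_vecs, scores, weights, ridge=1e-6):
    """Fit a vector v for one new word while the core vectors stay fixed.

    Minimizes  sum_i weights[i] * (core_vecs[i] . v - scores[i])**2
               + ridge * |v|**2
    by solving the normal equations (hypothetical objective, for illustration).
    """
    d = len(core_vecs[0])
    # A accumulates ridge * I + sum_i w_i * u_i u_i^T; b accumulates sum_i w_i * g_i * u_i.
    A = [[ridge if r == c else 0.0 for c in range(d)] for r in range(d)]
    b = [0.0] * d
    for u, g, w in zip(core_vecs, scores, weights):
        for r in range(d):
            b[r] += w * g * u[r]
            for c in range(d):
                A[r][c] += w * u[r] * u[c]
    return solve_linear(A, b)


# Toy data: three frozen core vectors in 2 dimensions. The scores were
# generated from the vector [2, -1], so the fit should land close to it.
core = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, -1.0, 1.0]
counts = [1.0, 2.0, 3.0]  # assumed per-pair weights, e.g. co-occurrence counts
print(embed_new_word(core, scores, counts))
```

Because only a small system of size d x d is solved per new word, the cost of adding a word in this sketch does not depend on re-fitting the core vocabulary, which is the property the abstract attributes to the blockwise strategy.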
