How to Generate a Good Word Embedding?
Siwei Lai, Kang Liu, Liheng Xu, Jun Zhao

TL;DR
This paper systematically analyzes the key factors in training effective word embeddings (corpus domain, model complexity, and the early stopping criterion) and provides practical guidelines for improving embedding quality.
Contribution
It offers a comprehensive comparison of neural-network-based word embedding algorithms and establishes practical training guidelines based on empirical analysis.
Findings
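Corpus domain impacts embedding quality more than corpus size; once the domain is suitable, a larger corpus yields better results.
Faster models perform adequately for most tasks; more complex models can be used when the training corpus is sufficiently large.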
Early stopping should be based on task-specific development sets.
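The early stopping guideline can be illustrated with a minimal sketch. It is not the authors' procedure: the function names `train_one_epoch` and `evaluate_on_dev`, the `patience` parameter, and the assumption that a higher development-set score is better are all hypothetical choices made for illustration. The idea is only that the stopping decision is driven by a score on a development set for the target task rather than by the training loss of the embedding model.

```python
from typing import Any, Callable, Tuple


def train_with_early_stopping(
    train_one_epoch: Callable[[int], Any],
    evaluate_on_dev: Callable[[Any], float],
    max_epochs: int = 50,
    patience: int = 3,
) -> Tuple[Any, int, float]:
    """Keep the embedding with the best task-specific dev score.

    train_one_epoch(epoch) is a hypothetical callable that runs one more
    pass over the corpus and returns the current embedding.
    evaluate_on_dev(embedding) is a hypothetical callable that returns the
    target task's development-set score (higher is assumed to be better).
    """
    best_embedding = None
    best_epoch = -1
    best_score = float("-inf")

    for epoch in range(max_epochs):
        embedding = train_one_epoch(epoch)
        score = evaluate_on_dev(embedding)

        if score > best_score:
            # New best on the dev set: remember this checkpoint.
            best_embedding, best_epoch, best_score = embedding, epoch, score
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop.
            break

    return best_embedding, best_epoch, best_score
```

In practice `train_one_epoch` would wrap whichever embedding trainer is in use, and `evaluate_on_dev` would train and score the downstream model on held-out data from the task the embedding is meant for.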
Abstract
We analyze three critical components of word embedding training: the model, the corpus, and the training parameters. We systematize existing neural-network-based word embedding algorithms and compare them using the same corpus. We evaluate each word embedding in three ways: analyzing its semantic properties, using it as a feature for supervised tasks, and using it to initialize neural networks. We also provide several simple guidelines for training word embeddings. First, we discover that corpus domain is more important than corpus size. We recommend choosing a corpus in a suitable domain for the desired task; after that, using a larger corpus yields better results. Second, we find that faster models provide sufficient performance in most cases, and more complex models can be used if the training corpus is sufficiently large. Third, the early stopping metric for iterating should rely on…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
Methods: Early Stopping
