Word2vec Skip-gram Dimensionality Selection via Sequential Normalized Maximum Likelihood

Pham Thuc Hung; Kenji Yamanishi

arXiv:2008.07720 · cs.LG · August 26, 2020 · 5 cites

Open Access

TL;DR

This paper introduces an information-criterion-based method, built mainly on Sequential Normalized Maximum Likelihood (SNML), for selecting the dimensionality of word2vec Skip-gram models, with the aim of choosing the model closest to the true contextual distribution among words.

Contribution

It proposes a novel SNML-based approach to dimensionality selection for word2vec Skip-gram models, adds heuristics that make the SNML computation efficient, and validates the approach empirically against AIC and BIC.
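For orientation, the SNML codelength is commonly written in the general MDL literature as a sum of per-symbol codelengths, each a normalized maximum likelihood taken over the next symbol only. The notation below (a data sequence x^n, a model with parameter θ, and its maximum likelihood estimate θ̂) is an assumption made for illustration and is not taken from the paper, whose Skip-gram-specific formulation and heuristics may differ.

```latex
% x^t = x_1 ... x_t is the data seen up to step t;
% \hat{\theta}(\cdot) is the maximum likelihood estimate on its argument.
\begin{align}
L_{\mathrm{SNML}}(x^n)
  &= \sum_{t=1}^{n} L_{\mathrm{SNML}}(x_t \mid x^{t-1}), \\
L_{\mathrm{SNML}}(x_t \mid x^{t-1})
  &= -\log \frac{p\bigl(x^{t-1} x_t;\ \hat{\theta}(x^{t-1} x_t)\bigr)}
               {\sum_{x'} p\bigl(x^{t-1} x';\ \hat{\theta}(x^{t-1} x')\bigr)}.
\end{align}
```

Under this reading, the selected dimensionality is the candidate whose model yields the shortest total codelength. The sum in the denominator ranges over every possible next symbol, which suggests why efficient approximations matter for large vocabularies.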

Findings

01

SNML outperforms BIC and AIC in dimensionality selection.

02

The dimensionality selected by SNML aligns closely with the optimal values found on word similarity tasks.

03

Heuristics enable practical computation of SNML for large datasets.

Abstract

In this paper, we propose a novel information-criteria-based approach to selecting the dimensionality of the word2vec Skip-gram (SG) model. From the perspective of probability theory, SG can be viewed as implicitly estimating a probability distribution, under the assumption that there exists a true contextual distribution among words. We therefore apply information criteria to select the dimensionality whose model is as close as possible to that true distribution. We examine the following information criteria for the dimensionality selection problem: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Sequential Normalized Maximum Likelihood (SNML) criterion. SNML is the total codelength required for the sequential encoding of a data sequence on the basis of the minimum description length principle. The proposed approach is applied to both…
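As a rough illustration of the two classical criteria named in the abstract, the sketch below scores candidate dimensionalities with the textbook definitions AIC = -2 log L + 2k and BIC = -2 log L + k log n. It is a minimal sketch under stated assumptions, not the paper's procedure: the helper names are hypothetical, the parameter count k = 2 · V · d assumes one word matrix and one context matrix of size V × d, and the log-likelihood values are made-up toy numbers rather than results from any trained model.

```python
import math


def num_parameters(vocab_size: int, dim: int) -> int:
    """Assumed parameter count: word and context embedding matrices, each V x d."""
    return 2 * vocab_size * dim


def aic(log_likelihood: float, k: int) -> float:
    """Akaike Information Criterion (lower is better)."""
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian Information Criterion for n observations (lower is better)."""
    return -2.0 * log_likelihood + k * math.log(n)


def select_dimension(log_likelihoods, vocab_size, n, criterion="aic"):
    """Return (best_dim, scores) for a dict mapping dimension -> log-likelihood."""
    if criterion not in ("aic", "bic"):
        raise ValueError("criterion must be 'aic' or 'bic'")
    scores = {}
    for dim, ll in log_likelihoods.items():
        k = num_parameters(vocab_size, dim)
        scores[dim] = aic(ll, k) if criterion == "aic" else bic(ll, k, n)
    best_dim = min(scores, key=scores.get)
    return best_dim, scores


# Toy inputs (hypothetical, for illustration only).
toy_log_likelihoods = {2: -5200.0, 4: -5000.0, 8: -4950.0}
toy_vocab_size = 10
toy_num_pairs = 1000  # number of (word, context) training pairs

best_aic, aic_scores = select_dimension(
    toy_log_likelihoods, toy_vocab_size, toy_num_pairs, criterion="aic"
)
best_bic, bic_scores = select_dimension(
    toy_log_likelihoods, toy_vocab_size, toy_num_pairs, criterion="bic"
)
print("AIC picks dimension", best_aic)
print("BIC picks dimension", best_bic)
```

Both criteria trade goodness of fit against the parameter count, which grows linearly with the dimensionality. BIC's penalty also scales with log n, so on larger corpora it tends to favor smaller dimensionalities than AIC does.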

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis