Word2vec Skip-gram Dimensionality Selection via Sequential Normalized Maximum Likelihood
Pham Thuc Hung, Kenji Yamanishi

TL;DR
This paper introduces an information-criterion-based method, centered on Sequential Normalized Maximum Likelihood (SNML), for selecting the dimensionality of word2vec Skip-gram models so that the chosen model is as close as possible to the true contextual distribution of words.
Contribution
The paper proposes an SNML-based approach to dimensionality selection for word2vec Skip-gram models, adds heuristics that make the SNML computation efficient, and compares the approach empirically against AIC and BIC.
Findings
SNML outperforms BIC and AIC in dimensionality selection.
The dimensionality selected by SNML aligns closely with the values that perform best on word similarity tasks.
Heuristics enable practical computation of SNML for large datasets.
Abstract
In this paper, we propose a novel information criteria-based approach to select the dimensionality of the word2vec Skip-gram (SG). From the perspective of the probability theory, SG is considered as an implicit probability distribution estimation under the assumption that there exists a true contextual distribution among words. Therefore, we apply information criteria with the aim of selecting the best dimensionality so that the corresponding model can be as close as possible to the true distribution. We examine the following information criteria for the dimensionality selection problem: the Akaike Information Criterion, Bayesian Information Criterion, and Sequential Normalized Maximum Likelihood (SNML) criterion. SNML is the total codelength required for the sequential encoding of a data sequence on the basis of the minimum description length. The proposed approach is applied to both…
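The three criteria named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. AIC and BIC use their standard definitions. The parameter count `2 * vocab_size * dim` (one input and one output vector per word) is an assumption about the Skip-gram model, not a figure taken from the paper. The SNML function is shown on a toy categorical model rather than on Skip-gram, and it omits the paper's computational heuristics. All function names are hypothetical.

```python
import math
from collections import Counter


def skipgram_num_params(vocab_size, dim):
    # Assumption: one input and one output embedding of size `dim` per word.
    return 2 * vocab_size * dim


def aic(log_likelihood, num_params):
    # Akaike Information Criterion: -2 log L + 2k (smaller is better).
    return -2.0 * log_likelihood + 2.0 * num_params


def bic(log_likelihood, num_params, num_samples):
    # Bayesian Information Criterion: -2 log L + k ln n (smaller is better).
    return -2.0 * log_likelihood + num_params * math.log(num_samples)


def select_dimension(scores):
    # `scores` maps candidate dimensionality -> criterion value.
    return min(scores, key=scores.get)


def _log_sum_exp(values):
    # Numerically stable log(sum(exp(v))) for a list of log-values.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def _max_log_likelihood(counts, total):
    # Log-likelihood of a categorical sample evaluated at its ML estimate.
    return sum(c * math.log(c / total) for c in counts.values() if c > 0)


def snml_codelength(sequence, alphabet):
    """Total SNML codelength (in nats) of `sequence` under a toy categorical model.

    At each step t the maximized likelihood of the data seen so far is
    normalized over every symbol that could have appeared at position t,
    and the negative log of that ratio is accumulated.
    """
    counts = Counter()
    total = 0.0
    for t, symbol in enumerate(sequence, start=1):
        candidates = {}
        for a in alphabet:
            counts[a] += 1  # tentatively append candidate symbol `a`
            candidates[a] = _max_log_likelihood(counts, t)
            counts[a] -= 1  # undo the tentative append
        log_norm = _log_sum_exp(list(candidates.values()))
        total += -(candidates[symbol] - log_norm)
        counts[symbol] += 1  # commit the symbol that actually occurred
    return total
```

In use, one would train a model for each candidate dimensionality, compute a criterion value for each, and pass the resulting dictionary to `select_dimension`. For SNML the value is a codelength, so the smallest total is again the preferred choice.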
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
