On the number of k-skip-n-grams
Dmytro Krasnoshtan

TL;DR
This paper derives a mathematical formula to precisely count the number of k-skip-n-grams in a text corpus, which is useful for understanding their distribution and application in NLP tasks.
Contribution
It provides a closed-form expression for the number of k-skip-n-grams, advancing the theoretical understanding of n-gram sampling methods in natural language processing.
Findings
Derived a formula for counting k-skip-n-grams
The formula accounts for corpus length and skip parameters
Facilitates more accurate analysis of skip-gram models
Abstract
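The paper proves that the number of k-skip-n-grams for a corpus of size $L$ is
$$\frac{Ln + n + k' - n^2 - nk'}{n} \cdot \binom{n - 1 + k'}{n - 1},$$
where $k' = \min(L - n + 1, k)$.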
Keywords: NLP, skip-grams
1 Introduction
Skip-gram [1] is a popular technique in natural language processing in which, in addition to contiguous sequences of words, we also allow a word to be replaced by a skip token. The model is used to overcome the data sparsity problem and provides an efficient method for learning high-quality vector representations for phrases.
Guthrie et al. further investigated the use of skip-grams by introducing k-skip-n-grams [2] and showed empirically that they can be more effective than increasing the size of the training corpus. In their paper, they also gave the following formula for the number of k-skip-trigrams ($n = 3$) for a corpus of size $L$:
$$\frac{(k + 1)(k + 2)}{6}(3L - 2k - 6)$$
The purpose of this paper is to derive the general formula for arbitrary $L$, $n$, and $k$.
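To make the object being counted concrete, here is a minimal sketch that enumerates skip-grams for a short example sentence. It assumes that a k-skip-n-gram is any choice of $n$ words, in order, that skips at most $k$ words in total between its first and last word, and that n-grams are counted by position rather than by distinct word content. The function name `k_skip_n_grams` and the example sentence are illustrative and not taken from the paper. The trigram formula is assumed to have the form $\frac{(k+1)(k+2)}{6}(3L - 2k - 6)$.

```python
from itertools import combinations


def k_skip_n_grams(tokens, n, k):
    """Return all n-grams (as tuples of words) that skip at most k words in total."""
    grams = []
    for positions in combinations(range(len(tokens)), n):
        # Number of words skipped between the first and last chosen positions.
        skipped = positions[-1] - positions[0] + 1 - n
        if skipped <= k:
            grams.append(tuple(tokens[p] for p in positions))
    return grams


tokens = "the quick brown fox jumps".split()  # hypothetical example corpus
L, k = len(tokens), 2
trigrams = k_skip_n_grams(tokens, n=3, k=k)

# Trigram formula in the assumed form; the product is divisible by 6 here.
trigram_formula = (k + 1) * (k + 2) * (3 * L - 2 * k - 6) // 6
print(len(trigrams), trigram_formula)
```

For this five-word sentence the enumeration and the trigram formula agree, which is the kind of agreement the general formula is meant to extend to arbitrary $n$.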
2 Proof
The general formula follows from an algorithm for constructing k-skip-n-grams. There are several recursive algorithms for constructing them, but the one that makes the counting easiest relies on the following intuition:
The number of k-skip-n-grams is equal to the number of n-grams with 0 skips, plus the number with exactly 1 skip, plus the number with exactly 2 skips, and so on up to the number with exactly $k$ skips. So if the number of n-grams with exactly $i$ skips is $s_i$, then the total number of all k-skip-n-grams is $\sum_{i=0}^{k} s_i$.
To derive the formula for $s_i$, the number of n-grams with exactly $i$ skips, let's see how such an n-gram can be generated. Generating an n-gram with $i$ skips is equivalent to selecting a contiguous sequence of length $n + i$ and substituting $i$ of its elements with skips. It is important to realize that you can't substitute the first or the last element, as the resulting n-gram would be equivalent to
- an $(i-1)$-skip-n-gram if you substitute only one (first or last) element with a skip
- an $(i-2)$-skip-n-gram if you substitute both (first and last) elements with a skip
So we need to choose $i$ substitutions from $n + i - 2$ positions, which can be done in $\binom{n + i - 2}{i}$ different ways. Because we can generate $L - n - i + 1$ (which should be non-negative) different substrings of length $n + i$ from a corpus of size $L$, the total number of n-grams with exactly $i$ skips is
$$s_i = (L - n - i + 1)\binom{n + i - 2}{i}$$
Therefore the total formula for k-skip-n-grams is
$$\sum_{i=0}^{k} (L - n - i + 1)\binom{n + i - 2}{i}$$
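The per-skip count can be checked by brute force. The sketch below assumes the count of n-grams with exactly $i$ skips is $(L - n - i + 1)\binom{n + i - 2}{i}$, as the window argument suggests, and compares it with a direct enumeration over word positions. The helper names and the values $L = 8$, $n = 3$ are illustrative choices, and the formula helper assumes $n \ge 2$ and $n + i \le L$.

```python
from collections import Counter
from itertools import combinations
from math import comb


def counts_by_exact_skips(L, n):
    """Brute force: map i -> number of n-grams with exactly i skipped words."""
    counts = Counter()
    for positions in combinations(range(L), n):
        # Skipped words are those inside the span that were not chosen.
        counts[positions[-1] - positions[0] + 1 - n] += 1
    return counts


def exact_skip_formula(L, n, i):
    """(L - n - i + 1) * C(n + i - 2, i); assumes n >= 2 and n + i <= L."""
    return (L - n - i + 1) * comb(n + i - 2, i)


L, n = 8, 3  # illustrative corpus size and n-gram length
observed = counts_by_exact_skips(L, n)
predicted = {i: exact_skip_formula(L, n, i) for i in range(L - n + 1)}

for i in sorted(predicted):
    print(i, observed[i], predicted[i])
```

Summing the predicted values for $i = 0, \dots, k$ then gives the number of k-skip-n-grams for any $k \le L - n$.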
This expression can be simplified using the following identities:
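- $\sum_{i=0}^{k} \binom{n + i - 2}{i} = \binom{n + k - 1}{k}$, which can be proved by induction
- $\sum_{i=0}^{k} i\binom{n + i - 2}{i} = (n - 1)\binom{n + k - 1}{k - 1}$, which can be proved by induction
- $\binom{n + k - 1}{k - 1} = \frac{k}{n}\binom{n + k - 1}{k}$ and $\binom{n + k - 1}{k} = \binom{n + k - 1}{n - 1}$, which can be proved from the definition of the binomial coefficient
So
$$\sum_{i=0}^{k} (L - n - i + 1)\binom{n + i - 2}{i} = (L - n + 1)\binom{n + k - 1}{k} - (n - 1)\binom{n + k - 1}{k - 1} = \frac{Ln + n + k - n^2 - nk}{n} \cdot \binom{n - 1 + k}{n - 1}$$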
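As a sanity check, and assuming the simplified expression is $\frac{Ln + n + k - n^2 - nk}{n} \cdot \binom{n - 1 + k}{n - 1}$, setting $n = 3$ should recover the trigram formula of Guthrie et al. [2]:

```latex
\frac{3L + 3 + k - 9 - 3k}{3} \cdot \binom{k + 2}{2}
  = \frac{3L - 2k - 6}{3} \cdot \frac{(k + 1)(k + 2)}{2}
  = \frac{(k + 1)(k + 2)}{6}\,(3L - 2k - 6)
```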
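The formula is almost complete apart from a few corner cases. If $L < n$, we cannot select any n-grams and the result should be zero. Previously it was also mentioned that $L - n - k + 1 \ge 0$, which is the same as $k \le L - n + 1$. Larger values of $k$ add no further n-grams, so the formula is evaluated with $k' = \min(L - n + 1, k)$ in place of $k$.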
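Putting the corner cases together, a minimal sketch of the closed form might look as follows. It assumes the final expression is $\frac{Ln + n + k' - n^2 - nk'}{n} \cdot \binom{n - 1 + k'}{n - 1}$ with $k' = \min(L - n + 1, k)$, and that $n \ge 1$ and $k \ge 0$. The function name `count_k_skip_n_grams` is illustrative and is not taken from the linked repository.

```python
from math import comb


def count_k_skip_n_grams(L, n, k):
    """Closed-form count of k-skip-n-grams in a corpus of L tokens (n >= 1, k >= 0)."""
    if L < n:
        return 0  # no window of n words fits in the corpus
    k_eff = min(L - n + 1, k)  # skips beyond this bound add no new n-grams
    numerator = L * n + n + k_eff - n * n - n * k_eff
    # The count is an integer, so the product is exactly divisible by n.
    return numerator * comb(n - 1 + k_eff, n - 1) // n


print(count_k_skip_n_grams(5, 3, 2))    # 2-skip-trigrams in a 5-word corpus
print(count_k_skip_n_grams(5, 3, 100))  # k larger than the corpus allows
print(count_k_skip_n_grams(2, 3, 1))    # corpus shorter than n
```

Working in integers and dividing by $n$ only at the end avoids floating-point rounding for large corpora.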
3 Additional materials
The code and verification for the formula are available at https://github.com/salvador-dali/k-skip-n-gram
References
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.
[2] David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. A Closer Look at Skip-gram Modelling. Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), 2006.
