On the number of k-skip-n-grams

Dmytro Krasnoshtan

arXiv:1905.05407·cs.CL·May 15, 2019

TL;DR

This paper derives a mathematical formula to precisely count the number of k-skip-n-grams in a text corpus, which is useful for understanding their distribution and application in NLP tasks.

Contribution

It provides a closed-form expression for the number of k-skip-n-grams, advancing the theoretical understanding of n-gram sampling methods in natural language processing.

Findings

01 Derived a formula for counting k-skip-n-grams

02 The formula accounts for corpus length and skip parameters

03 Facilitates more accurate analysis of skip-gram models

Abstract

The paper proves that the number of k-skip-n-grams for a corpus of size $L$ is $\frac{Ln+n+k^{\prime}-n^{2}-nk^{\prime}}{n}\cdot\binom{n-1+k^{\prime}}{n-1}$, where $k^{\prime}=\min(L-n+1,k)$.

Equations

$$\frac{Ln+n+k^{\prime}-n^{2}-nk^{\prime}}{n}\cdot\binom{n-1+k^{\prime}}{n-1}$$

$$\frac{(k+1)(k+2)}{6}(3L-2k-6)$$

$$f(L,n,k)=\binom{n+k-2}{k}\cdot(L-n-k+1)=\binom{n+k-2}{n-2}\cdot(L-n-k+1)$$

$$A=\sum_{i=0}^{k}\binom{n+i-2}{n-2}\cdot(L-n-i+1)$$

$$
\begin{aligned}
A&=\sum_{i=0}^{k}\binom{n-2+i}{n-2}\cdot(L-n+1)-\sum_{i=0}^{k}i\binom{n-2+i}{n-2}\\
&=\binom{n-1+k}{n-1}\cdot(L-n+1)-\frac{k(k+1)}{n}\binom{n-1+k}{n-2}\\
&=\binom{n-1+k}{n-1}\cdot(L-n+1)-\frac{k(n-1)}{n}\cdot\binom{n-1+k}{n-1}\\
&=\frac{Ln+n+k-n^{2}-kn}{n}\cdot\binom{n-1+k}{n-1}
\end{aligned}
$$

Full text

Abstract

The paper proves that the number of k-skip-n-grams for a corpus of size $L$ is

$$\frac{Ln+n+k^{\prime}-n^{2}-nk^{\prime}}{n}\cdot\binom{n-1+k^{\prime}}{n-1}$$

where $k^{\prime}=\min(L-n+1,k)$ .

Keywords: NLP, skip-grams

1 Introduction

Skip-gram [1] is a popular technique in natural language processing in which, in addition to contiguous sequences of words, a word may be substituted with a skip token. The model is used to overcome the data sparsity problem and provides an efficient method for learning high-quality vector representations for phrases.

Guthrie et al. further investigated the use of skip-grams by introducing k-skip-n-grams [2] and showed empirically that they can be more effective than increasing the size of the training corpus. In their paper, they also provided the following formula for calculating the number of k-skip-trigrams ( $n=3$ ) for a corpus of size $L$ :

$$\frac{(k+1)(k+2)}{6}(3L-2k-6)$$

The purpose of this paper is to derive the general case formula for arbitrary $L$ , $n$ , and $k$ .
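As a consistency check, the general formula stated in the abstract should reduce to the trigram formula of Guthrie et al. when $n=3$. The short derivation below is a sketch that assumes $k\leq L-2$, so that $k^{\prime}=k$:

$$
\begin{aligned}
\frac{3L+3+k-9-3k}{3}\cdot\binom{k+2}{2}
&=\frac{3L-2k-6}{3}\cdot\frac{(k+1)(k+2)}{2}\\
&=\frac{(k+1)(k+2)}{6}(3L-2k-6)
\end{aligned}
$$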

2 Proof

The general formula can be derived from an algorithm for constructing the k-skip-n-grams. There are a few recursive algorithms to construct them, but the one that makes the counting easier relies on the following intuition:

The number of k-skip-n-grams is equal to the number of n-grams with 0 skips, plus the number of n-grams with exactly 1 skip, plus the number of n-grams with exactly 2 skips, and so on up to the number of n-grams with exactly k skips. So if the number of n-grams with exactly $k$ skips is $f(L,n,k)$ , then the total number of all k-skip-n-grams is $\sum_{i=0}^{k}f(L,n,i)$ .
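The decomposition by exact number of skips can be illustrated with a small brute-force enumeration. This is a minimal sketch, not the author's code: it assumes that a k-skip-n-gram is identified by the positions of its $n$ words, that the skips are the words left out between the first and last chosen word, and that $n\geq 1$. The names `skip_ngrams`, `count_by_exact_skips`, and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def skip_ngrams(tokens, n, k):
    """Yield (skips, ngram) for every choice of n positions (n >= 1)
    that leaves at most k skipped words between first and last word."""
    for positions in combinations(range(len(tokens)), n):
        # Skipped words = length of the covered span minus the n kept words.
        skips = positions[-1] - positions[0] + 1 - n
        if skips <= k:
            yield skips, tuple(tokens[p] for p in positions)


def count_by_exact_skips(tokens, n, k):
    """Return a mapping {i: number of n-grams with exactly i skips}."""
    return Counter(skips for skips, _ in skip_ngrams(tokens, n, k))


# Hypothetical toy corpus of size L = 5 (an assumption for illustration).
tokens = ["a", "b", "c", "d", "e"]
by_skips = count_by_exact_skips(tokens, n=3, k=2)
total = sum(by_skips.values())  # total number of 2-skip-trigrams
```

For this toy corpus the per-skip counts add up to the total number of 2-skip-trigrams, which is the sum $\sum_{i=0}^{k}f(L,n,i)$ described above.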

To derive the formula for $f$ , let’s see how we can generate an n-gram with exactly $k$ skips. Generating n-grams with $k$ skips is equivalent to selecting a sequence of length $n+k$ and substituting any $k$ of its elements with skips. It is important to realize that you can’t substitute the first or the last element, as the resulting n-gram would be equivalent to

• a (k-1)-skip-n-gram if you substitute only one (first or last) element with a skip

• a (k-2)-skip-n-gram if you substitute both (first and last) elements with a skip

So we need to choose $k$ substitutions from the $n+k-2$ interior positions, which can be done in $\binom{n+k-2}{k}$ different ways. Because a corpus of size $L$ contains $L-n-k+1$ different substrings of length $n+k$ (this number must be $>0$ ), the total number of n-grams with exactly $k$ skips is

$$f(L,n,k)=\binom{n+k-2}{k}\cdot(L-n-k+1)=\binom{n+k-2}{n-2}\cdot(L-n-k+1)$$
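The count of n-grams with exactly $k$ skips, $\binom{n+k-2}{k}\cdot(L-n-k+1)$, can be checked against direct enumeration. The sketch below is an illustration under stated assumptions, not the code from the paper's repository: it assumes $n\geq 2$ (so the binomial is well defined for `math.comb`), returns zero when no window of length $n+k$ fits, and uses the hypothetical names `f` and `brute_force_exact`.

```python
from itertools import combinations
from math import comb


def f(L, n, k):
    """n-grams with exactly k skips: C(n+k-2, k) * (L-n-k+1), for n >= 2."""
    windows = L - n - k + 1  # number of substrings of length n + k
    if windows <= 0:
        return 0  # assumption: no window fits, so nothing is counted
    return comb(n + k - 2, k) * windows


def brute_force_exact(L, n, k):
    """Count choices of n positions in range(L) whose span is exactly n + k."""
    return sum(
        1
        for pos in combinations(range(L), n)
        if pos[-1] - pos[0] + 1 == n + k
    )


def agrees(max_L=8, max_k=5):
    """Compare the formula with enumeration on small inputs."""
    return all(
        f(L, n, k) == brute_force_exact(L, n, k)
        for L in range(2, max_L + 1)
        for n in range(2, L + 1)
        for k in range(max_k + 1)
    )
```

Calling `agrees()` compares the two counts for all small corpora up to the given size; the enumeration is exponential, so it is only meant for tiny values of `L`.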

Therefore the total formula for k-skip-n-grams is

$$A=\sum_{i=0}^{k}\binom{n+i-2}{n-2}\cdot(L-n-i+1)$$

This expression can be simplified using the following identities:

• $\sum_{i=0}^{k}\binom{a+i}{a}=\binom{a+k+1}{a+1}$ , which can be proved by induction

• $\sum_{i=0}^{k}i\binom{a+i}{a}=\frac{k(k+1)}{a+2}\binom{a+k+1}{a}$ , which can be proved by induction

• $\binom{n+k}{n}=\frac{k+1}{n}\binom{n+k}{n-1}$ , which can be proved from the definition of the binomial coefficient
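The three identities above can be spot-checked numerically before relying on them. The following is a small sketch with hypothetical function names; it is a sanity check on small values, not a proof. Both sides are multiplied by the denominator so that only integer arithmetic is used.

```python
from math import comb


def hockey_stick_holds(a, k):
    """sum_{i=0}^{k} C(a+i, a) == C(a+k+1, a+1)"""
    return sum(comb(a + i, a) for i in range(k + 1)) == comb(a + k + 1, a + 1)


def weighted_sum_holds(a, k):
    """(a+2) * sum_{i=0}^{k} i*C(a+i, a) == k*(k+1) * C(a+k+1, a)"""
    lhs = (a + 2) * sum(i * comb(a + i, a) for i in range(k + 1))
    return lhs == k * (k + 1) * comb(a + k + 1, a)


def binomial_shift_holds(n, k):
    """n * C(n+k, n) == (k+1) * C(n+k, n-1), for n >= 1"""
    return n * comb(n + k, n) == (k + 1) * comb(n + k, n - 1)


def all_identities_hold(limit=8):
    """Check the three identities for all small non-negative arguments."""
    return (
        all(hockey_stick_holds(a, k) for a in range(limit) for k in range(limit))
        and all(weighted_sum_holds(a, k) for a in range(limit) for k in range(limit))
        and all(binomial_shift_holds(n, k) for n in range(1, limit) for k in range(limit))
    )
```

In the derivation that follows, the first two identities are used with $a=n-2$ and the third with $n-1$ in place of $n$.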

So

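$$
\begin{aligned}
A&=\sum_{i=0}^{k}\binom{n-2+i}{n-2}\cdot(L-n+1)-\sum_{i=0}^{k}i\binom{n-2+i}{n-2}\\
&=\binom{n-1+k}{n-1}\cdot(L-n+1)-\frac{k(k+1)}{n}\binom{n-1+k}{n-2}\\
&=\binom{n-1+k}{n-1}\cdot(L-n+1)-\frac{k(n-1)}{n}\cdot\binom{n-1+k}{n-1}\\
&=\frac{Ln+n+k-n^{2}-kn}{n}\cdot\binom{n-1+k}{n-1}
\end{aligned}
$$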

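The formula is almost complete apart from a few corner cases. If $n=0$ , we do not select any n-grams and the result should be zero. It was also mentioned previously that $L-n-k+1>0$ is required. This is enforced by replacing $k$ with $k^{\prime}=\min(L-n+1,k)$ : an n-gram with $L-n+1$ or more skips does not fit into the corpus, and the term for exactly $L-n+1$ skips is zero.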
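The closed form, including the corner cases, can be compared with the term-by-term sum on small inputs. This is a hedged sketch rather than the repository's verification code: the names `closed_form` and `direct_sum` are hypothetical, and returning zero when $L<n$ is an extra assumption made here so that `math.comb` never receives a negative argument.

```python
from math import comb


def closed_form(L, n, k):
    """Number of k-skip-n-grams in a corpus of size L via the closed form."""
    if n == 0 or L < n:
        return 0  # n = 0 is the paper's corner case; L < n is an assumption
    kp = min(L - n + 1, k)  # k' from the paper
    numerator = (L * n + n + kp - n * n - n * kp) * comb(n - 1 + kp, n - 1)
    quotient, remainder = divmod(numerator, n)
    assert remainder == 0  # the count is an integer
    return quotient


def direct_sum(L, n, k):
    """Sum of C(n+i-2, i) * (L-n-i+1) over i = 0..k, for n >= 2.
    Terms whose window does not fit in the corpus contribute zero."""
    return sum(
        comb(n + i - 2, i) * max(L - n - i + 1, 0)
        for i in range(k + 1)
    )


def agrees(max_L=8, max_n=5, max_k=9):
    """Compare closed form and direct summation on small inputs."""
    return all(
        closed_form(L, n, k) == direct_sum(L, n, k)
        for L in range(1, max_L + 1)
        for n in range(2, max_n + 1)
        for k in range(max_k + 1)
    )
```

Because `closed_form` caps the number of skips at $k^{\prime}$, it stays equal to `direct_sum` even when `k` is larger than the corpus allows.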

3 Additional materials

The code and verification for the formula are available at https://github.com/salvador-dali/k-skip-n-gram

Bibliography

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.
[2] David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. A Closer Look at Skip-gram Modelling. Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), 2006.