Using data-compressors for statistical analysis of problems on homogeneity testing and classification
Boris Ryabko, Andrey Guskov, Irina Selivanova

TL;DR
This paper demonstrates how data compressors can be utilized to develop classical statistical methods for homogeneity testing and classification, bridging the gap between data compression and statistical analysis.
Contribution
It introduces a novel approach to apply data compressors within the framework of mathematical statistics for homogeneity testing and classification.
Findings
Data compressors can be effectively used for statistical analysis.
Classical statistical methods can be reformulated using data compression techniques.
The approach bridges the gap between text analysis and mathematical statistics.
Using data-compressors for statistical analysis of problems on
homogeneity testing and classification
Boris Ryabko
Institute of Computational Technologies of SB RAS
Novosibirsk, Russian Federation
Andrey Guskov
The State Public Scientific Technological
Library of SB RAS, Novosibirsk, Russian Federation
Irina Selivanova
The State Public Scientific Technological
Library of SB RAS, Novosibirsk, Russian Federation
Abstract
Nowadays data compressors are applied to many problems of text analysis, but many such applications are developed outside of the framework of mathematical statistics. In this paper we overcome this obstacle and show how several methods of classical mathematical statistics can be developed based on applications of the data compressors.
keywords: data compression, hypothesis testing, homogeneity test, classification, universal code.
I Introduction
Data compression methods (or universal codes) were discovered in the 1960s and are now widely used to compress texts for storage or transmission. Over the last thirty years it has been recognized that data compressors can be used for many purposes far removed from file compaction. In particular, it was shown that methods of data compression can be used for prediction and hypothesis testing for time series in the framework of classical mathematical statistics; see [15] and the review there. Later, several authors applied data compressors to problems which are close, in spirit, to homogeneity testing, estimation of correlation and covariance, classification, clustering and some others; see [10, 3, 4, 11, 18]. The main idea of their approach can be understood from the following example. Suppose that there are three sequences of letters $x$, $y$, $z$ and a certain data compressor $\varphi$. The sequences $x$, $y$ obey different probability distributions, whereas $z$ obeys one of them. The goal is to determine this distribution. (This is the well-known “three samples problem”.) If $z$ and $x$ obey the same probability distribution, then the sequence $z$ will be compressed better after $x$ than after $y$. More precisely, if one compresses the sequences $x$, $y$ and the combined ones $xz$ and $yz$, the difference $|\varphi(xz)| - |\varphi(x)|$ will be less than $|\varphi(yz)| - |\varphi(y)|$. (Here $|u|$ is the length of the word $u$.) For instance, let $x$, $z$ be texts in English, whereas $y$ is in German. Then the English text $z$ will be compressed better after the text in the same language ($x$) than after the text in German ($y$), i.e. the first difference will be less than the second one.
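The three-sample comparison described in this example can be tried with an off-the-shelf compressor. The following is a minimal sketch, assuming Python's `zlib` as a stand-in for the data compressor and toy word lists in place of real English and German texts. The helper names (`code_length`, `conditional_length`, `sample_text`) and the vocabularies are illustrative assumptions and do not come from the paper.

```python
import random
import zlib

def code_length(data: bytes) -> int:
    """Compressed size in bytes; plays the role of the codeword length."""
    return len(zlib.compress(data, 9))

def conditional_length(target: bytes, context: bytes) -> int:
    """Extra bytes needed for `target` when it is compressed after `context`."""
    return code_length(context + target) - code_length(context)

def sample_text(vocabulary, n_words, seed):
    """Toy 'text' (assumption): words drawn uniformly from a small vocabulary."""
    rng = random.Random(seed)
    return (" ".join(rng.choice(vocabulary) for _ in range(n_words)) + " ").encode()

# Hypothetical vocabularies standing in for English and German texts.
ENGLISH = ["the", "house", "water", "river", "morning", "people",
           "between", "always", "garden", "window", "bread", "yellow"]
GERMAN = ["der", "haus", "wasser", "fluss", "morgen", "leute",
          "zwischen", "immer", "garten", "fenster", "brot", "gelb"]

x = sample_text(ENGLISH, 400, seed=1)   # first known sample
y = sample_text(GERMAN, 400, seed=2)    # second known sample
z = sample_text(ENGLISH, 100, seed=3)   # sample of unknown origin

after_x = conditional_length(z, x)      # cost of z when compressed after x
after_y = conditional_length(z, y)      # cost of z when compressed after y
guess = "x" if after_x < after_y else "y"
print(after_x, after_y, guess)
```

A real compressor only approximates a universal code, so the comparison is meaningful only when the samples are long enough for the compressor to learn their statistics.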
This natural approach was used for authorship attribution of literary and musical texts, for estimating the closeness of DNA sequences, for constructing phylogenetic trees and for many other problems ([10, 3, 4, 11, 18, 19, 6]). Many papers (see [11] and the review there) were devoted to measuring the interdependence between sequences (or the association, similarity, closeness, etc.), because such measures play an important role in clustering, classification and some other methods of text analysis. It is important to note that these approaches lie outside the framework of mathematical statistics and, in particular, do not make it possible to reason about the consistency of estimates, tests, classifiers, clustering, etc.
It is important that modern data compressors are based on so-called universal codes, and the main properties of universal codes are valid for them (as far as asymptotic properties can be valid for a real computer program). A formal definition of universal codes is given in Appendix 1; here we informally note that universal codes can compress sequences generated by a source with unknown statistics down to its Shannon entropy, which, in turn, is a lower limit on lossless compression. Note that nowadays there are many classes of universal codes which are based on different ideas and have shown their practical efficiency. Among the most popular we mention the PPM algorithm, which is used along with the arithmetic code ([1, 14]), Lempel-Ziv codes ([20, 21]), the Burrows-Wheeler transform ([2, 12]), which is used along with the “book stack” (or MTF) code ([17, 16]), and some others (see [13, 15] for a review).
In this paper we show how the idea of compression of combined texts described above can be used for solving problems of homogeneity testing, classification, and estimation of a measure of interdependence, or the association. A distinction of the suggested method from other approaches is that it belongs to the framework of mathematical statistics.
II Definitions and problem formulations
First we briefly consider the main properties of so-called universal codes, paying the main attention to their meaning whereas formal definitions of codes, stationary ergodic sources and Shannon entropy are given in Appendix 1 and can be found in [5, 15].
Here we only note that we consider so-called lossless codes which encode words over alphabet using words from the binary alphabet in such a way that any word can be decoded without mistakes, i.e. there exists a map such that for any word over . Let a code be applied to encode sequences generated by a stationary ergodic source . The value is called the redundancy, where is the limit entropy, see Appendix 1. (It is known that is an attainable lower limit for lossless codes, that is why the difference is called the redundancy.) By definition, the code is called universal if the redundancy goes to 0 as grows, i.e. (In other words, a universal code compresses sequences generated by any stationary ergodic source till the limit value.)
The main goal of the paper is to give a compression-based solution for the following problems:
i) Homogeneity test, where there are several sequences , , , , , generated either by a single source or by two different ones, and two corresponding hypotheses. We also consider the more general case where there are more than two different sets of sequences.
ii) Classification problems, where there are samples , , , , generated by two different (but unknown) sources and is generated by one of the two. The goal is to determine which of them generated .
iii) Estimation of a so-called measurement of interdependence, or the association.
III The main theorems
First we present a theorem which can be considered as a theoretical basis for the application of data compressors to the problems described above. For two words $v = v_1 \ldots v_k$ and $w = w_1 \ldots w_l$ we define the following value:
$$\varphi(v/w) = |\varphi(wv)| - |\varphi(w)|, \qquad (1)$$
where $k, l$ are integers. Informally, it is the length of the codeword for $v$ if it is encoded with the word $w$.
Theorem 1.
Let and be stationary ergodic sources generating letters from a finite alphabet and let their memory be upper-bounded by a certain constant . Suppose that is generated by , whereas and are generated by in such a way that all three sequences are independent, and let be a universal code. Define
[TABLE]
[TABLE]
where is defined in (1). Then there exists such a constant and an integer that for any there is such an integer that, for any , ,
[TABLE]
and, if then , otherwise . Here is the expectation with respect to the measure .
The proof is given in Appendix 2, whereas here we give some explanations of the theorem. The theorem compares two cases: the word is compressed either together with or with . The word is generated by , hence, it should be compressed better with and the value should be less than . The theorem shows, that, indeed, the word is compressed better together with if the lengths of sequences , and are sufficiently large.
Let there be two sets of sequences: , , , , and , , , generated by possibly different measures and . We consider two hypotheses and and our goal is to develop a statistical test for them using the sets and . First we give an informal description of the suggested test, which will be based on data compression. Combine a half of the sequences from the set into (say, , , … , ) and half of (say and , , )) into . Then compress all other sequences using a universal code along with and . If is true, the values and should be evenly mixed. Otherwise, if is true, then, on average, the numbers from the second set should be larger than those from the first one, because the probability distributions of the sequences from and are different, whereas the probability distributions of the sequences and are the same (here and below is the number of elements in ).
The test. A more formal description of the suggested test is as follows: i) Denote the set by and by and calculate for any from the values
[TABLE]
and for any from
[TABLE]
Define
[TABLE]
[TABLE]
ii) Apply the test of the independence for the table to
[TABLE]
A detailed analysis of this problem is carried out in [7], part 33. In particular, there is a description of efficient tests for homogeneity problem for the table (see part 33.22). We denote this test as , where is the level of significance. Note, that there are some requirements for values of which should be valid if is used (see [7]).
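The informal description of the test can be sketched in code under several assumptions that are not taken from the paper: `zlib` stands in for the universal code, each held-out sequence contributes one count to a 2×2 table according to which pooled half it compresses better with, and Fisher's exact test is used as the table test. The statistics and the table used by the authors may differ; all function names and the toy vocabularies are hypothetical.

```python
import random
import zlib
from math import comb

def code_length(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def conditional_length(target: bytes, context: bytes) -> int:
    """Extra bytes needed for `target` when compressed after `context`."""
    return code_length(context + target) - code_length(context)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    def prob(k):  # hypergeometric probability of k in the top-left cell
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)
    low, high = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    tail = sum(prob(k) for k in range(low, high + 1)
               if prob(k) <= observed * (1 + 1e-9))
    return min(1.0, tail)

def homogeneity_table(set_x, set_y):
    """Assumed table: rows = origin set, columns = which pooled half fits better."""
    half_x, half_y = len(set_x) // 2, len(set_y) // 2
    pooled_x = b"".join(set_x[:half_x])
    pooled_y = b"".join(set_y[:half_y])
    table = [[0, 0], [0, 0]]
    for row, held_out in enumerate((set_x[half_x:], set_y[half_y:])):
        for seq in held_out:
            closer_to_x = (conditional_length(seq, pooled_x)
                           < conditional_length(seq, pooled_y))
            table[row][0 if closer_to_x else 1] += 1
    return table

def sample_text(vocabulary, n_words, seed):
    rng = random.Random(seed)
    return (" ".join(rng.choice(vocabulary) for _ in range(n_words)) + " ").encode()

# Hypothetical toy sources.
ENGLISH = ["the", "house", "water", "river", "morning", "people",
           "between", "always", "garden", "window", "bread", "yellow"]
GERMAN = ["der", "haus", "wasser", "fluss", "morgen", "leute",
          "zwischen", "immer", "garten", "fenster", "brot", "gelb"]

set_x = [sample_text(ENGLISH, 60, seed=i) for i in range(16)]
set_y = [sample_text(GERMAN, 60, seed=100 + i) for i in range(16)]

table = homogeneity_table(set_x, set_y)
p_value = fisher_exact_two_sided(table[0][0], table[0][1], table[1][0], table[1][1])
print(table, p_value)
```

A small p-value leads to rejecting homogeneity at the chosen significance level. Fisher's exact test is used here only because it needs no external library; any table test whose sample-size requirements are met could be substituted.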
Theorem 2.
Let and be stationary ergodic measures whose memory is finite (but, possibly, unknown). If the above described test is applied for testing against along with the test and the requirements of for values of are valid, then for any code the Type I error is not greater than .
If is a universal code, and go to infinity and the lengths of all sequences from and go to infinity in such a way that for all sequences , the ratios and are upper-bounded by a certain constant, then, with probability 1, the Type II error goes to 0.
The proof is given in Appendix 2, but here we give some comments. First, note that there are many tests of homogeneity for tables and, in principle, any of them can be used. That is why, we do not describe the test on sizes of . Second, the described method can be easily extended from the two-sample problem to the -sample problem, . Namely, let there be sets , of sequences generated by stationary ergodic sources . Let there be the hypotheses and . In this case we carry out calculations based on the scheme described above in order to obtain a so-called table and then apply a test for homogeneity, see ([7]).
Measurement of the interdependence and association. If the hypothesis of homogeneity is rejected, it is natural to measure interdependence. We suggest to measure interdependence between two sets of sequences and (and the corresponding sources) based on the above described tables. (In the case of the -sample problem, , the measures will be based on the table.) This problem is well-investigated in the mathematical statistics, see, for example, ([7], part 33). That is why, we mention such measures only briefly. For tables we mention the coefficient of association, , defined by the equation and the coefficient see, ([7], part 33). It is important to note that there are well-known methods of building standard errors and confidence interval for and ([7], part 33).
Let there be sequences , , , generated by stationary ergodic sources , , correspondingly, where . There is a new sequence , , and it is known beforehand that it is generated by one of the sources from , . The problem of classification is to determine which source generated the sequence . By definition, a method of classification is called asymptotically consistent if, with probability 1, the method finds which generated the sequence when goes to infinity.
Method of classification. We suggest the following method of classification: decide that the sequence was generated by the source for which
[TABLE]
Theorem 3.
Let , be stationary ergodic measures whose memories are finite (but, possibly, unknown). If is a universal code and the lengths of all sequences go to infinity in such a way that
[TABLE]
then the described method is asymptotically consistent.
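The classification rule can be illustrated with a short sketch. It assumes that the criterion amounts to choosing the training sample after which the new sequence has the shortest conditional code length (the proof of Theorem 3 speaks of a minimal value), and it uses `zlib` in place of a universal code. The function names and the three toy vocabularies are hypothetical.

```python
import random
import zlib

def code_length(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def conditional_length(target: bytes, context: bytes) -> int:
    """Extra bytes needed for `target` when compressed after `context`."""
    return code_length(context + target) - code_length(context)

def classify(sequence: bytes, training: dict) -> str:
    """Return the label whose training sample yields the shortest conditional code."""
    return min(training,
               key=lambda label: conditional_length(sequence, training[label]))

def sample_text(vocabulary, n_words, seed):
    rng = random.Random(seed)
    return (" ".join(rng.choice(vocabulary) for _ in range(n_words)) + " ").encode()

# Hypothetical toy sources standing in for the unknown stationary ergodic sources.
VOCABULARIES = {
    "english": ["the", "house", "water", "river", "morning", "people",
                "between", "always", "garden", "window", "bread", "yellow"],
    "german": ["der", "haus", "wasser", "fluss", "morgen", "leute",
               "zwischen", "immer", "garten", "fenster", "brot", "gelb"],
    "italian": ["il", "casa", "acqua", "fiume", "mattina", "gente",
                "sempre", "giardino", "finestra", "pane", "giallo", "strada"],
}

training = {label: sample_text(words, 400, seed=i)
            for i, (label, words) in enumerate(VOCABULARIES.items())}

query_english = sample_text(VOCABULARIES["english"], 100, seed=50)
query_german = sample_text(VOCABULARIES["german"], 100, seed=51)
print(classify(query_english, training), classify(query_german, training))
```

Because the same query is compared against every class, dividing the conditional length by the query length would not change the selected label.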
IV Appendix 1
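Let $\mu$ be a stationary ergodic source generating letters from a finite alphabet $A$. (Definitions can be found in [5].) The order-$m$ (conditional) Shannon entropy and the limit Shannon entropy are defined as follows:
$$h_m(\mu) = -\sum_{v \in A^m} \mu(v) \sum_{a \in A} \mu(a \mid v) \log \mu(a \mid v),$$
$$h_\infty(\mu) = \lim_{m \to \infty} h_m(\mu).$$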
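For intuition, the order-$m$ conditional entropy can be estimated from a single observed sequence by plugging empirical frequencies into the standard definition $-\sum_{v} \hat\mu(v) \sum_{a} \hat\mu(a \mid v) \log_2 \hat\mu(a \mid v)$. The sketch below assumes this plug-in estimator, which is not part of the paper; the function name is illustrative.

```python
from collections import Counter
from math import log2

def empirical_conditional_entropy(sequence: str, m: int) -> float:
    """Plug-in estimate (bits per letter) of the order-m conditional entropy."""
    context_counts = Counter()   # occurrences of each length-m context
    pair_counts = Counter()      # occurrences of (context, next letter)
    for i in range(m, len(sequence)):
        context = sequence[i - m:i]
        context_counts[context] += 1
        pair_counts[(context, sequence[i])] += 1
    total = sum(context_counts.values())
    entropy = 0.0
    for (context, _letter), count in pair_counts.items():
        # weight = empirical P(context, letter); log term = empirical P(letter | context)
        entropy -= (count / total) * log2(count / context_counts[context])
    return entropy

periodic = "ab" * 100
print(empirical_conditional_entropy(periodic, 0))  # letters look like fair coin flips
print(empirical_conditional_entropy(periodic, 1))  # next letter is determined by the previous one
```

The periodic sequence shows why the order matters: with no context each letter carries one bit, whereas one letter of context removes all uncertainty.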
It is known that for any integer
[TABLE]
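see [5]. Now we define codes. Let $A^\infty$ be the set of all infinite words $x_1 x_2 \ldots$ over the alphabet $A$. A data compression method (or code) $\varphi$ is defined as a set of mappings $\varphi_n$ such that $\varphi_n : A^n \to \{0,1\}^*$, $n = 1, 2, \ldots$, and for each pair of different words $x, y \in A^n$, $\varphi_n(x) \neq \varphi_n(y)$. Informally, this means that the code $\varphi$ can be applied to compress each message of any length $n$ over the alphabet $A$, and the message can be decoded if its code is known. It is also required that each sequence $\varphi_n(x^1) \varphi_n(x^2) \ldots \varphi_n(x^r)$, $r \geq 1$, of encoded words from the set $A^n$, $n \geq 1$, can be uniquely decoded into $x^1 x^2 \ldots x^r$. Such codes are called uniquely decodable. For example, let $A = \{a, b\}$; the code $\psi_1(a) = 0$, $\psi_1(b) = 00$ is obviously not uniquely decodable. It is well known that if a code $\varphi$ is uniquely decodable, then the lengths of the codewords satisfy the following inequality (Kraft inequality): $\sum_{u \in A^n} 2^{-|\varphi_n(u)|} \leq 1$; see, e.g., [5]. Moreover, if the sum is less than 1, there exists a code $\varphi'$ such that i) $\sum_{u \in A^n} 2^{-|\varphi'_n(u)|} = 1$ and ii) $|\varphi'_n(u)| \leq |\varphi_n(u)|$ for any word $u \in A^n$ [5]. (Informally, this means that $\varphi'$ compresses better.) So, we can consider only codes for which
$$\sum_{u \in A^n} 2^{-|\varphi_n(u)|} = 1. \qquad (11)$$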
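The role of the Kraft sum can be checked on small examples. The following sketch computes the sum exactly for two hypothetical binary prefix codes, which are chosen for illustration and do not appear in the paper: one whose sum is below 1, and one obtained from it by shortening a codeword so that the sum equals 1.

```python
from fractions import Fraction

def kraft_sum(codewords):
    """Sum of 2^(-length) over binary codewords, computed exactly."""
    return sum(Fraction(1, 2 ** len(word)) for word in codewords)

incomplete = ["0", "10", "110"]   # prefix code that leaves codeword space unused
complete = ["0", "10", "11"]      # same code with the last codeword shortened

print(kraft_sum(incomplete), kraft_sum(complete))
```

The second code has codewords no longer than those of the first, which is the sense in which a code with Kraft sum equal to 1 compresses at least as well.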
We will use the so-called Kullback-Leibler (KL) divergence, which is defined by
$$D(P \,\|\, Q) = \sum_{a \in A} P(a) \log \frac{P(a)}{Q(a)}, \qquad (12)$$
where $P$ and $Q$ are probability distributions over the alphabet $A$. It is known that for any distributions $P$ and $Q$ the KL divergence is nonnegative and equals 0 if and only if $P(a) = Q(a)$ for all $a \in A$.
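A small numerical check of the two properties just stated (nonnegativity, and zero only for identical distributions) is sketched below. It assumes the standard definition of the KL divergence with base-2 logarithms and distributions given as dictionaries; the function name and the example distributions are illustrative.

```python
from math import log2

def kl_divergence(p: dict, q: dict) -> float:
    """D(P||Q) in bits; assumes q[a] > 0 wherever p[a] > 0."""
    total = 0.0
    for letter, p_a in p.items():
        if p_a > 0:                       # terms with P(a) = 0 contribute nothing
            total += p_a * log2(p_a / q[letter])
    return total

P = {"a": 0.5, "b": 0.5}
Q = {"a": 0.25, "b": 0.75}

print(kl_divergence(P, Q))   # positive, since P and Q differ
print(kl_divergence(P, P))   # zero for identical distributions
```

Note that the divergence is not symmetric: `kl_divergence(P, Q)` and `kl_divergence(Q, P)` generally differ.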
Let us describe universal codes, or data compressors. (A detailed description can be found, for example, in [15].) First we note that (as is known in Information Theory) sequences generated by a source can be ”compressed” on average till the limit Shannon entropy and, on the other hand, there is no code for which the expected codeword length is less than ; see [5]. As defined above, a code is universal if it compresses till this limit. We will consider universal codes which can be applied to any word , , from a certain alphabet and such that the following natural property is valid: for any words ,
[TABLE]
For any universal code we define a measure as follows:
[TABLE]
where is a word over . It is known that for any
[TABLE]
(see the Kraft inequality and (11)).
Now we consider an application of universal codes (or data compressors) which is the main tool for solving the problems mentioned in Introduction. From (1) and (14) we immediately obtain that
[TABLE]
and the right part is defined correctly, see (13). The following property of universal codes will play an important role in what follows. Let be a stationary ergodic source generating letters from an alphabet and be a universal code. For any integer and with probability 1
[TABLE]
Note that this equation shows that the code estimates the (unknown) probability precisely, where grows.
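How closely real compressors approach the entropy limit can be examined empirically. The sketch below is an illustration under stated assumptions, not an experiment from the paper: it draws a sequence from a memoryless binary source with a hypothetical parameter `p = 0.1`, stores one letter per byte, and compares the compressed size in bits per letter with the source entropy for three standard-library compressors.

```python
import bz2
import lzma
import random
import zlib
from math import log2

def bits_per_symbol(compress, data: bytes) -> float:
    """Compressed size in bits divided by the number of source letters."""
    return 8 * len(compress(data)) / len(data)

rng = random.Random(0)
p = 0.1            # probability of the letter 'b' (assumed toy source)
n = 50_000         # sequence length
data = bytes(ord("b") if rng.random() < p else ord("a") for _ in range(n))

# Shannon entropy of the source in bits per letter (about 0.469 for p = 0.1).
entropy = -(p * log2(p) + (1 - p) * log2(1 - p))

rates = {name: bits_per_symbol(fn, data)
         for name, fn in [("zlib", zlib.compress),
                          ("bz2", bz2.compress),
                          ("lzma", lzma.compress)]}
print(entropy, rates)
```

Each rate stays above the entropy, and the gap is the redundancy of that compressor on this sample. General-purpose compressors carry headers and use finite windows, so the asymptotic statement holds for them only approximately.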
V Appendix 2: proofs
Proof of Theorem 1. First we prove the following
Claim.
[TABLE]
[TABLE]
where is an upper bound of memories of and , and is such a constant that if and otherwise.
Proof of the claim. The left side can be presented as follows:
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
It is supposed that the memory of and is upper-bounded by . Hence, by definition, it means that where . From this equation and the previous one we obtain the following equation:
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
For any function and any measure
[TABLE]
[TABLE]
[TABLE]
where and are integers. Having taken into account this equation, stationarity and , equations (V), (V) and (V), we obtain
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
From properties of K-L divergence (see (12) ) we can see that if and, obviously, , if . The claim is proven.
Let us proceed with proof of the theorem. Having taken into account the definitions (16) and (1), we obtain that
[TABLE]
[TABLE]
[TABLE]
[TABLE]
The following equation is obvious:
[TABLE]
[TABLE]
[TABLE]
[TABLE]
The first term is estimated in the claim, see (V), whereas the second and the third terms can be estimated based on (17). So, from the claim and (17), we can see that, with probability 1, The theorem is proven.
Proof of Theorem 2. First we consider the case where is true. It means that the sequences from and obey the same distribution. Hence, (4) and (5) have the same distribution, too, and the above mentioned test from [7], part 33, can be applied. Now we consider the case where is true. In this case the length of any sequence grows, so the length will be greater than from Theorem 1. The number of sequences grows to infinity and the total length of a half of them goes to infinity in such a way that for any integer the total length will be greater than the sum from Theorem 1. From this theorem we can see that and goes to 0 and, hence, the Type II error goes to 0.
Proof of Theorem 3. Suppose that the sequence was generated by . Then, we can see from Theorem 1 that, with probability 1, the value grows as , , if , (i.e. is generated by ). On the other hand, Hence, is minimal when (i.e., is generated by ). The theorem is proven.
References
[1] Bell, T., Witten, I. H., Cleary, J. G., 1989. Modeling for text compression. ACM Computing Surveys (CSUR) 21 (4), 557–591.
[2] Burrows, M., Wheeler, D. J., 1994. A block-sorting lossless data compression algorithm.
[3] Cilibrasi, R., Vitányi, P., 2005. Clustering by compression. IEEE Transactions on Information Theory 51, 1523–1545.
[4] Cilibrasi, R., Vitányi, P., De Wolf, R., 2004. Algorithmic clustering of music based on string compression. Computer Music Journal 28 (4), 49–67.
[5] Cover, T. M., Thomas, J. A., 2006. Elements of information theory. Wiley-Interscience, New York, NY, USA.
[6] Ferragina, P., Giancarlo, R., Greco, V., Manzini, G., Valiente, G., 2007. Compression-based classification of biological sequences and structures via the universal similarity metric: experimental assessment. BMC Bioinformatics 8 (1), 1.
[7] Kendall, M., Stuart, A., 1961. The advanced theory of statistics; Vol. 2: Inference and relationship. London.
[8] Khmelev, D. V., Teahan, W. J., 2003. A repetition based measure for verification of text collections and for text categorization. In: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, pp. 104–110.
