TL;DR
This paper introduces a nearly linear time algorithm for approximating the profile maximum likelihood (PML) distribution, enabling efficient universal estimation of symmetric properties of distributions with broad applications.
Contribution
It provides the first polynomial-time algorithm for approximate PML computation, facilitating universal symmetric property estimation in nearly linear time.
Findings
Algorithm computes approximate PML within exponential multiplicative error.
Enables universal plug-in estimators for all symmetric functions with high accuracy.
Extends to polynomial-time algorithms for multi-dimensional PML for symmetric relationships.
Efficient Profile Maximum Likelihood for
Universal Symmetric Property Estimation
Moses Charikar
Stanford University
Supported by NSF grant CCF-1617577, a Simons Investigator Award and a Google Faculty Research Award.
Kirankumar Shiragur
Stanford University
Aaron Sidford
Stanford University
Supported by NSF CAREER Award CCF-1844855.
Abstract
Estimating symmetric properties of a distribution, e.g. support size, coverage, entropy, distance to uniformity, is among the most fundamental problems in algorithmic statistics. While each of these properties has been studied extensively and separate optimal estimators are known for each, in striking recent work, Acharya et al. [ADOS16] showed that there is a single estimator that is competitive for all symmetric properties. This work proved that computing the distribution that approximately maximizes profile likelihood (PML), i.e. the probability of observed frequency of frequencies, and returning the value of the property on this distribution is sample competitive with respect to a broad class of estimators of symmetric properties. Further, they showed that even computing an approximation of the PML suffices to achieve such a universal plug-in estimator. Unfortunately, prior to this work there was no known polynomial time algorithm to compute an approximate PML and it was open to obtain a polynomial time universal plug-in estimator through the use of approximate PML.
In this paper we provide a algorithm (in number of samples) that, given samples from a distribution, computes an approximate PML distribution up to a multiplicative error of in time nearly linear in . Generalizing work of [ADOS16] on the utility of approximate PML we show that our algorithm provides a nearly linear time universal plug-in estimator for all symmetric functions up to accuracy . Further, we show how to extend our work to provide efficient polynomial-time algorithms for computing a -dimensional generalization of PML (for constant ) that allows for universal plug-in estimation of symmetric relationships between distributions.
1 Introduction
Estimating a symmetric property of a distribution given a small number of samples is a fundamental problem in algorithmic statistics. Formally, a property is symmetric if it is invariant to permutation of the labels, i.e. it is a function only of the multiset of probabilities and does not depend on the symbol labels. For many natural properties, including support size, coverage, distance from uniform and entropy, there has been extensive work that has led to designing efficient estimators both with respect to computational time and sample complexity [HJWW17, HJM17, AOST14, RVZ17, ZVV+16, WY16b, RRSS07, WY15, OSW16, VV11b, WY16a, JVHW15, JHW16, VV11a]. In many cases these estimators are tailored to the particular property of interest. This paper is motivated by the goals of unifying the development of efficient estimators of symmetric properties of distributions and designing a single efficient universal algorithm for estimating arbitrary symmetric properties of distributions.
Our approach stems from the observation that a sufficient statistic for the problem of estimating a symmetric property from a sequence of samples is the profile of the sequence, i.e. the multiset of the frequencies (i.e. multiplicities) of symbols in the sequence, e.g. the profile of is . Profiles are also called histograms of histograms, histogram order statistics, or fingerprints. Our approach to obtaining a universal estimator is based on the elegant problem of profile maximum likelihood (PML) introduced by Orlitsky et al. [OSS+04]: Given a sequence of samples, find the distribution that maximizes the probability of the observed profile. This problem has been studied in several papers since, applying heuristic approaches such as Bethe approximation [Von12, Von14], the EM algorithm [OSS+04], and some algebraic approaches [ADM+10] to calculate the PML. Recently Pavlichin, Jiao and Weissman [PJW17] introduced an efficient dynamic programming heuristic for PML that can be computed in linear time. While there are no approximation guarantees for the solution they produce, their approach was the initial impetus for our work.
A recent paper of Acharya et al. [ADOS16] showed that a distribution that optimizes the PML objective can be used to obtain a plug-in estimator for various symmetric properties of distributions. In fact it suffices to compute a distribution that approximates the PML objective to within a factor for constant where is the size of the sample. Unfortunately, no polynomial time computable PML estimator with such an approximation guarantee was known previously. In this paper, we provide an estimator with an approximation factor of , leading to a universal estimator for a host of symmetric properties. Moreover, our estimator is computable in time nearly linear in . Our techniques extend to computing a -dimensional generalization of PML, where we have access to samples from multiple distributions on a common domain. This allows for universal plug-in estimation of various symmetric relationships between multiple distributions.
1.1 Overview of approach
The bulk of our work is dedicated to finding a distribution that approximately maximizes the PML objective within an factor for a constant . We call such a distribution an approximate PML distribution. Given a sequence and its corresponding profile , the PML optimization problem is a maximization problem over all distributions . The objective function of the PML optimization problem is the probability of observing profile with respect to a distribution , which in turn is equal to the summation of probabilities of sequences (with respect to p) that have as their corresponding profile. The distribution that maximizes this objective is called a profile maximum likelihood (PML) distribution. (See Section 2 for formal definitions.)
To efficiently compute an approximate PML distribution, we first restrict ourselves to maximizing the PML objective for a discretized version of the profile over a class of distributions we call discrete pseudo-distributions (See Section 4). Here, the probability values of the distribution are restricted to belong to a small set P of permissible values (See Section 4.1), and the frequencies in the profile are similarly restricted to belong to a small set M (See Section 4.2). We call the resulting maximizing distribution a discrete PML (DPML) distribution and the corresponding optimization problem DPML optimization (See Section 4.3).
There are two main features of the DPML optimization problem. Firstly, the maximizing distribution DPML is an approximate PML distribution with an approximation guarantee that we can control (as a function of the sizes of P and M). Secondly, the DPML optimization problem has a simpler equivalent formulation, in which sequences that have the same associated probability value with respect to a discrete pseudo-distribution are combined together into sub groups and the whole summation is written as a summation over a small number of subgroups. The number of these subgroups is a function of the sizes of P and M which we control (See Section 4.3 for both these results).
As an illustration of DPML, consider the profile and a probability distribution on 5 elements: two with a value of and three with a value of . Note that the probability values come from the set . One way to get the profile is to have an element of probability appear twice and two elements of probability appear once. There are choices of such elements and for each such choice, sequences of length 4 with these elements. The probability of any such sequence is the same: . We consider the set of all these sequences as one subgroup. Different subgroups are identified by specifying, for each permissible probability value, the frequencies with which elements of that probability value are seen in the sample. The DPML objective then sums up the contributions of each such subgroup.
Reformulating the problem in terms of summation over a small number of subgroups is crucial to our approach. It allows us to focus on the subgroup that gives the largest contribution to the objective instead of summing over all the subgroups. We call the optimization problem that optimizes the contribution of a single subgroup (instead of summing over all terms) single discrete PML (SDPML). We show that the SDPML optimization problem has a convex relaxation and can be solved efficiently. Since there are only a small number of subgroups in the summation, the discrete pseudo-distribution that optimizes over just one subgroup has an objective function value that is lower by at most a multiplicative factor equal to the number of subgroups. Hence the maximizing discrete pseudo-distribution for this new objective function approximately optimizes the earlier objectives (PML and DPML) with bounded loss (See Section 4.3).
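The bounded loss from keeping only the largest subgroup follows from an elementary inequality. As a sketch, write the contributions of the subgroups under a pseudo-distribution q as nonnegative terms t_1(q), ..., t_k(q); the symbols t_i, k, q^* and q' are introduced here only for illustration and are not the paper's notation. With q^* a maximizer of the sum and q' a maximizer of the largest single term:

```latex
\max_{i} t_i(\mathbf{q}) \;\le\; \sum_{i=1}^{k} t_i(\mathbf{q}) \;\le\; k \cdot \max_{i} t_i(\mathbf{q})
\qquad\Longrightarrow\qquad
\sum_{i=1}^{k} t_i(\mathbf{q}') \;\ge\; \max_{i} t_i(\mathbf{q}') \;\ge\; \max_{i} t_i(\mathbf{q}^{*}) \;\ge\; \frac{1}{k}\sum_{i=1}^{k} t_i(\mathbf{q}^{*}).
```

So optimizing the single largest term loses at most a factor k in the summed objective, which is small whenever the number of subgroups is small.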
Ultimately, our algorithm first solves this convex relaxation to the SDPML optimization problem to obtain a fractional solution (in some representation space of these discrete pseudo-distributions) (See Section 4.4). Then we apply a rounding algorithm that finds a distribution which maintains the approximation guarantee needed to obtain an approximate PML distribution (See Section 4.5).
1.2 Related work
As discussed in the introduction, PML was introduced by Orlitsky et al. [OSS+04] in 2004. Many heuristic approaches such as Bethe approximation [Von12, Von14], the EM algorithm [OSS+04], algebraic approaches [ADM+10] and a dynamic programming approach [PJW17] have been proposed to calculate the approximate PML.
The connection between PML and universal estimators was first studied in [ADOS16]. There have been several other approaches for designing universal estimators for symmetric properties. Valiant and Valiant [VV11b] adopted and rigorously analyzed a linear programming based approach for universal estimators proposed by [ET76] and showed that it is sample complexity optimal in the constant error regime for estimating certain symmetric properties (namely, entropy, support size, support coverage, and distance to uniformity). Recent work of Han, Jiao and Weissman [HJW18] applied a local moment matching based approach in designing efficient universal symmetric property estimators for a single distribution. [HJW18] achieves the optimal sample complexity in all error regimes for estimating the power sum function, support and entropy.
Estimating symmetric properties of a distribution is a rich field and extensive work has been dedicated to studying the optimal sample complexity of estimating each of these properties. Optimal sample complexities for estimating many symmetric properties were resolved in the past few years, including all the properties studied here: support [VV11b, WY15], support coverage [OSW16, ZVV+16], entropy [VV11b, WY16a] and distance from uniform [VV11a, JHW16].
Symmetric properties for distribution pairs have been studied in the literature as well. For instance, the optimal sample complexity for estimation of KL divergence between two distributions was given by [BZLV16, HJW16].
1.3 Paper organization
The rest of the paper is structured as follows. In Section 2, we provide definitions and notations. In Section 3, we state our main results of the paper. Our main contribution is to provide an algorithm that efficiently computes an approximate PML and in Section 4 we prove this result. In this section, we also present an almost linear time algorithm based on cutting plane methods for solving our convex relaxation to SDPML; however we defer all of its analysis to the appendix. Finally, in Section 5, we provide the connection between approximate PML distribution and a universal estimator for symmetric property estimation. The proof presented in [ADOS16] showed this connection for an -approximate PML estimator and we show it for an -approximate PML estimator. However it is easy to see the proof presented in [ADOS16] works for any -approximate PML estimator for constant . In Appendix E we show that the techniques presented here generalize to a higher dimensional version of PML.
2 Preliminaries
Let and denote the interval of integers and reals and respectively and let . Let be the set of all distributions supported on domain and let be the size of the domain. We use the word distribution to refer to discrete distributions. Throughout this paper we assume that we receive a sequence of independent samples from an underlying distribution . Let be the set of all length sequences and be one such sequence with denoting its th element. The probability of observing sequence is:
[TABLE]
where is the frequency (multiplicity) of symbol in sequence and is the probability of domain element .
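Because the samples are independent, the probability of a sequence is a product over domain elements of each element's probability raised to its frequency. A minimal Python sketch of that product follows; the function name `sequence_probability` and the dictionary representation of the distribution are illustrative assumptions, not notation from the paper.

```python
from collections import Counter
from math import prod


def sequence_probability(p, seq):
    """Probability of an i.i.d. sequence `seq` under distribution `p`.

    `p` maps each domain symbol to its probability (hypothetical representation).
    The result is the product over symbols of p[symbol] ** frequency.
    """
    freq = Counter(seq)  # frequency (multiplicity) of each symbol in the sequence
    return prod(p.get(sym, 0.0) ** f for sym, f in freq.items())


if __name__ == "__main__":
    uniform = {"a": 0.5, "b": 0.5}
    print(sequence_probability(uniform, "aab"))
```

The value depends on the sequence only through its symbol frequencies, which is what makes the type (and, for symmetric quantities, the profile) a natural summary.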
We extend and use the definition for to any vector by letting . Further, for functions of probability distributions p, we assume those expressions are also defined for any vector just by replacing by everywhere.
For any given sequence one could define its type (histogram) and profile (histogram of a histogram or fingerprint) that are sufficient statistics for symmetric property estimation. The histogram of histogram perspective comes from viewing type as a histogram and profile as histogram of type.
Definition 2.1 (Type).
A type of a sequence is the vector of frequencies of domain elements in . We call the length of type and use to represent the set of all types of length .
To simplify notation we use just to denote type and the associated sequence will be clear from context. For a distribution , the probability of a type is:
[TABLE]
where and .
Definition 2.2 (Profile).
For any sequence , let be the set of all its distinct frequencies and be elements of the set D. The profile of a sequence denoted is where is the number of domain elements with frequency in . We call the length of profile and as a function of profile , . We let denote the set of all profiles of length . (Footnote 1: The number of unseen domain elements is not part of the profile, because the domain size is unknown.)
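The two definitions above can be illustrated with a short Python sketch: the type is a histogram of the sequence, and the profile is a histogram of the type. The function names and the dictionary representation are illustrative assumptions; unseen symbols (frequency zero) are deliberately not counted.

```python
from collections import Counter


def sequence_type(seq):
    """Type of a sequence: the frequency of each symbol that appears in it."""
    return Counter(seq)


def profile(seq):
    """Profile of a sequence: maps each distinct frequency to the number of
    symbols that occur with exactly that frequency (histogram of the type)."""
    freq_of_freq = Counter(sequence_type(seq).values())
    return dict(sorted(freq_of_freq.items()))


if __name__ == "__main__":
    print(sequence_type("abaac"))  # symbol -> frequency
    print(profile("abaac"))        # frequency -> number of symbols
```

Relabeling the symbols changes the type but not the profile, which is why the profile suffices for symmetric properties.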
For any distribution , the probability of a profile is defined as:
[TABLE]
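As described in Section 1.1, the probability of a profile is the sum of the probabilities of all sequences that have that profile. For very small domains and sample sizes this can be evaluated by brute force, as in the sketch below. The names are hypothetical, the profile is represented as the sorted tuple of nonzero frequencies, and the enumeration is exponential in the sequence length, so this is for intuition only and is not the paper's algorithm.

```python
from collections import Counter
from itertools import product
from math import prod


def profile_of(seq):
    """Profile as a sorted tuple of the nonzero symbol frequencies."""
    return tuple(sorted(Counter(seq).values()))


def profile_probability(p, phi, n):
    """Sum of probabilities of all length-n sequences whose profile equals phi.

    `p` is a list of probabilities indexed by domain element (assumed representation).
    """
    total = 0.0
    for seq in product(range(len(p)), repeat=n):  # all |domain|^n sequences
        if profile_of(seq) == phi:
            total += prod(p[s] for s in seq)
    return total


if __name__ == "__main__":
    print(profile_probability([0.5, 0.5], (1, 2), 3))
```

Maximizing this quantity over distributions is the PML problem; the brute-force sum makes clear why a direct approach does not scale.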
One can also define the profile of a type . We overload notation and use to denote the profile associated with type and .
For future use, we also write the probability of a profile in terms of its types. All types with have the same value and we use notation to represent this quantity. The explicit expression for is written below:
[TABLE]
We next derive an expression for the probability of a profile in terms of its types:
[TABLE]
The distribution which maximizes the probability of a profile is called a profile maximum likelihood distribution.
Definition 2.3 (Profile maximum likelihood).
For any profile , a profile maximum likelihood (PML) distribution is:
[TABLE]
and is the maximum PML objective value.
The central goal of this paper is to define efficient algorithms for computing approximate PML distributions defined as follows.
Definition 2.4 (Approximate PML).
For any profile , a distribution is a -approximate PML distribution if
[TABLE]
Throughout this paper we use the phrase approximate PML to denote a -approximate PML distribution for some non-trivial .
2.1 Representation of a profile
For any profile , we represent using the set of tuples, where a tuple denotes that number of domain elements have frequency in the sequence. We use to denote the size of profile in this representation. It is not hard to see that for any length profile . Further it takes time to write the profile in this representation.
For all our algorithmic results, when we are given a profile, we assume the above representation. We will explicitly state running times when we start with a sequence instead of a profile.
3 Results
Here we state the main results of this paper. Our first main theorem provides an algorithm to efficiently compute an approximate PML distribution. The approximation guarantee in this result depends on the running time, and we can achieve sub-linear running times (in the size of the sample) if we allow weaker approximation guarantees.
Theorem 3.1 (Efficient and approximate PML distribution).
Given a profile , let be its corresponding PML distribution. There is an algorithm that for any , computes an -approximate PML distribution , i.e.
[TABLE]
in time. Using this running time simplifies to .
In the above result, the best approximation is achieved for and we get an -approximate PML distribution in nearly linear time (in the number of samples). This result is summarized below.
Corollary 3.2 (Nearly linear time - approximate PML distribution).
Let be a sequence and be its corresponding profile. There is an algorithm that computes an -approximate PML distribution in time .
This result constitutes the first polynomial time algorithm to compute an -approximate PML for any constant . In the corollary above we start with a sequence instead of a profile; in this case our algorithm still runs in because we only need time to compute the profile of a sequence in the representation discussed in Section 2.1.
Our next result relates an approximate PML distribution to a universal plug-in estimator that is sample complexity optimal for support size, coverage, entropy and distance from uniform. In Section 5, we prove this result. However it is easy to see the proof presented in Section 5 proves a more general result that approximate PML is sample complexity optimal for a broad class of symmetric properties satisfying certain conditions. One such set of conditions (informally) is the existence of an estimator for with the following properties: the estimator is sample complexity optimal, the estimator has low bias, and the output of the estimator does not change by much when we change any individual sample. This result was already shown in [ADOS16] for an -approximate PML distribution. Using the same proof with slight modifications we get the following result.
Theorem 3.3 (Universal estimator using approximate PML).
Let be the optimal sample complexity of estimating entropy, support, support coverage and distance to uniformity and be a large positive constant. Let for any constant , then for any , the -approximate PML estimator estimates entropy, support, support coverage, and distance to uniformity to an accuracy of with probability at least .
Setting in the theorem above and combining it with Corollary 3.2, we obtain the following result.
Theorem 3.4 (Efficient universal estimator using approximate PML).
Let be the optimal sample complexity of estimating entropy, support, support coverage and distance to uniformity. If , then there exists a PML based universal plug-in estimator that runs in time and is sample complexity optimal for estimating entropy, support, support coverage and distance to uniformity to accuracy .
Our techniques for PML are general and can be extended to a generalization of PML to multiple dimensions (multidimensional PML). We provide a polynomial time (in number of samples) algorithm to compute approximate PML in multiple dimensions when the number of dimensions is constant. This allows for universal plug-in estimation of various symmetric relationships between multiple distributions. We next formally define and state our main results for multidimensional PML.
3.1 Results for multidimensional PML
First we describe the multidimensional setting, then we define multidimensional PML, and then state our main results. Throughout this paper we assume the number of dimensions is constant.
Multidimensional setup:
For each , we receive a sequence that consists of independent samples drawn from an underlying distribution supported on same domain (), further is independent of other sequences for and . We call a -sequence and its -length. Let be the set of all -sequences of -length equal to n. We use to denote the probability of domain element in distribution . We also refer to as a -distribution and let denote the set of all -distributions.
For any -distribution , the probability of a -sequence is defined as:
[TABLE]
Recall that for each , is the frequency of domain element in sequence . For any -sequence , we call the -frequency of domain element in . Let be the set of all -frequencies generated by different domain elements in all possible -sequences in and we let denote its th element. We next define multidimensional generalizations of profile, PML, and approximate PML.
-Profile:
For any -sequence , we call a -profile if and is the number of domain elements with -frequency . We call n the -length of and use to denote the set of all -profiles of -length equal to n. For any -distribution , the probability of a -profile is defined as:
[TABLE]
Profile maximum likelihood:
For any -profile , a Profile Maximum Likelihood -distribution is:
[TABLE]
and is the maximum PML objective value.
Approximate profile maximum likelihood:
For any -profile , a -distribution is a -approximate PML -distribution if
[TABLE]
.
We next state our results for approximate PML -distributions. In Theorem 3.5, we give an algorithm to efficiently compute an approximate PML -distribution. Then, we substitute in this result to get Corollary 3.6.
Theorem 3.5 (Efficient and approximate multidimensional PML).
Let be a -sequence of -length . There is an algorithm that computes an -approximate PML -distribution in time. (Footnote 2: Here notation hides all terms and therefore term as well.)
Corollary 3.6 (Efficient and approximate PML for two dimensions).
For , let be a -sequence of -length . There is an algorithm that computes an -approximate PML -distribution in time.
As mentioned before, one of the important applications of approximate multidimensional PML is in estimating symmetric properties for -distributions. A symmetric property is a function of -distributions that is invariant to a permutation of the labels. Here we study one such symmetric property for called KL divergence that is studied in the context of PML. Estimation of KL divergence between two distributions is well studied and estimators that achieve optimal sample complexity were given by [BZLV16, HJW16]. In Theorem 3.7, we show that approximate PML is sample complexity optimal for estimating KL divergence. A similar result was already shown in [Ach18] (Theorem 6) for exact PML and we use the same proof with slight modification to prove our result. In Corollary 3.8, we give an efficient version of Theorem 3.7 by combining it with Corollary 3.6.
Theorem 3.7 (Optimal sample complexity for KL divergence).
Let be such that, , and let be the optimal sample complexity for estimating KL divergence between and to an accuracy . If (Footnote 3: Recall here is the size of domain .) and , then -approximate PML -distribution (for ) with is sample complexity optimal for estimating KL divergence to an accuracy .
Theorem 6 in [Ach18] also requires and a slightly weaker version of the other condition ().
Corollary 3.8 (Efficient estimator for KL divergence).
Let be such that, , and let be the optimal sample complexity for estimating KL divergence between and to an accuracy . If and , then there exists a PML based universal plug-in estimator that runs in time and is sample complexity optimal for estimating KL divergence to an accuracy .
4 Existence of Structured Approximate PML for One Dimension
Here we provide the proof for Theorem 3.1. First, we show the existence of an approximate PML distribution with a nice structure in Sections 4.1, 4.2 and 4.3. Then, we exploit this structure in Section 4.4 to give an algorithm that returns a fractional solution with running time ranging from nearly linear to sub linear depending on the desired approximation factor. Finally, in Section 4.5 we present a rounding algorithm that takes the fractional solution from the previous step as input and returns an approximate PML distribution within the desired approximation factor.
First, we show the existence of a distribution with minimum non-zero probability value that is an -approximate PML distribution.
Lemma 4.1 (Minimum probability lemma).
For any profile , there exists a distribution such that is a -approximate PML distribution and .
Proof.
See Appendix A. ∎
This lemma allows us to define a region in which our approximate PML takes all its probability values and we use this fact throughout the paper. In Section 4.1 and Section 4.2 we show how we can further simplify the problem of computing an approximate PML by discretizing the probability and the frequency spaces respectively.
4.1 Probability discretization
Let where is such that for some . P is the set representing the discretization of the probability space. Discretization introduces a technicality: the probability values may no longer sum to one. We define pseudo-distributions and discrete pseudo-distributions to handle this.
Definition 4.2 (Pseudo-distribution).
is a pseudo-distribution if and a discrete pseudo-distribution if all its entries are in P as well. We use and to denote the sets of all such pseudo-distributions and discrete pseudo-distributions respectively. (Footnote 4: As discussed in Section 2 we extend all functions of distributions as functions defined for any general vector in and therefore to pseudo-distributions as well. For convenience we refer to for any pseudo-distribution q as the “probability” of profile or PML objective value with respect to q.)
One of the important structural properties we prove here is the following: there exists a discrete pseudo-distribution q that when converted to a distribution by dividing all its entries by its norm () is an approximate PML distribution. Even stronger, the discrete pseudo-distribution q itself has value that approximates within a good factor, and converting q into a distribution by dividing by its norm can only increase this probability because . In the rest of the paper we refer to such a discrete pseudo-distribution as an approximate PML pseudo-distribution and for the earlier reason we focus on finding an approximate PML pseudo-distribution.
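The claim that normalization can only help admits a one-line justification. As a sketch, write \mathbb{P}(\mathbf{q},\phi) for the PML objective value of a pseudo-distribution q on a profile \phi of length n, and assume the norm in question is the \ell_1 norm (the notation here is chosen for illustration and may differ from the paper's). Every sequence with profile \phi has exactly n entries, so rescaling q rescales each term of the sum by the same factor:

```latex
\mathbb{P}\!\left(\frac{\mathbf{q}}{\|\mathbf{q}\|_1},\phi\right)
= \sum_{x^n :\ \mathrm{profile}(x^n)=\phi}\ \prod_{i=1}^{n}\frac{\mathbf{q}_{x_i}}{\|\mathbf{q}\|_1}
= \frac{\mathbb{P}(\mathbf{q},\phi)}{\|\mathbf{q}\|_1^{\,n}}
\;\ge\; \mathbb{P}(\mathbf{q},\phi)
\qquad\text{whenever } 0<\|\mathbf{q}\|_1\le 1 .
```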
The way we show the existence of such a discrete pseudo-distribution that is an approximate PML pseudo-distribution is by taking the PML distribution and converting it into a discrete pseudo-distribution while still preserving the PML objective value to a desired approximation factor. Our next lemma formally proves a general version of this statement. In the remainder of this paper, for notational convenience, for a scalar and set S we use the notation and to denote:
[TABLE]
Definition 4.3 (Discrete pseudo-distribution).
For any distribution , its discrete pseudo-distribution is defined as:
[TABLE]
Note that . Further, for , . We next state a result that captures the impact of discretizing the probability space.
Lemma 4.4 (Probability discretization lemma).
For any profile and distribution , its discrete pseudo-distribution satisfies:
[TABLE]
Proof.
The first inequality is immediate because for all . To show the second inequality consider any sequence ,
[TABLE]
In the inequality above we use . Now,
[TABLE]
∎
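To make the effect of probability discretization concrete, the sketch below rounds each probability down to a geometric grid. It assumes, purely for illustration, that the permissible values are the powers (1+eps)^(-i) for integers i >= 0; the paper's exact set P, including its smallest permitted value, is not reproduced here, and the function name is hypothetical. Under this assumption each rounded entry is at most the original and within a factor (1+eps) of it, so the rounded vector sums to at most one.

```python
import math


def discretize_down(p, eps):
    """Round each positive probability down to the nearest value (1+eps)^(-i), i >= 0.

    Assumption: the grid is geometric with ratio (1+eps); this is an illustrative
    stand-in for the paper's set P, not its exact definition.
    """
    q = []
    for px in p:
        if px <= 0.0:
            q.append(0.0)  # zero entries stay zero
            continue
        # smallest integer i with (1+eps)^(-i) <= px
        i = max(0, math.ceil(math.log(1.0 / px) / math.log(1.0 + eps)))
        q.append((1.0 + eps) ** (-i))
    return q


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    q = discretize_down(p, 0.1)
    print(q, sum(q))
```

Since a sequence of n samples multiplies n such entries, the per-entry factor (1+eps) compounds to at most (1+eps)^n over a whole sequence, which is the kind of multiplicative loss the lemma controls.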
4.2 Multiplicity discretization
Let be the set representing discretization of multiplicities where is such that , and as before will be carefully chosen later. Let and note the definition of M keeps all positive integers . We use to denote elements of set M and using this set M we define an analogous quantity to profile called discrete profile.
Definition 4.5 (Discrete profile).
For a sequence , its discrete profile is a profile and is defined as: , where and is the length of discrete profile with . We use to denote the set of all such discrete profiles.
Note:
As mentioned in the definition, a discrete profile is also a profile. Note that in the representation of discrete profile we might have indices with , however we have defined profiles so that there are no such zero entries. We keep these zero entries in our discrete profile for notational convenience and proof simplification. Further it only takes time to write a discrete profile from access to a profile in the representation discussed in Section 2.1.
A discrete profile is a profile of length and it corresponds to the profile of some sequences of length . One such sequence can be obtained by appending of symbols to sequence itself. The probability of with respect to a distribution p is straightforward:
[TABLE]
We next state a result that captures the impact of discretizing the multiplicity space. It is important to note that probability terms ( and ) have different summation terms and yet we show their values approximate each other.
Lemma 4.6 (Profile discretization lemma).
For any distribution , and a sequence :
[TABLE]
where and are the profile and discrete profile of respectively.
Proof.
See Appendix B. ∎
Combining both Lemma 4.4 and Lemma 4.6 we bound the impact of discretizing both probabilities and multiplicities.
Corollary 4.7 (Discretization lemma).
For any distribution , and a sequence . If is the discrete distribution of p then,
[TABLE]
where and are the profile and discrete profile of respectively.
The discretization lemma above suggests that optimizing over discrete pseudo-distributions with as input is approximately as good as optimizing over distributions with as input. This result motivates the definition of a new objective function which we introduce and study next.
4.3 Discrete PML Optimization
Here we define a new optimization problem that admits convex relaxations and further returns an approximate PML pseudo-distribution. (Footnote 5: We call a pseudo-distribution q an approximate PML pseudo-distribution if it satisfies , for some non-trivial .) First, we define a discrete profile maximum likelihood (DPML) which is just the PML objective maximized over discrete pseudo-distributions with discrete profile as input. In Corollary 4.9 we show the optimal discrete pseudo-distribution of this new objective is an approximate PML pseudo-distribution. In Lemma 4.10, we rephrase the DPML optimization problem. Finally, using this DPML reformulation, we define a new optimization problem that we call a single discrete PML (SDPML) and in Lemma 4.14, we show the maximizing discrete pseudo-distribution for the SDPML objective is an approximate PML pseudo-distribution.
Definition 4.8** (Discrete profile maximum likelihood).**
Let be any sequence, and be its profile and discrete profile respectively, a discrete profile maximum likelihood (DPML) pseudo-distribution is:
[TABLE]
and is the maximum objective value.
Corollary 4.9 (DPML is an approximate PML).
For any sequence if and are its profile and discrete profile respectively, then
[TABLE]
Proof.
Note that is a discrete pseudo-distribution. The result follows from Corollary 4.7 applied to . ∎
In a approximate sense, Corollary 4.7 suggests that working with the discrete profile and discrete pseudo-distributions is no different from working with the original profile and distribution.
In the next two lemmas we rephrase the DPML optimization problem in forms that are amenable to convex relaxation. To do this, we introduce some new notation.
- As before, let P and M be the sets representing the discretization of probabilities and frequencies respectively. Recall that we used to denote the elements of set M and we use to denote the elements of set P. Let be the vector with elements indexed from to and th element equal to . Also let be the vector with elements indexed from 0 to . Its zeroth entry (denoted by ) is equal to 0 and th entry is equal to .
- Let be a variable matrix with entries for . As in the case of vector , the second index of the variable matrix starts at 0 and not at . Here the variable counts the number of domain symbols with probability value and frequency . Further, counts the number of unseen domain symbols with probability value .
- For any vector v and set , we use to denote the length vector corresponding to the portion of vector v associated with index set .
- For a discrete profile (corresponding to sequence ), define
\[
\textbf{K}_{\phi^{\prime}} \stackrel{\mathrm{def}}{=} \{ X \in \mathbb{Z}_{+}^{b_{1}\times(b_{2}+1)} ~\Big|~ (X^{T}\mathrm{1})_{[1,b_{2}]} = \phi^{\prime}, \text{ and } \zeta^{T}X\mathrm{1} \leq 1 \}
\]
Note the constraint does not involve variables that correspond to unseen elements. These variables only appear in the constraint which ensures our output is always a pseudo-distribution.
- For a discrete profile (of ) and a discrete pseudo-distribution q, also define
\[
\textbf{K}_{\textbf{q},\phi^{\prime}} \stackrel{\mathrm{def}}{=} \{ X \in \mathbb{Z}_{+}^{b_{1}\times(b_{2}+1)} ~\Big|~ (X^{T}\mathrm{1})_{[1,b_{2}]} = \phi^{\prime}, \text{ and } X\mathrm{1} = \ell^{\textbf{q}} \}
\]
where and denote the number of domain elements with probability value in pseudo-distribution q. It will be clear from our next lemma why we define these constraint sets.
The advantage of probability and profile discretization we described earlier is that many types in the set share the same probability value of being observed and our goal is to group them using these variables. Exploiting this idea, we next give a different formulation for the DPML objective.
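The grouping idea can be sketched in a few lines: build the count matrix from per-symbol (probability, frequency) pairs and check the two constraints that define the feasible set. The names `build_X` and `in_K` are hypothetical, and the sketch assumes that the vector multiplying X in the mass constraint is the vector of discretized probability values and that column 0 is reserved for unseen symbols (frequency 0).

```python
def build_X(symbol_probs, symbol_freqs, probs_grid, freqs_grid):
    """X[i][j] = number of symbols with probability probs_grid[i] and frequency freqs_grid[j].

    Assumption: freqs_grid[0] == 0 is the column for unseen symbols.
    """
    row = {r: i for i, r in enumerate(probs_grid)}
    col = {m: j for j, m in enumerate(freqs_grid)}
    X = [[0] * len(freqs_grid) for _ in probs_grid]
    for r, m in zip(symbol_probs, symbol_freqs):
        X[row[r]][col[m]] += 1
    return X


def in_K(X, probs_grid, freqs_grid, phi_disc):
    """Check both constraints: column sums over seen frequencies match the
    discrete profile, and the total probability mass is at most 1."""
    col_sums = [sum(X[i][j] for i in range(len(X))) for j in range(len(freqs_grid))]
    profile_ok = all(
        col_sums[j] == phi_disc.get(freqs_grid[j], 0)
        for j in range(1, len(freqs_grid))      # column 0 (unseen) is unconstrained
    )
    mass = sum(probs_grid[i] * sum(X[i]) for i in range(len(X)))
    return profile_ok and mass <= 1 + 1e-12


probs_grid = [0.25, 0.5]
freqs_grid = [0, 1, 2]
# Three symbols: probability 0.5 seen twice, 0.25 seen once, 0.25 unseen.
X = build_X([0.5, 0.25, 0.25], [2, 1, 0], probs_grid, freqs_grid)
feasible = in_K(X, probs_grid, freqs_grid, {1: 1, 2: 1})
```

Note that only the seen columns are tied to the profile; the unseen column enters only through the mass constraint, as in the remark following the definition of the constraint set.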
Lemma 4.10 (DPML objective reformulation).
For any discrete pseudo-distribution and discrete profile :
[TABLE]
Proof.
Recall from Equation 3,
[TABLE]
For convenience, we call a type valid if it belongs to set . Recall that variable represents the number of domain elements with probability value and frequency . In this representation and for the discrete pseudo-distribution q, each valid type corresponds to the following unique variable assignment : . Using the previous expression it is not hard to write the exact expression for the probability term associated with the valid type ,
[TABLE]
The previous discussion showed that every valid type corresponds to a unique variable assignment. However, this uniqueness property no longer holds in the reverse direction, and multiple valid types might share the same variable assignment. This is where our grouping occurs, and it is the case that we study next.
For any variable assignment , it is clear from the middle term in Equation 7 that all valid types associated with share the same probability value of being observed. With this observation, it is now enough to argue about the number of valid types associated with a variable assignment to prove our lemma. We make this argument next by constructing all valid types associated with .
First consider all domain elements with a fixed probability value and the number of these elements is equal to . We can generate part of a valid type corresponding to probability value by picking any partition of these domain elements into groups of sizes . This corresponds to a multinomial coefficient and the number of types associated with is just,
[TABLE]
Here we only generated partial valid types corresponding to probability value . To generate a full valid type we just need to combine these partial valid types generated for each probability value . Let denote all such full valid types associated with a variable assignment and generating a full valid type corresponds to groups (for each probability value ) of independent possibilities considered conjointly. Further the cardinality of set is just the multiplication of cardinalities of each of these groups and is explicitly written below,
[TABLE]
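The counting argument can be checked on small examples. The sketch below computes the product of per-row multinomial coefficients and compares it against a brute-force enumeration of distinct frequency assignments to labeled domain elements. The function names are illustrative, and each row of `X` is assumed to hold the group sizes for one probability value.

```python
from itertools import permutations
from math import factorial, prod


def multinomial(counts):
    """Number of ways to split sum(counts) labeled items into groups of the given sizes."""
    return factorial(sum(counts)) // prod(factorial(c) for c in counts)


def num_valid_types(X):
    """Product over probability values (rows of X) of the multinomial coefficients."""
    return prod(multinomial(row) for row in X)


def brute_force_types(X, freqs_grid):
    """Count distinct assignments of frequencies to labeled elements, row by row."""
    total = 1
    for row in X:
        labels = [m for m, c in zip(freqs_grid, row) for _ in range(c)]
        total *= len(set(permutations(labels)))
    return total


X_small = [[2, 1, 1]]                              # 4 elements share one probability value
count = num_valid_types(X_small)                   # 4! / (2! 1! 1!) = 12
check = brute_force_types(X_small, [0, 1, 2])      # same count by enumeration
```

The brute-force version is exponential and is only meant to confirm the closed form on toy inputs.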
We are almost done with the proof and all we do next is formally derive the expression in our lemma statement to complete the proof. From Equation 3,
[TABLE]
∎
In the lemma above we wrote the in terms of constraint set and to use this definition we need access to pseudo-distribution q. We overcome this difficulty in our next lemma by giving an inequality that relates with constraint set that only depends on and not q itself.
Lemma 4.11 (DPML objective relaxed).
For any sequence , and a discrete pseudo-distribution the DPML objective can be upper bounded by:
[TABLE]
where is discrete profile of .
Proof.
The proof follows because and by invoking Lemma 4.10. ∎
In the above lemma we only showed one side of the inequality and it is not clear how working with the RHS relates to the LHS. In Section 4.5 we present an algorithm to achieve the other side of the inequality. The cardinality of set in the above formulation is small and we formalize this next.
Lemma 4.12 (Cardinality of ).
For any sequence and its associated discrete profile :
[TABLE]
Proof.
is a set of vectors in and each coordinate takes an integer value in (Lemma 4.1 combined with the constraint ensures this fact). The lemma statement follows because . ∎
In our final optimization problem we just optimize over one term in the set instead of working with the summation over all the terms. Focusing on the largest of these terms gives a approximation of the sum. Combining this with Lemma 4.12 motivates us to consider the following objective; define:
[TABLE]
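The step from the sum to its largest term rests on an elementary sandwich bound. It is written here for generic non-negative terms t(X) indexed by a finite set K; both symbols are placeholders for the summands and the constraint set above.

```latex
\[
\max_{X \in K} t(X) \;\le\; \sum_{X \in K} t(X) \;\le\; |K| \cdot \max_{X \in K} t(X)
\]
```

The largest term therefore approximates the sum to within a multiplicative factor of |K|, which is why a small bound on the cardinality of the constraint set matters.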
It is important to note that there is a discrete -pseudodistribution that corresponds to each variable assignment . The description of this distribution is as follows: for each , the number of domain elements with probability value in q is equal to (footnote 6: this description only provides non-zero probability values and does not provide any labels; however, it is sufficient for estimating all symmetric properties mentioned in this paper). We now define the optimization problem involving that also helps us compute the term that is largest in the summation of terms in Equation 8. After this definition, we provide a lemma relating the PML objective with this new optimization problem.
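As a small illustration of reading a pseudo-distribution off a variable assignment, the sketch below assumes that the number of domain elements with a given probability value is the corresponding row sum of X, matching the constraint used to define the set that depends on q. The function names are hypothetical. Because the output is an unlabeled multiset of probability values, it is enough to evaluate a symmetric property such as entropy.

```python
from math import log


def pseudo_distribution_from_X(X, probs_grid):
    """Unlabeled probability values: probs_grid[i] repeated (row sum of X[i]) times."""
    values = []
    for r, row in zip(probs_grid, X):
        values.extend([r] * sum(row))
    return values


def entropy(q):
    """Shannon entropy (in nats) of the non-zero values of q; a symmetric property."""
    return -sum(v * log(v) for v in q if v > 0)


X = [[1, 1, 0], [0, 0, 1]]                 # rows: probability 0.25 and 0.5
q = pseudo_distribution_from_X(X, [0.25, 0.5])
total_mass = sum(q)                        # at most 1 for a pseudo-distribution
H = entropy(q)
```

In this example the values happen to sum to exactly 1; in general a pseudo-distribution may have total mass below 1.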
Definition 4.13 (Single discrete profile maximum likelihood).
For any sequence and its associated discrete profile , a single discrete profile maximum likelihood (SDPML) distribution is:
[TABLE]
and is the pseudo-distribution corresponding to .
Lemma 4.14 (SDPML relation to PML).
For any sequence ,
[TABLE]
where and are the profile and discrete profile associated with .
Proof.
[TABLE]
The second inequality follows from Lemmas 4.12 and 4.11, and the last follows from Corollary 4.9. ∎
To simplify and better understand the expression in Lemma 4.14 just substitute and note that , and is just one term in the summation of terms in Equation 6. Using Lemma 4.10 we know that and combining this with previous lemma we get that the discrete pseudo-distribution is an -approximate PML pseudo-distribution. All we do next is provide a convex relaxation for function to arrive at our final optimization problem. This relaxation produces a real valued and later we give a rounding algorithm to get an integral solution.
4.4 Convex relaxation of SDPML
In the previous subsection we showed that the SDPML objective is a good approximation to the PML objective. However the objective function of SDPML is defined only over the integers and in this subsection we present a convex relaxation of SDPML.
First, we consider the feasible set of SDPML and relax the integer constraint on variables to get the following new constraint set:
[TABLE]
In the later subsections, we show how we deal with these fractional solutions by presenting a rounding algorithm with a good approximation ratio.
Secondly, we relax the objective function of SDPML itself. The objective of SDPML is defined only on the integral set. We next define a continuous relaxation of this objective function which is also log-concave.
[TABLE]
The lemma below states that continuous version is not far from the actual SDPML objective.
Lemma 4.15 ( approximates SDPML objective).
For any sequence and its associated discrete profile . If , then
[TABLE]
Proof.
See Appendix C. ∎
A key fact about function is that it is log-concave, so we can apply optimization machinery from convex optimization to optimize it.
Lemma 4.16.
Function is log-concave in .
Proof.
See Appendix C. ∎
Maximizing a log-concave objective function over the relaxed convex set easily reduces to a convex optimization problem and can be solved efficiently. Below is the convex relaxation of our SDPML objective,
[TABLE]
The formulation above is in the form of a general optimization problem in [LSW15a], which solves it using a cutting plane method. The algorithm in [LSW15a] requires implementing a -2nd-order-optimization oracle (defined later in the appendix), and we provide an algorithm to implement this -2nd-order-optimization oracle for our convex program. Further, to upper bound the number of calls to such an oracle we need to bound the singular values of our constraint matrix. Putting everything together, we get the following theorem.
Theorem 4.17 (Solver for convex relaxation to SDPML).
There exists a cutting plane method based algorithm that outputs a feasible solution to optimization problem 12, i.e. and satisfies:
[TABLE]
in time.
Proof.
See Appendix D. ∎
4.5 Algorithm and runtime analysis
Here we give the complete description of our final algorithm to find an approximate PML distribution. The analysis in previous sections suggests that it suffices to find a discrete pseudo-distribution that approximates the SDPML objective, which we replaced by a convex relaxation. First, we give the complete algorithm. Then, we present the algorithm that takes an optimal solution to the convex proxy for SDPML and produces an approximate PML distribution. Recall that
\[
\textbf{K}^{f}_{\phi^{\prime}} \stackrel{\mathrm{def}}{=} \{ X \in \mathbb{R}^{b_{1}\times(b_{2}+1)} ~\big|~ (X^{T}\mathrm{1})_{[1,b_{2}]} = \phi^{\prime}, \text{ and } \zeta^{T}X\mathrm{1} \leq 1 \}.
\]
In the algorithm we first maximize over the set of fractional solutions instead of and we round our solution to an integral solution that belongs to the extended set of . The rounding algorithm is presented next.
The solution returned by the rounding procedure is defined on an extended discretized probability space , where . To derive the relation between solution and PML objective value we need to extend some definitions studied earlier. First, we define as the vector whose entries are exactly the elements of . Note we still use for all to refer to elements of . Further, for any pseudo-distribution q with all its probability values in set (we call it an extended discrete pseudo-distribution) and discrete profile , we first define following extensions of sets and ,
[TABLE]
[TABLE]
where and denote the number of domain elements with probability value .
Further by Lemma 4.10, for any extended discrete pseudo-distribution q and a discrete profile , the following equality holds,
[TABLE]
Similarly for any , below are the natural extension of definitions of functions and ,
[TABLE]
[TABLE]
We are ready to analyze our rounding algorithm. First, we provide some interesting properties of the solution returned by our rounding procedure.
Claim 4.18.
The solution returned by rounding procedure (2) above satisfies:
** 2. 2.
.
Proof.
Claim (1) follows because for all . Now note because of the adjustments made by new level sets. Further,
[TABLE]
The final inequality follows because and therefore and Claim (2) follows. ∎
We next show that for any solution returned by our rounding algorithm (2), the values and are close to each other; we summarize this next.
Lemma 4.19.
Any returned by the rounding procedure above satisfies:
[TABLE]
Proof.
See Appendix C. ∎
Further using Equation 13, for any , if is its corresponding extended discrete pseudo-distribution, then
[TABLE]
In our next lemma, we show that the solution returned by the rounding procedure approximates . Note from Lemma 4.14, we know that is a good approximation to the PML objective.
Lemma 4.20.
The solution returned by rounding procedure above satisfies:
[TABLE]
Proof.
For any and returned by our rounding procedure below are the explicit expressions for and :
[TABLE]
[TABLE]
We first bound the probability term:
[TABLE]
The first inequality follows because . The fourth inequality follows from AM-GM inequality. The final expression above is the probability term associated with and the equation above shows that our rounding procedure only increases the probability term and all that matters is to bound the counting term that we do next.
[TABLE]
In the derivation above we used (1) in Claim 4.18. It remains now to lower bound :
[TABLE]
The first and second inequality follow from Lemma 4.19 and Equation 17 respectively. In the third inequality we used because is the optimal solution over the relaxed constraint set, and finally we invoked Lemma 4.15 to relate and g. ∎
Now construct the extended discrete pseudo-distribution corresponding to the solution returned by Algorithm 2 by assigning elements with a probability value of . We next provide the proof of our main theorem, which shows that the distribution is an approximate PML distribution.
See Theorem 3.1.
Proof.
Let be the pseudo-distribution corresponding to solution returned by Algorithm 2. Set , then:
[TABLE]
The first inequality follows because , the second inequality from Corollary 4.7, the third inequality follows because (because we constructed from ) and computes just one term in the summation over (see the representation of as a summation over from Equation 15), the fourth inequality comes from Lemma 4.20, and the last inequality follows from Lemma 4.14.
We bound the total running time as follows. Given a profile , it takes to write down the discrete profile , then we need to solve the convex optimization problem 12, which further takes , and our final rounding algorithm can be implemented in time (). The claimed running time follows by combining these bounds. ∎
5 Unified optimal sample complexity for symmetric properties
Here we study the connection between a universal estimator and approximate PML. We first recall the following theorem in [ADOS16].
Theorem 5.1 (Theorem 4 of [ADOS16]).
For a symmetric property f, suppose there is an estimator , such that for any p and observed profile ,
[TABLE]
any -approximate PML distribution satisfies:
[TABLE]
Our goal here is to prove Theorem 3.3 that shows the following: computing an -approximate PML distribution is sufficient to get a plug-in universal estimator that is sample competitive for estimating support size, coverage, entropy and distance from uniform. The proof presented in [ADOS16] showed this connection for an -approximate PML estimator and it is easy to see the proof presented in [ADOS16] works for any -approximate PML estimator for constant . We will need the following two lemmas from [ADOS16, HR18].
Lemma 5.2 (Lemma 2 of [ADOS16]).
Let be a fixed constant. For entropy, support, support coverage, and distance to uniformity there exist profile based estimators that use the optimal number of samples, have bias and if we change any of the samples, changes by at most , where is a positive constant.
Lemma 5.3 ([HR18]).
See Theorem 3.3.
Proof.
Let f be the property we wish to estimate, p be the underlying distribution and are the observed sequence and profile. Set ( is a constant and so is ) and let be the estimator returned by Lemma 5.2. The bias of estimator is
[TABLE]
By McDiarmid's inequality we get:
[TABLE]
where is the change in when one of the samples is changed. Using these inequalities we get:
[TABLE]
In the derivation above we used (Lemma 5.2). Invoke Theorem 5.1 with we get:
[TABLE]
In the first inequality we used Lemma 5.3. ∎
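For reference, the standard two-sided form of McDiarmid's inequality for a function of n independent samples, each of which can change the function value by at most c, can be evaluated as follows. This is a minimal sketch with a hypothetical helper name; the specific values of c and the deviation used in the proof above are not reproduced here.

```python
from math import exp


def mcdiarmid_tail(t, n, c):
    """Upper bound on P(|f - E[f]| >= t) when changing any one of the n
    independent samples changes f by at most c: 2 * exp(-2 t^2 / (n c^2))."""
    return min(1.0, 2.0 * exp(-2.0 * t * t / (n * c * c)))


# The bound shrinks as the per-sample sensitivity c decreases.
loose = mcdiarmid_tail(t=0.1, n=1000, c=0.01)
tight = mcdiarmid_tail(t=0.1, n=1000, c=0.001)
```

The bound is capped at 1 because the raw expression exceeds 1 for small deviations.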
Appendix A Minimum Probability
Here we provide the proof for our first technical lemma that gives a lower bound of for the minimum non-zero probability value of a -approximate PML distribution. To show such a result we use an independent rounding algorithm that is described in the lemma below. We need the following simple claim for the proof of our next lemma.
Claim A.1.
For any non-negative and non-zero vector v and a profile ,
[TABLE]
Proof.
[TABLE]
∎
See Lemma 4.1.
Proof.
We do independent rounding to show the existence of such a distribution. For notational convenience we use to denote the probability of symbol in the PML distribution . Let and for all we define a random variable as follows:
[TABLE]
Clearly ,
[TABLE]
and in general for any integer power of random variable we have:
[TABLE]
For the remaining () with we define:
[TABLE]
Define and .
[TABLE]
[TABLE]
[TABLE]
Define to be the concatenation of random vectors Y and Z. All random variables are mutually independent and we have:
[TABLE]
(From Equations 67 and 68 and the fact that is a constant random variable).
When we generate a random sample p from this distribution, we have a lower bound on the expected value of but this is misleading since p may not be a distribution. Scaling p to 1 could significantly reduce the value of if is large. However, we show that a constant fraction of the expectation of comes from the sample space with bounded . Here is a constant and assume . Note that:
[TABLE]
The last inequality follows because Z is a constant random vector.
[TABLE]
To argue that a constant fraction of the expectation comes from the sample space with small we need a tight upper bound for:
[TABLE]
For , we first upper bound the probability term:
[TABLE]
We will use Chernoff bounds here and to apply them, we convert the random variables into Bernoulli random variables. Define ,
[TABLE]
Equivalently:
[TABLE]
Define and . For any ,
[TABLE]
Since is a sum of Bernoulli random variables, by Chernoff bounds:
[TABLE]
Note from A.1 that:
[TABLE]
[TABLE]
Substituting back in Equation 69 we have (for ),
[TABLE]
[TABLE]
[TABLE]
The above inequality implies existence of a with and . Define ,
[TABLE]
[TABLE]
In the final inequality substitute and observe . Also our rounding procedure always ensures that the minimum non-zero entry of is , which further implies a lower bound on the minimum non-zero probability value of to be . Hence is our final distribution satisfying the conditions of the lemma. ∎
Appendix B Profile Discretization Lemma
Here we prove our profile discretization lemma. We first introduce a new definition called discrete type and then provide new formulations which help us in our proof.
Definition B.1 (Discrete type).
For a sequence , its discrete type is:
[TABLE]
For a sequence let be the set of all its distinct frequencies plus all integers less than and be elements of the set D. For this extended set D, the definition of profile is still the same and . In this extended definition there might be indices with and this extended definition helps us write a cleaner proof for the next lemma. We first state an equivalent formulation for the probability of its profile (from Equation 20 in [OSZ03], Equation 15 in [PJW17]) in terms of its type :
[TABLE]
where is the set of all permutations of domain set and is the number of unseen domain elements. The difference between Equation 24 and Equation 3 is the index set over which they are summed.
See Lemma 4.6.
Proof.
Let and be the type and discrete type of sequence respectively. By Equation 24:
[TABLE]
Similarly:
[TABLE]
where is the number of unseen domain elements in profile . Note because our discretization procedure does not change the number of unseen domain elements. We now analyze both objectives term by term. For any permutation
[TABLE]
The first inequality above follows because and using we get the following inequality.
[TABLE]
Let's consider terms and next:
[TABLE]
[TABLE]
Next we lower bound the same quantity:
[TABLE]
Combining both we get:
[TABLE]
To bound our final term we use the extended definition of D. In this definition of D we included all integers less than and we have for all . Similarly recall all integers less than also belong to set M and therefore for all . Now observe that any frequency strictly less than () is not discretized and,
[TABLE]
The number of domain symbols with is at most and . This further implies, . Hence the ratio evaluates to:
[TABLE]
Rewriting the final inequality:
[TABLE]
Combining Equations 25, 26 and 27 we have our result. ∎
Appendix C Remaining proofs for Section 4
Here we prove multiple lemmas associated with our functions and . Our first lemma shows that functions and approximate each other in their values and later we also show that function is log-concave in . To help readability of this section, let us recall the definitions of functions and . For any ,
[TABLE]
Also for any ,
[TABLE]
See Lemma 4.15.
Proof.
By Stirling's approximation, for all integers :
[TABLE]
We use a slightly weaker version of this inequality that holds for all integers ,
[TABLE]
[TABLE]
[TABLE]
In the above expression we used the fact that each , (Lemma 4.1 combined with the constraint ensures this fact). Also,
[TABLE]
∎
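The kind of factorial bound used in this proof can be checked numerically. The sketch below tests the classical bounds sqrt(2π) n^(n+1/2) e^(-n) ≤ n! ≤ e n^(n+1/2) e^(-n) in log form, together with one cruder consequence, n log n - n ≤ log n! ≤ n log n - n + log n + 1. The crude form is an illustrative weakening chosen for this example and may differ from the exact weaker inequality used in the proof.

```python
from math import factorial, log, pi


def stirling_log_bounds(n):
    """Classical lower and upper bounds on log(n!) for integers n >= 1."""
    core = (n + 0.5) * log(n) - n
    return 0.5 * log(2 * pi) + core, 1.0 + core


def crude_log_bounds(n):
    """A weaker pair of bounds implied by the classical ones (illustrative)."""
    return n * log(n) - n, n * log(n) - n + log(n) + 1.0


tol = 1e-9
all_ok = True
for n in range(1, 151):
    exact = log(factorial(n))
    lo, hi = stirling_log_bounds(n)
    crude_lo, crude_hi = crude_log_bounds(n)
    # Chain: crude_lo <= lo <= log(n!) <= hi <= crude_hi (up to rounding error).
    all_ok = all_ok and (crude_lo <= lo + tol and lo <= exact + tol
                         and exact <= hi + tol and hi <= crude_hi + tol)
```

The gap between the upper and lower bounds is logarithmic in n, which is what makes replacing factorials by the smooth expression cheap in the exponent.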
Next we show that function is log-concave in and we need the following lemma to prove it.
Lemma C.1.
The function defined for all by
[TABLE]
is convex.
Proof.
Let . Direct calculation reveals that for all ,
[TABLE]
The Hessian matrix H is:
[TABLE]
Let and also be the entry wise square root vector,
[TABLE]
[TABLE]
[TABLE]
The last inequality holds because is a rank one matrix and its spectral norm is equal to 1:
[TABLE]
. ∎
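A quick numerical sanity check of this kind of convexity claim is sketched below. It assumes the function has the entropy-like form h(x) = Σ x_i log x_i - (Σ x_i) log(Σ x_i) on positive vectors; this form is our reading of how the lemma is used in the proof of log-concavity, not a formula stated in the surviving text. The check tests midpoint convexity on random pairs, which is evidence rather than a proof.

```python
import random
from math import log


def h(x):
    """Assumed form: sum_i x_i log x_i - (sum_i x_i) log(sum_i x_i), for x > 0."""
    s = sum(x)
    return sum(v * log(v) for v in x) - s * log(s)


def midpoint_gap(x, y):
    """(h(x) + h(y)) / 2 - h((x + y) / 2); non-negative for a convex function."""
    mid = [(a + b) / 2 for a, b in zip(x, y)]
    return (h(x) + h(y)) / 2 - h(mid)


random.seed(0)
gaps = [
    midpoint_gap([random.uniform(0.1, 5.0) for _ in range(4)],
                 [random.uniform(0.1, 5.0) for _ in range(4)])
    for _ in range(1000)
]
min_gap = min(gaps)
```

Under this assumed form, h is positively homogeneous (h(c x) = c h(x) for c > 0), so it is linear along rays through the origin and convex but not strictly convex.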
See Lemma 4.16.
Proof.
Recall the definition of :
[TABLE]
Taking on both sides:
[TABLE]
The first term is linear in and we consider the negative of second and third term and show it is convex.
[TABLE]
In the above expression is the 'th column of matrix . By Lemma C.1 each of the functions is convex and is also convex ( is concave). is the sum of a linear and a concave function, and is concave. ∎
In the remaining part of this section, we prove the final result of this section, which is used to bound the approximation guarantee of our rounding procedure. Recall that our rounding procedure introduces new probability values, resulting in an extended discretized probability space , where . To derive the relation between solution and PML objective value we defined extended sets and . Further for any , recall that functions and are defined as follows,
[TABLE]
[TABLE]
In the following lemma we show that for any returned by our rounding procedure the functions and approximate each other in their values. See Lemma 4.19.
Proof.
For all integers , recall the weaker version of Stirling's approximation we used earlier,
[TABLE]
Now,
[TABLE]
and
[TABLE]
Now and for any , is a convex combination of elements in P and therefore . In the above expression we used the fact that each , (For any , and further combined with the constraint (because ) ensures this fact). Also,
[TABLE]
In the second inequality we used the fact that the solution returned by our rounding procedure always satisfies for all , and . ∎
Appendix D Algorithm for solving our convex program
To make this section self-contained we start by recalling our original SDPML objective.
[TABLE]
We relaxed it to:
[TABLE]
where function is defined as:
[TABLE]
For the optimization problem can be formulated equivalently as:
[TABLE]
where the constraint set is given by
[TABLE]
and function is:
[TABLE]
Our constraint set is bounded and for any ,
[TABLE]
However, our function is not well behaved, as the boundedness of f doesn't imply any good polynomial bound on . We leverage the fact that our feasible set is bounded to define a new function which is close to our original function f inside the feasible region and is also well behaved outside it. Define:
[TABLE]
where and for any : . Hence optimizing is equivalent to optimizing in an approximate sense:
[TABLE]
Let be the matrix which corresponds to distribution . Recall the maximum PML objective is a probability term and is not hard to see that it is always between (lower bound comes from uniform distribution on ) and , (using a crude approximation) because they approximate the value of . Combining all we get that optimum value of both optimization problems in Equation 33 are always greater than .
In the rest of the section we show how to solve the optimization problem:
[TABLE]
which can be equivalently written as:
[TABLE]
where the convex set .
First we show how to solve a simple optimization problem which in turn will act as an oracle to solve our main optimization problem 46 using cutting plane method from [LSW15a]. The simple optimization problem which we will refer to as oracle here on is stated next:
[TABLE]
where , , K is the same convex set and is the same convex function defined above.
We implement the oracle, that is, solve optimization problem 35, by solving a sequence of unconstrained problems that penalize leaving the set K. Formally, for all we define:
[TABLE]
To implement our oracle we will show how to solve the following to high precision
[TABLE]
Our result will then follow by performing binary search on and invoking this subroutine.
For any let be the optimal solution for optimization problem 37 and also let be the optimal solution to 35. It is clear that:
[TABLE]
The second to last inequality follows because . Hence we have:
[TABLE]
The higher the value of , the greater the incentive to satisfy the constraint.
Lemma D.1.
For all the following holds
[TABLE]
where is the optimum solution pair for optimization problem 37.
Proof.
Direct calculation shows that the following derivatives for hold for all input:
[TABLE]
[TABLE]
By the optimality of and we know these derivatives are 0 at and therefore:
[TABLE]
Consequently,
[TABLE]
and substituting this and the value of into the formula for H yields
[TABLE]
Combining this equality with the following upper bound for yields the result:
[TABLE]
∎
Corollary D.2.
For any and , where
[TABLE]
Proof.
Suppose , then by Lemma D.1, it holds that:
[TABLE]
The final inequality follows from the conditions of the corollary. ∎
Next we show that is differentiable with respect to and therefore, H, , and are continuous with respect to . The crux is the following, simple, possibly well known fact whose proof is a slight modification of that in (cite geometric median).
Lemma D.3.
Let be a twice differentiable function and for all and define the function by and let . If is strictly concave for all then is differentiable as a function of .
Proof.
By the optimality conditions for we know that . Consequently, since is differentiable, differentiating with respect to yields by chain rule that
[TABLE]
However, since is strictly concave, all eigenvalues of this matrix are negative and this matrix is invertible, yielding the desired result. ∎
Lemma D.4.
Functions , and are continuous in .
Proof.
Since H is twice differentiable and H is strictly concave, Lemma D.3 implies that is differentiable and therefore continuous as a function of . Since and are continuous functions the result follows. ∎
Lemma D.5.
Let be the optimum solutions to Optimization problem 37 with respect to and respectively. For any :
[TABLE]
Proof.
Suppose that as the proof for when is analogous. Then since we have and
[TABLE]
The result follows as and . ∎
Corollary D.6.
Let be the optimum solutions to Optimization problem 37 with respect to and respectively. For any :
[TABLE]
Proof.
Given and . By the first part of Lemma D.5, . Suppose ; then by the second part of Lemma D.5 we have , a contradiction. ∎
Lemma D.7.
For any ,
[TABLE]
where
Proof.
Observe that we can optimize problem 37 with respect to and t independently. Let's look at the behaviour of the function with respect to . From Equations 40-42 we have:
[TABLE]
Also note that for because the term dominates and also there is a trivial solution with . Combining all we get and the function because all are bounded.
[TABLE]
∎
Lemma D.8.
For any , we can find a solution such that in time .
Proof.
Let's recall the objective of optimization problem 37:
[TABLE]
Let's recall the optimality conditions from Equation 39:
[TABLE]
Rearranging terms and taking exponential on Equation 43 yields:
[TABLE]
where and . Let and we define new variables which satisfy the following conditions,
[TABLE]
and we know that and should satisfy . Let's rewrite Equation 44 in terms of variables:
[TABLE]
This can be written equivalently as:
[TABLE]
From Lemma D.7, we can do binary search in to guess . Let and be the current lower and upper bounds for the value of : Assign and we can do binary search to find such that because for fixed a and K function is monotone (increasing) in ().
1. If , assign and Equation 44 is satisfied and we are done.
2. If , update that is, decrease our guess for to and observe that in the next iteration the values of all increase as is fixed.
3. Else if , update by an analysis similar to the case above.
4. Assign and repeat.
Note that we never have to work with the variables; we introduced them only to better understand our binary search procedure. From Lemma D.7 we have a good bound on and the above procedure finds a solution such that and (because we have closed form expression for ) in time . ∎
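The inner step of the procedure above relies on the monotonicity of a one-dimensional function, so a plain bisection suffices for that step. The following generic sketch is not the paper's exact routine: the function `f`, the target value, and the search interval are hypothetical stand-ins for the quantities in the proof.

```python
from math import exp


def bisect_increasing(f, target, lo, hi, tol=1e-12, max_iter=200):
    """Find x in [lo, hi] with f(x) close to target, assuming f is increasing
    and f(lo) <= target <= f(hi)."""
    for _ in range(max_iter):
        if hi - lo <= tol:
            break
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid          # the solution lies to the right of mid
        else:
            hi = mid          # the solution lies at mid or to its left
    return (lo + hi) / 2


# Hypothetical example: solve x * exp(x) = 3 on [0, 2].
root = bisect_increasing(lambda x: x * exp(x), 3.0, 0.0, 2.0)
```

Each iteration halves the interval, so the number of iterations is logarithmic in the ratio of the initial interval length to the tolerance, which is the source of the logarithmic factors in running-time bounds of this kind.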
Lemma D.9.
Optimization problem 35 can be solved to accuracy in time .
Proof.
First we show that solving optimization problem 37 for for which the solution pair satisfies for solves our main problem 35. Observe that the solution pair satisfies our constraint and also our objective value for problem 37 at is greater than as shown below:
[TABLE]
The first inequality follows because and the latter one follows from Equation 38. By similar reasoning we are also done if at the optimal solution pair (has closed form solution) satisfies the constraint and it is interesting if this constraint is not satisfied at . In such a case existence of such that follows from continuity (Lemma D.4) and boundedness of (Corollary D.2) for which the constraint is satisfied. Corollary D.6 and Lemma D.5 suggest that we can find an by binary search over the interval such that and Lemma D.8 finds a solution such that . Choose .
If then so is and value of .
Else then so is and value of .
We can do similar analysis for other terms in and the boundedness of follows because:
[TABLE]
Recall and combined with inequality above and implies:
[TABLE]
Now all that remains is to bound the objective value of optimization problem 35 .
[TABLE]
The whole procedure can be implemented in time . ∎
Now we are in good shape to solve our main optimization problem 46. First we write our optimization problem in vector form:
[TABLE]
where the convex set and our matrix (footnote 7: our matrix A is a sparse matrix and a matrix-vector product with it can be computed in time ) and with vector represent the linear constraints in the set .
[TABLE]
Formulation above is in the form of a general optimization problem in [LSW15a]. For convenience we redefine the optimization problem from [LSW15a]:
[TABLE]
where K is a convex set. To solve this general optimization problem, the algorithm in [LSW15a] requires us to implement a -2nd-order-optimization oracle, which is defined below:
Definition D.10.
Given a convex set K and . A -2nd-order-optimization oracle for K is a function on such that for any input and , it outputs such that
[TABLE]
We denote by the time complexity of this oracle.
Our simple optimization problem 35 is exactly the -2nd-order-optimization oracle for our main optimization problem 46. Consequently, all that remains to solve optimization problem 46 is to bound the eigenvalues of and put together the results of this section to obtain our desired running time. We do this in Lemma F.8 and Theorem D.12 respectively.
Lemma D.11.
The eigenvalues of matrix are either or of the form
[TABLE]
and therefore the smallest eigenvalue of is at least
[TABLE]
Proof.
Direct calculation shows that if is -dimensional all ones vector, is the -dimensional identity matrix and with then for all and we have
[TABLE]
Consequently is an eigenvector of with eigenvalue if and only if
[TABLE]
Now if then we see that is an eigenvector if and only if , in which case the eigenvalue is . On the other hand, if then we see that is an eigenvector of eigenvalue if and only if
[TABLE]
When this happens we have and solving for yields that
[TABLE]
Substituting this into yields the eigenvalues.
[TABLE]
The lower bound follows from the fact that for when and therefore
[TABLE]
The smallest eigenvalue is at least . Recall and is such that and . The lemma statement follows because , . ∎
Below is the theorem we invoke to solve the optimization problem.
Theorem D.12 (Theorem 56 from [LSW15b]).
Assume that , $\|b\|_{2} < M$, $\|c\|_{2} < M$, $\|\mathbf{A}\|_{2} < M$ and . Assume that and we have a -2nd-order-optimization oracle for every . For , we can find such that
[TABLE]
and $\|\mathbf{A}z - b\|_{2} \leq \delta$. This algorithm takes time
[TABLE]
where is the number of rows in A, and .
Theorem D.13.
Optimization problem 46 can be solved in time
Proof.
The proof follows by combining Theorem D.12 with Lemmas D.8 and D.9 and noting that all the parameters in the running time , are bounded by , so we only pay a logarithmic factor in these terms. ∎
Appendix E Proofs for multidimensional PML
Here we show how the techniques built throughout this paper apply to a more general setting. In particular, we provide an efficient algorithm for computing approximate PML in higher dimensions when the dimension is constant. The proofs and techniques are analogous to those for one-dimensional PML, but a few places, such as the proof of the minimum probability lemma and the singular value lower bound for the constraint matrix (used for optimization), require more general proofs.
E.1 Preliminaries for -dimensional objects
-tuple: c is a -tuple if . For all , we use to denote its 'th element.
Arithmetic operations on -tuples: For any two -tuples c, and an arithmetic operator , the operation denotes an element-wise operation, meaning it outputs another -tuple equal to . Further, for any -tuple c and scalar , the operation denotes an element-wise scalar operation, meaning it outputs another -tuple equal to . Only in the case of the power operation do we return a scalar value, which is equal to:
[TABLE]
Also for a -tuple c and scalar we define:
[TABLE]
Logic operations on -tuples: For any two -tuples c and and a logic operator , the operation is true if and only if is true for all . Further for any -tuple c and scalar , the logic operation is true iff is true for all .
Floor and ceil operations on -tuples: For a -tuple c and set S of -tuples we use the notation and to denote the following -tuples:
[TABLE]
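The arithmetic and logic conventions above can be illustrated with a minimal Python sketch. The helper names (`elementwise`, `scalar_op`, `tuple_power`, `all_leq`) are hypothetical, and the power rule used here (the product of coordinate-wise powers) is an assumption, since the defining display is not recoverable from the text; the floor and ceil operations are not covered.

```python
import operator
from math import prod
from typing import Callable, Tuple

KTuple = Tuple[float, ...]


def elementwise(c: KTuple, d: KTuple, op: Callable[[float, float], float]) -> KTuple:
    """Apply a binary arithmetic operator coordinate by coordinate."""
    if len(c) != len(d):
        raise ValueError("k-tuples must have the same length")
    return tuple(op(a, b) for a, b in zip(c, d))


def scalar_op(c: KTuple, s: float, op: Callable[[float, float], float]) -> KTuple:
    """Apply a binary operator between every coordinate and one scalar."""
    return tuple(op(a, s) for a in c)


def tuple_power(c: KTuple, d: KTuple) -> float:
    """ASSUMPTION: the power of two k-tuples is the scalar prod_j c_j ** d_j."""
    if len(c) != len(d):
        raise ValueError("k-tuples must have the same length")
    return prod(a ** b for a, b in zip(c, d))


def all_leq(c: KTuple, d: KTuple) -> bool:
    """A logic operation holds only if it holds in every coordinate."""
    return all(a <= b for a, b in zip(c, d))


total = elementwise((1, 2), (3, 4), operator.add)   # coordinate-wise sum
halved = scalar_op((2, 4), 2, operator.truediv)     # coordinate-wise division by a scalar
power = tuple_power((2, 3), (2, 1))                 # scalar: 2**2 * 3**1
```

Note that the comparison is a partial order: neither `all_leq((1, 4), (2, 3))` nor `all_leq((2, 3), (1, 4))` holds.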
We next recall (defined in Section 3.1) the setting for higher dimensions.
Setting for higher dimension: For each , we receive a sequence that consists of independent samples drawn from an underlying distribution supported on the same domain ; further, is independent of the other sequences for and . We call a -sequence and its -length. Let be the set of all -sequences of -length equal to n. We use to denote the probability of domain element in distribution . We also refer to as a -distribution and let be the set of all -distributions.
For any -distribution , the probability of a -sequence is defined as:
[TABLE]
Recall for each , is the frequency of domain element in sequence . For any -sequence , we call the -frequency of domain element in . Let be the set of all -frequencies generated by different domain elements in all possible -sequences in and we use to denote its th element.
We next define a few more -dimensional objects of interest.
-vector: is a -vector if for each element , is a vector supported on the same domain . We use to denote the row corresponding to domain element and to denote its 'th column. Let be the set of all -vectors and note that a -distribution is a -vector.
Norm of -vectors: For a -vector v, its norm denoted by is a -tuple equal to .
-pseudodistribution: is a -pseudodistribution if for each element , is a pseudo-distribution supported on the same domain or equivalently . Let be the set of all -pseudodistributions and .
-level set: For a -distribution p and -pseudodistribution q, we call and -level sets corresponding to respectively.
-Type: For any -sequence , represents -type of and we call n its -length. Recall is the type of sequence and we overload notation and let denote . We use to denote the row corresponding to domain element and all mean the same thing. Let be the set of all -types of -length equal to n.
For a -distribution , the probability of a -type is:
[TABLE]
We use the following shorthand notation to denote the counting term in the above expression.
[TABLE]
-Profile: For any -sequence , is a -profile if and is the number of domain elements with -frequency . (Footnote 9: The -profile does not contain the -frequency element because we don't know the number of unseen domain symbols.) We call n the -length of and use to denote the set of all -profiles of -length equal to n.
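As a concrete illustration of the definitions of frequency and profile in the multidimensional setting, the following minimal Python sketch computes both from a tuple of sequences. The function names `k_frequencies` and `k_profile` are hypothetical; sequences are represented as strings whose characters are domain elements.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def k_frequencies(k_sequence: Sequence[Sequence[Hashable]]) -> Dict[Hashable, Tuple[int, ...]]:
    """Map every observed domain element to its tuple of per-sequence counts."""
    counts = [Counter(seq) for seq in k_sequence]
    symbols = set().union(*(c.keys() for c in counts))
    # Counter returns 0 for symbols that are absent from a given sequence.
    return {x: tuple(c[x] for c in counts) for x in symbols}


def k_profile(k_sequence: Sequence[Sequence[Hashable]]) -> Counter:
    """Count how many domain elements share each frequency tuple.

    Only observed symbols are included, so the all-zero frequency
    (unseen symbols) never appears in the result.
    """
    return Counter(k_frequencies(k_sequence).values())


# 'a' has frequency tuple (2, 1), 'b' has (1, 1), and 'c' has (0, 1).
freqs = k_frequencies(["aab", "abc"])
profile = k_profile(["aab", "abc"])

# The profile is label-invariant: renaming symbols does not change it.
relabeled = k_profile(["xxy", "xyz"])
```

The label invariance shown in the last line is the reason the profile is a sufficient statistic for symmetric properties.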
For any -distribution , the probability of a -profile is defined as:
[TABLE]
We can also define the -profile of a -type . We overload notation and use to denote the -profile associated with -type and . Consider all types such that and observe that they all have the same value. We use notation to represent this quantity:
[TABLE]
[TABLE]
Profile maximum likelihood: For any -profile , a Profile Maximum Likelihood (PML) -distribution is:
[TABLE]
and is the maximum PML objective value.
Approximate profile maximum likelihood: For any -profile , a -distribution is a -approximate PML -distribution if
[TABLE]
Note: As in the case of one dimension, we extend and use the following definition for for any -vector. Further, for any probability terms defined in the future involving p, we assume those expressions are also defined for any -vector v just by replacing by everywhere and mean the same thing.
Probability discretization: Let be the set representing discretization of -probability space where for each , is a -level set. Further all elements in P are of the form for some fixed and for all possible index , where for each , is such that and .
Discrete -pseudodistribution: For any -distribution , its discrete -pseudodistribution is defined as:
[TABLE]
We use to denote the set of all discrete -pseudodistributions. Note that and .
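The rounding step behind the discrete pseudodistribution can be sketched in Python as follows. This is a minimal sketch under an assumption: each coordinate uses a geometric grid whose consecutive points differ by a factor of (1 + eps), starting from a smallest value `rho_min`. The exact grid constants of the paper are not recoverable from the text, so `rho_min`, `eps`, and all function names are illustrative.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def geometric_grid(rho_min: float, eps: float) -> List[float]:
    """Grid rho_min * (1 + eps)^i, extended until the last point exceeds 1."""
    grid = [rho_min]
    while grid[-1] <= 1.0:
        grid.append(grid[-1] * (1.0 + eps))
    return grid


def round_down(p: float, grid: Sequence[float]) -> float:
    """Largest grid point that is at most p (zero stays zero)."""
    if p == 0.0:
        return 0.0
    if p < grid[0]:
        raise ValueError("probability below the smallest grid point")
    return grid[bisect_right(grid, p) - 1]


def discretize(p_tuple: Sequence[float], grids: Sequence[Sequence[float]]) -> Tuple[float, ...]:
    """Round each coordinate of one domain element's probabilities down to its grid."""
    return tuple(round_down(p, g) for p, g in zip(p_tuple, grids))


grid = geometric_grid(rho_min=0.001, eps=0.1)
q = discretize((0.5, 0.0123), (grid, grid))
```

Because rounding only moves values down, and by less than a (1 + eps) factor, every rounded value q satisfies q <= p <= (1 + eps) q. The rounded vector therefore sums to at most 1 in each coordinate, which is why the result is a pseudodistribution rather than a distribution.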
Multiplicity discretization: Let be the set representing discretization of multiplicity space where each element represents a -frequency. Further each element is of the following form: for each , for some fixed and is such that , and as before . Note that .
Discrete -type: For a sequence , is its discrete -type if .
Discrete -profile: For a -sequence , is a discrete -profile if , where and is its -length.
E.2 Existence of Structured Approximate Solution
Here we show the existence of an approximate PML -distribution with a nice structure over the next several lemmas. First, we show that one can assume the minimum non-zero probability of the PML -distribution is for each by only losing in the PML objective value.
Lemma E.1 (Minimum probability lemma).
For any -profile , there exists a -distribution such that is a -approximate PML -distribution with for all .
Proof.
See Section F.1. ∎
Next we show that working with discrete -level sets and -frequencies doesn't significantly decrease the PML objective value. Our next lemma formally proves this statement.
Lemma E.2 (Probability discretization lemma).
For any -profile and -distribution , its discrete -pseudodistribution satisfies:
[TABLE]
Proof.
The first inequality is immediate because for all . To show the second inequality, consider any -sequence ,
[TABLE]
In the inequality above we use for all . Now,
[TABLE]
∎
Our previous lemma showed that we can work in the discretized probability space, and in our next lemma we show that discretization of multiplicities also doesn't change our objective value by much. For a -sequence , we first provide an equivalent formulation for the probability of its -profile (from Equation 20 in [OSZ03], Equation 15 in [PJW17]) in terms of its -type . The formulations provided in [OSZ03], [PJW17] are for two dimensions, and it is not hard to see that these formulations generalize to higher dimensions in the following way:
[TABLE]
where is the set of all permutations of domain set and is the number of domain elements with frequency (unseen domain elements). The difference between Equation 51 and Equation 50 is the index set over which they are summed.
Lemma E.3 (Profile discretization lemma).
For any -distribution , and a -sequence :
[TABLE]
where and are the -profile and discrete -profile of respectively.
Proof.
Let and be -type and discrete -type of -sequence respectively. By Equation 51:
[TABLE]
Similarly:
[TABLE]
where is the number of unseen domain elements in profile . Note because our discretization procedure does not change the number of unseen domain elements. We now analyze both objectives term by term. For any permutation
[TABLE]
The first inequality above follows because and using we get the right hand side of the following inequality.
[TABLE]
Let us consider the terms and ; we upper bound their ratio next:
[TABLE]
Next we will lower bound the ratio considered above.
[TABLE]
Combining both we get:
[TABLE]
For the final term, consider all -frequencies generated by domain elements in -sequence . Observe that during our discretization procedure all -frequencies less than are never affected, and we upper bound the number of -frequencies that change.
Analogous to proof in one dimension, for each , the number of domain elements with is less than . Further, the number of domain elements with for any is less than . The previous statement upper bounds . This further implies . Combining the previous reasoning with the fact that all -frequencies less than are never changed we get the following inequality.
[TABLE]
Combining the previous inequality with eq. 52 and eq. 53, we have our result. ∎
Our next corollary captures the impact of discretizing both probabilities and multiplicities.
Corollary E.4 (Discretization lemma).
For any -distribution , and a -sequence . If is the discrete -distribution of p then,
[TABLE]
where and are the -profile and discrete -profile of respectively.
Proof.
The corollary follows immediately by combining Lemma E.2 and Lemma E.3. ∎
The discretization lemma above motivates the definition of a new objective function which we introduce and study next.
E.3 Discrete PML Optimization
Here we define a new optimization problem that can be solved efficiently and returns a -distribution which has a good approximation to the PML objective value. First we define the discrete profile maximum likelihood which is just the PML objective maximized over discrete -pseudodistributions.
Definition E.5 (Discrete profile maximum likelihood).
Let be any -sequence, and be its -profile and discrete -profile respectively, a Discrete Profile Maximum Likelihood (DPML) -pseudodistribution is:
[TABLE]
is the maximum objective value.
Corollary E.6 (DPML is an approximate PML).
For any -sequence ,
Proof.
Note that is a discrete -pseudodistribution. The result follows from Corollary E.4 applied to . ∎
In the next two lemmas we rephrase the DPML optimization problem in forms that are amenable to convex relaxation. To do this, we introduce some new notation.
Let be the matrix with rows indexed between to b whose th row is equal to -level set . Also let be the vector with rows indexed between 0 to e. Its zeroth row (denoted by ) is equal to -frequency and its th row is equal to -frequency . We use and to denote the th column of matrix and respectively.
Let be a variable matrix and we use for to denote elements of this matrix. As in the case for vector , our second index of variable matrix starts at 0 and not at . Here the variable counts the number of domain elements with -level set and -frequency equal to . is counting the number of domain elements with -level set and -frequency equal to . We use function and to perform entrywise operations returning entities of same dimension as and respectively with applied on every entry.
For any matrix v and set , we use to denote the matrix with rows corresponding to index set .
For a discrete -profile (corresponding to -sequence ), define:
$$\mathbf{K}_{\phi'} \stackrel{\mathrm{def}}{=} \left\{ X \in \mathbb{Z}_{+}^{\mathbf{b}\times(\mathbf{e}+1)} ~\middle|~ (X^{T}\mathbf{1})_{[1,\mathbf{e}]} = \phi', \text{ and } (X\mathbf{1})^{T}\zeta \leq 1 \right\}$$
Note in the expression above is a -tuple and means each entry of this -tuple is less than 1 (as described in the preliminaries section).
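The membership conditions of this set can be checked mechanically. The sketch below is a minimal illustration, assuming the variable matrix is given as a list of rows (one per level set) whose column 0 corresponds to the zero frequency, the discrete profile as a list of counts for columns 1 through e, and the level-set matrix as a list of tuples. The function name `in_K_phi` and the floating-point tolerance are hypothetical.

```python
from typing import Sequence


def in_K_phi(X: Sequence[Sequence[int]],
             phi: Sequence[int],
             zeta: Sequence[Sequence[float]],
             tol: float = 1e-12) -> bool:
    """Check non-negative integrality, the profile constraint, and the mass constraint."""
    # Entries must be non-negative integers.
    if any(x < 0 or x != int(x) for row in X for x in row):
        return False
    # Column sums over columns 1..e must reproduce the profile (column 0 is free).
    col_sums = [sum(row[j] for row in X) for j in range(len(X[0]))]
    if col_sums[1:] != list(phi):
        return False
    # (X 1)^T zeta is a tuple; every coordinate must be at most 1.
    row_sums = [sum(row) for row in X]
    dims = len(zeta[0])
    mass = [sum(row_sums[i] * zeta[i][d] for i in range(len(X))) for d in range(dims)]
    return all(m <= 1.0 + tol for m in mass)


zeta = [(0.5, 0.25), (0.25, 0.25)]               # two level sets in two dimensions
feasible = in_K_phi([[0, 1], [1, 1]], [2], zeta)  # mass is (1.0, 0.75)
too_heavy = in_K_phi([[1, 1], [1, 1]], [2], zeta) # mass is (1.5, 1.0)
```

Dropping the integrality check gives the membership test for the relaxed set used in the convex relaxation.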
For a discrete -profile (of ) and a discrete -pseudodistribution q, also define:
$$\mathbf{K}_{\mathbf{q},\phi'} \stackrel{\mathrm{def}}{=} \left\{ X \in \mathbb{Z}_{+}^{\mathbf{b}\times(\mathbf{e}+1)} ~\middle|~ (X^{T}\mathbf{1})_{[1,\mathbf{e}]} = \phi', \text{ and } X\mathbf{1} = \ell^{\mathbf{q}} \right\}$$
where and denote the number of domain elements with -level set in -pseudodistribution q.
One of the most important advantages of -level set and -frequency discretization we described earlier is that many -types in the set share the same probability value of being observed and our goal is to group them using the variables. Exploiting this idea, we next give a different formulation for the DPML objective.
Lemma E.7 (DPML objective reformulation).
For any discrete -pseudodistribution and discrete -profile :
[TABLE]
Proof.
Recall from Equation 50
[TABLE]
For convenience, we call a -type valid if it belongs to set . Recall variable represents the number of domain elements with -level set and have -frequency equal to . In this representation and for the discrete -pseudodistribution q, each valid -type corresponds to the following unique variable assignment :
[TABLE]
and from the expression above it is not hard to write the exact expression for the probability term associated with the valid -type :
[TABLE]
For any variable assignment , it is clear from the middle term in Equation 56 that all valid -types associated with share the same probability value of being observed. With this observation, it is now enough to argue about the number of valid -types associated with a variable assignment to prove our lemma. We make this argument next by constructing all valid -types associated with .
First consider all domain elements with a fixed -level set and number of such elements is equal to . We can now generate part of a valid -type corresponding to the domain elements with -level set equal to by picking any partition of these domain elements into groups of sizes . This corresponds to multinomial coefficient and therefore the number of types associated with is just:
[TABLE]
Here we only generated partial valid -types corresponding to domain elements with -level set equal to . To generate a full valid -type we just need to combine these partial valid -types generated for each -level set . Let denote all such full valid -types associated with a variable assignment and generating a full valid -type corresponds to groups (for each -level set ) of independent possibilities considered conjointly. Further the cardinality of set is just the multiplication of cardinalities of each of these groups and is explicitly written below,
[TABLE]
We are almost done and all we do next is formally derive the expression in our lemma statement to complete the proof. From Equation 50,
[TABLE]
∎
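The counting argument in the proof of Lemma E.7 can be checked numerically on small instances. The sketch below assumes each row of the assignment matrix lists how many domain elements of one level set receive each frequency; the number of valid types is then the product of one multinomial coefficient per row. The function names are hypothetical, and the brute-force comparison is only feasible for tiny rows.

```python
from itertools import permutations
from math import factorial, prod
from typing import Sequence


def multinomial(counts: Sequence[int]) -> int:
    """Number of ways to split sum(counts) labeled items into groups of the given sizes."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)  # each intermediate quotient is an integer
    return total


def num_valid_types(X: Sequence[Sequence[int]]) -> int:
    """Rows are independent, so the per-row counts multiply."""
    return prod(multinomial(row) for row in X)


def brute_force_row(counts: Sequence[int]) -> int:
    """Count distinct frequency assignments for one row by enumeration."""
    labels = [j for j, c in enumerate(counts) for _ in range(c)]
    return len(set(permutations(labels)))


X = [[2, 1, 1], [1, 1]]
count = num_valid_types(X)  # 12 * 2
```

The brute-force count agrees with the multinomial coefficient for each row, which is the identity the proof relies on.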
Lemma E.8 (DPML objective relaxed).
For any -sequence , and a discrete -pseudodistribution the DPML objective can be upper bounded by:
[TABLE]
where is discrete -profile of .
Proof.
The proof follows because and invoking Lemma E.7. ∎
We are halfway through defining our final optimization problem, which admits efficient algorithms. In our final optimization problem we optimize over just one term in the set instead of working with the summation over all the terms. The next two lemmas motivate this choice by showing that the optimizing -pseudodistribution of our final optimization problem is still an approximate PML -distribution.
Lemma E.9 (Cardinality of ).
For any -sequence and its associated discrete -profile :
[TABLE]
Proof.
is a set of vectors in and because of Lemma E.1 combined with the constraint , each takes only positive integer values less than . The lemma statement follows by substituting the values of b and e. ∎
As described earlier, Lemma E.9 motivates us to consider the following objective. Define:
[TABLE]
It is important to note that there is a discrete -pseudodistribution that corresponds to each variable assignment . The description of this -distribution is as follows: For each , the number of domain elements that have -level set in q is equal to . This description only provides non-zero -level sets and does not provide any labels; however, it is sufficient for estimating all symmetric properties mentioned in this paper.
Definition E.10 (Single discrete profile maximum likelihood).
For any -sequence and its associated discrete -profile , a Single Discrete Profile Maximum Likelihood (SDPML) -pseudodistribution is:
[TABLE]
and is the -pseudodistribution corresponding to .
Lemma E.11 (SDPML relation to PML).
For any -sequence ,
[TABLE]
where and are -profile and discrete -profile associated with .
Proof.
[TABLE]
The second inequality follows from Lemmas E.9 and E.8, and the last follows from Corollary E.6. ∎
E.4 Convex relaxation of SDPML
We showed in the previous subsection that the SDPML objective is a good approximation to the PML objective. However the objective function of SDPML is defined only over the integers and in this subsection we present a convex relaxation of SDPML.
First, we consider the feasible set of SDPML, which is the following integral polytope
[TABLE]
We relax the integer constraint on variables :
[TABLE]
In later subsections, we show how we deal with these fractional solutions by presenting a rounding algorithm with a good approximation ratio.
Secondly, we relax the objective function of SDPML itself. The objective of SDPML is defined only on the integral set. We next define a continuous relaxation of this objective function which is also log-concave. To do so, we use an approximation of the factorial function (similar to Stirling's approximation) which handles terms as well. We use the following function as the continuous proxy of the SDPML objective (using the convention that ):
[TABLE]
The lemma below states that continuous version is not far from the actual SDPML objective.
Lemma E.12 ( approximates SDPML objective).
For any -sequence and its associated discrete -profile . If , then
[TABLE]
Proof.
By Stirling's approximation, for all integers :
[TABLE]
We use a slightly weaker version of this inequality that holds for all integers ,
[TABLE]
[TABLE]
In the final inequality we used the fact that each , (Lemma E.1 combined with the constraint ensures this fact) and substituted the value of b. Also,
[TABLE]
∎
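The kind of factorial bound used in this proof can be sanity-checked numerically. The sketch below verifies, in log space, the classical Stirling bounds sqrt(2 pi) n^(n+1/2) e^(-n) <= n! <= e n^(n+1/2) e^(-n) for n >= 1, together with one weaker form that also covers n = 0 under the convention 0 log 0 = 0. The weaker form written here is an assumption chosen for illustration; the exact inequality in the paper is not recoverable from the text.

```python
from math import lgamma, log, pi
from typing import Tuple


def log_factorial(n: int) -> float:
    """log(n!) computed via the log-gamma function."""
    return lgamma(n + 1)


def stirling_log_bounds(n: int) -> Tuple[float, float]:
    """Log of the classical lower and upper Stirling bounds, valid for n >= 1."""
    core = (n + 0.5) * log(n) - n
    return 0.5 * log(2 * pi) + core, 1.0 + core


def weak_log_bounds(n: int) -> Tuple[float, float]:
    """ASSUMED weaker form, valid for n >= 0 with the convention 0 log 0 = 0:
    n log n - n <= log n! <= 1 + 0.5 log(n + 1) + n log n - n.
    """
    core = n * log(n) - n if n > 0 else 0.0
    return core, 1.0 + 0.5 * log(n + 1) + core


TOL = 1e-9
classical_ok = all(
    stirling_log_bounds(n)[0] - TOL <= log_factorial(n) <= stirling_log_bounds(n)[1] + TOL
    for n in range(1, 500)
)
weak_ok = all(
    weak_log_bounds(n)[0] - TOL <= log_factorial(n) <= weak_log_bounds(n)[1] + TOL
    for n in range(0, 500)
)
```

The gap between the lower and upper bound in the weaker form grows only like log(n + 1), which is why replacing factorials by the smooth proxy costs only a modest multiplicative factor per term.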
A key fact about function is that it is log-concave, so we can apply optimization machinery from convex optimization.
Lemma E.13.
Function is log-concave in .
Proof.
Taking on both sides of Equation 60 we get,
[TABLE]
The first term is linear in ; refer to Lemma C.1 for the concavity of the second term. Combining both, we get that is a sum of a linear term and a concave term and is therefore concave. Therefore, the function is log-concave. ∎
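The concavity claim can be probed numerically. The sketch below assumes that the non-linear part of the log-objective contributes, for each row x of the variable matrix, a term of the form (sum of x) log (sum of x) minus the sum of x_j log x_j. This form is an assumption, since the display is not recoverable from the text. Such a term equals the row sum times the Shannon entropy of the normalized row, which is concave as the perspective of a concave function. The check is a midpoint test on random non-negative points; it is evidence, not a proof.

```python
import random
from math import log
from typing import List, Sequence


def xlogx(t: float) -> float:
    """t log t with the convention 0 log 0 = 0."""
    return t * log(t) if t > 0 else 0.0


def row_term(row: Sequence[float]) -> float:
    """ASSUMED per-row non-linear term: s log s - sum_j x_j log x_j, with s = sum(row)."""
    return xlogx(sum(row)) - sum(xlogx(t) for t in row)


def midpoint_gap(x: Sequence[float], y: Sequence[float]) -> float:
    """f((x + y) / 2) - (f(x) + f(y)) / 2, which is non-negative for concave f."""
    mid = [(a + b) / 2 for a, b in zip(x, y)]
    return row_term(mid) - (row_term(x) + row_term(y)) / 2


rng = random.Random(0)


def random_row(width: int = 4) -> List[float]:
    return [rng.uniform(0.0, 5.0) for _ in range(width)]


gaps = [midpoint_gap(random_row(), random_row()) for _ in range(1000)]
smallest_gap = min(gaps)
```

A linear term does not affect this test, so the same check applies to the full log-objective under the stated assumption.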
Maximizing a log-concave objective function over the relaxed convex set is a convex optimization problem and can be solved efficiently. Below is the convex relaxation of our SDPML objective, which can be solved efficiently as summarized by our next theorem.
[TABLE]
Theorem E.14 (Solver for convex relaxation to SDPML).
Optimization problem 61 can be solved in time
Proof.
The optimization problem 61 is already in the form of the optimization problem studied for one dimension in Appendix D. To invoke the result in Appendix D, all we need is a lower bound on the minimum eigenvalue of matrix , where A is the constraint matrix when the optimization problem 61 is written in vector form (described in Appendix D). We state this constraint matrix A for the optimization problem 61 and provide a lower bound on the minimum eigenvalue of matrix in Section F.2. The number of variables in the optimization problem 61 is and the number of constraints is . In the notation of Appendix D, the value of parameters and and the running time we get for the optimization problem 61 is that stated in the theorem statement. ∎
E.5 Algorithm and Runtime Analysis
In this section we give an algorithm to find a -distribution that approximates the PML objective. Our analysis in previous sections suggests that it suffices to find a -distribution that approximates the SDPML objective, which we replaced by a convex proxy. We now present an algorithm that takes an optimal solution to this convex proxy and produces a -distribution that approximates the PML objective. Recall that $\mathbf{K}^{f}_{\phi'} \stackrel{\mathrm{def}}{=} \{ X \in \mathbb{R}^{\mathbf{b}\times(\mathbf{e}+1)} \mid (X^{T}\mathbf{1})_{[1,\mathbf{e}]} = \phi', \text{ and } (X\mathbf{1})^{T}\zeta \leq 1 \}$.
The solution returned by the rounding procedure is defined on an extended discretized -probability space , where . To derive the relation between solution and PML objective value we need to extend some definitions studied earlier. First, we define as the matrix whose rows are exactly the elements of and we call it the extended -level set matrix. Note we still use for all to refer to rows of . Further, for any -pseudodistribution q with for all (we call it extended discrete -pseudodistribution) and discrete -profile , we first define following extensions of sets and ,
[TABLE]
[TABLE]
where and denote the number of domain elements with -level set .
Further by Lemma E.7, for any extended discrete -pseudodistribution q and a discrete -profile , the following equality holds,
[TABLE]
Similarly for any , below are the natural extension of definitions of functions and ,
[TABLE]
We are now ready to analyze our rounding algorithm. First we provide some useful properties that the solution returned by our rounding procedure satisfies.
Claim E.15.
The solution returned by rounding procedure (2) above satisfies:
** 2. 2.
.
Proof.
Claim (1) follows because for all . Now note because of the adjustments made by new level sets. Further,
[TABLE]
The final inequality follows because and therefore and Claim (2) follows. ∎
The solution returned by (4) always belongs to , further values and are close to each other and we summarize this result in our next lemma.
Lemma E.16.
Any returned by the rounding procedure above satisfies:
[TABLE]
Proof.
For all integers , recall the weaker version of Stirling's approximation we used earlier,
[TABLE]
Now,
[TABLE]
and
[TABLE]
Now and for any , is a convex combination of elements in P and therefore for all . In the above expression we used the fact that each , for all (For any , and further combined with the constraint (because ) ensures this fact). Also,
[TABLE]
In the second inequality we used the fact that the solution returned by our rounding procedure always satisfies for all , and . ∎
Using Equation 62, for any , if is its corresponding extended discrete -pseudodistribution, then
[TABLE]
Lemma E.17.
The solution returned by Algorithm 4 satisfies:
[TABLE]
Proof.
For any and returned by our rounding procedure below are the explicit expressions for and :
[TABLE]
[TABLE]
We first bound the probability term:
[TABLE]
Final expression above is the probability term associated with and the equation above shows that our rounding procedure only increases the probability term and all that matters is to bound the counting term that we do next.
[TABLE]
In the derivation above we used (1) in Claim E.15 and . It remains now to lower bound the quantity :
[TABLE]
The first and second inequalities follow from Lemma E.16 and Equation 66 respectively. In the third inequality we used because is the optimal solution over the relaxed constraint set and finally invoked Lemma E.12 to relate and g. ∎
Now construct the -pseudodistribution corresponding to the solution returned by Algorithm 4 by assigning elements to -level set . Our next theorem proves that the -distribution is an approximate PML -distribution.
Theorem E.18 (Efficient and approximate PML for higher dimension).
Let be a constant and be a -sequence of -length . Let be -tuples such that for each , , , we can compute an -approximate PML -distribution in time .
Proof.
Let be the -pseudodistribution corresponding to solution returned by Algorithm 4. Set , then:
[TABLE]
The first inequality follows because , second inequality from Lemma E.3, third inequality follows because (because we constructed from ) and computes just one term in the summation over (look at the representation of as summation over from Equation 64), fourth inequality comes from Lemma E.17 and last inequality follows from Lemma E.11.
The total running time of our algorithm is the following: Given a -sequence , it takes to write down the discrete -profile , then we need to solve the convex optimization problem 61, which further takes , and our final rounding algorithm can be implemented in time (). The total running time combining all three steps is summarized in the theorem statement. ∎
To simplify the expression, for each substitute in the theorem above and in this parameter setting we achieve our best possible approximation ratio. See 3.5
E.6 Optimal sample complexity for KL divergence
In this section we study the connection between optimal estimation of KL divergence and approximate PML -distribution. We restate the theorem of [ADOS16] that we used earlier for one-dimensional PML in terms of the higher-dimensional case.
Theorem E.19 (Theorem 4 of [ADOS16]).
For a symmetric property f, suppose there is an estimator , such that for any p -distribution and observed -profile ,
[TABLE]
any -approximate PML distribution satisfies:
[TABLE]
Let p be a -distribution, meaning it is dimensional with two distributions and . Let be such that, , . We next define two conditions under which we get the optimal sample complexity for estimating KL divergence of distributions and . C1 , the estimation error satisfies . C2 .
Lemma E.20 (Theorem 5 of [Ach18]).
Suppose C1 and C2 hold. Let be a fixed (small) constant. There are constants and such that if
[TABLE]
Given independent samples from distribution and independent samples from distribution , there exists an estimator for estimating KL divergence that satisfies,
[TABLE]
Theorem E.21 ([Das],[BPA97]).
Let , and . The number of -profiles of -length equal to n is upper bounded by
[TABLE]
See 3.7
Proof.
Invoke Lemma E.20 with and E.19 with we get:
[TABLE]
In the first inequality we use Theorem E.21. ∎
Appendix F Remaining proofs for multidimensional PML
F.1 Minimum Probability
In this section we provide the proof for our first technical lemma, which states that one can assume the minimum non-zero probability of the PML distribution is by only losing a constant factor in the PML objective value. To show such a result we use an independent rounding algorithm described in the lemma below.
Claim F.1.
For any non-negative and non-zero -vector v and a -profile ,
[TABLE]
Proof.
[TABLE]
∎
For notational convenience we need the following definition of K-profile maximum likelihood -distribution.
Definition F.2.
For any set , -distribution r and profile , the -profile maximum likelihood -distribution, denoted by , is,
[TABLE]
Lemma F.3.
For any set , -distribution r, index and profile , there exists a -distribution such that,
[TABLE]
Proof.
We do independent rounding to show the existence of such a solution. For notational convenience let and for define and we fix all the probability values in these sets next.
For all define a random variable as follows:
[TABLE]
Clearly ,
[TABLE]
and in general for any integer power of random variable we have:
[TABLE]
For the remaining () with we define:
[TABLE]
Define and .
[TABLE]
[TABLE]
[TABLE]
Define p as follows:
[TABLE]
where is the concatenation of random vectors Y and Z. All random variables are mutually independent and we have:
[TABLE]
(From Equation 67,68 and the fact that is a constant).
We have a lower bound on the expected value of but this is misleading since p may not be a -distribution as could be greater than 1. Scaling norm of to 1 could significantly reduce the value of if is large. However, we show that a constant fraction of the expectation of comes from the sample space with bounded . Here is a constant and assume . Note that:
[TABLE]
The last inequality follows because Z is a constant random vector.
[TABLE]
To argue that a constant fraction of the expectation comes from the sample space with small we need a tight upper bound for:
[TABLE]
For , we first upper bound the probability term:
[TABLE]
We will use Chernoff bounds here and to apply them, we convert the random variables into Bernoulli random variables. Define ,
[TABLE]
Equivalently:
[TABLE]
Define and . For any ,
[TABLE]
Since is a sum of Bernoulli random variables, by Chernoff bounds:
[TABLE]
Note for all and further applying F.1 we get:
[TABLE]
[TABLE]
Substituting back in Equation 69 we have (for ),
[TABLE]
[TABLE]
[TABLE]
The above inequality implies existence of a with and . Define ,
[TABLE]
The above inequality further implies,
[TABLE]
[TABLE]
In the final inequality substitute and observe . Also, our rounding procedure always ensures that the minimum non-zero entry of is , which further implies a lower bound on the minimum non-zero probability value of to be . Hence is our final distribution satisfying the conditions of the lemma. ∎
See E.1
Proof.
The lemma follows by induction, with one call to Lemma F.3 per step.
Induction statement: For , let be the -distribution satisfying for all and is a -approximate PML -distribution.
Base Case: Apply Lemma F.3 by setting an empty set, and . Note that and the -distribution returned by Lemma F.3 is -approximate PML -distribution.
Induction step for : Apply Lemma F.3 by setting , and . Note that (By induction step) and the -distribution returned by Lemma F.3 further satisfies and is therefore a -approximate PML -distribution. Also by Lemma F.3 for all and . Combining everything we satisfy induction step for .
Set and by induction we get that the induction statement holds for , and the lemma statement follows. ∎
F.2 Eigenvalue bounds for Gram matrix
Here we provide a lower bound for the minimum eigenvalue of an invertible Gram matrix. First, in Lemma F.4 we provide an explicit expression for the trace of the inverse of a Gram matrix. Then, leveraging that, we obtain Corollary F.5, our desired lower bound.
Lemma F.4.
For an invertible Gram matrix of a set of vectors .
[TABLE]
where is the orthogonal projection of onto .
Proof.
Recall,
[TABLE]
Let be the matrix with columns . For each we next give an explicit formula for scalar . Let be the matrix with th column removed from matrix V. From the definition of and for all , the 'th diagonal entry of is given by:
[TABLE]
Using Theorem (3) combined with Equation (3.2) in [Rot] we get,
[TABLE]
The lemma statement follows by substituting the value of in Equation 72. ∎
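The identity can be sanity-checked numerically. The sketch below assumes Lemma F.4 asserts the standard fact that the trace of the inverse Gram matrix equals the sum, over the vectors, of the reciprocal squared distance from each vector to the span of the others. The example vectors and all helper names are illustrative assumptions. Exact rational arithmetic is used, so the two sides can be compared for equality.

```python
from fractions import Fraction

def gram(vectors):
    """Gram matrix with entries G[i][j] = <v_i, v_j>."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

def inverse(M):
    """Gauss-Jordan inverse over exact rationals; assumes M is invertible."""
    n = len(M)
    # Augment M with the identity matrix.
    A = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(M)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if A[r][c] != 0)
        A[c], A[pivot] = A[pivot], A[c]
        scale = A[c][c]
        A[c] = [x / scale for x in A[c]]
        for r in range(n):
            factor = A[r][c]
            if r != c and factor != 0:
                A[r] = [x - factor * y for x, y in zip(A[r], A[c])]
    return [row[n:] for row in A]

def residual_sq(vectors, i):
    """Squared distance from vectors[i] to the span of the remaining vectors."""
    v = vectors[i]
    others = vectors[:i] + vectors[i + 1:]
    b = [sum(x * y for x, y in zip(u, v)) for u in others]
    G_inv = inverse(gram(others))
    m = len(b)
    # Squared norm of the projection is b^T G_others^{-1} b; subtract it (Pythagoras).
    proj_sq = sum(b[r] * G_inv[r][s] * b[s] for r in range(m) for s in range(m))
    return sum(x * x for x in v) - proj_sq

# Three linearly independent vectors in R^4 (illustrative data).
vectors = [[1, 0, 1, 2], [0, 1, 1, 0], [2, 1, 0, 1]]

G_inverse = inverse(gram(vectors))
trace_inv = sum(G_inverse[i][i] for i in range(len(vectors)))
residual_sum = sum(1 / residual_sq(vectors, i) for i in range(len(vectors)))
print(trace_inv, residual_sum)
```

The check works entry by entry as well: each diagonal entry of the inverse Gram matrix equals the reciprocal of the corresponding squared residual, which is the Schur-complement form of the same statement.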
Corollary F.5.
For an invertible Gram matrix of a set of vectors .
[TABLE]
where is the orthogonal projection of onto .
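The corollary follows from Lemma F.4 by a one-line argument, sketched here under assumed notation: we write $G$ for the Gram matrix, $v_i$ for the vectors, and $v_i'$ for the projection of $v_i$ onto the span of the remaining vectors, and we take Lemma F.4 to state the trace identity used in the last equality. The only other ingredient is that $G^{-1}$ is positive definite, so its largest eigenvalue is at most its trace.

```latex
\lambda_{\min}(G)
  \;=\; \frac{1}{\lambda_{\max}(G^{-1})}
  \;\ge\; \frac{1}{\operatorname{tr}(G^{-1})}
  \;=\; \Bigg( \sum_{i} \frac{1}{\lVert v_i - v_i' \rVert_2^2} \Bigg)^{-1}
```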
F.3 Singular value lower bound for constraint matrix
Here we show a lower bound for the minimum singular value of our constraint matrix A for multidimensional PML. First, in Lemma F.6, we give a lower bound on the norm of the orthogonal projection of each column onto the span of the remaining columns for the -level set matrix (defined in Section E.2). This result, combined with Corollary F.5, gives a lower bound on the minimum singular value for . Then, in Lemma F.8, we lower bound the minimum singular value of A in terms of the minimum singular value of to achieve our desired lower bound.
Now, recall that P is the set of all vectors where for some , where for each is such that and . Further, the -level set matrix is defined as the matrix whose rows are exactly the elements of P.
Lemma F.6.
For and , if is its 'th column, then the following inequality holds,
[TABLE]
where is the orthogonal projection of onto .
Proof.
For each index , there are multiple blocks each of size and for each th block and and ,
[TABLE]
for each scalar and the number of blocks satisfying above equalities is equal to .
Note that is the same as , and if is the orthogonal projection of onto , then:
[TABLE]
The above result combined with number of such blocks gives:
[TABLE]
∎
Corollary F.7.
The minimum eigenvalue of matrix is at least .
Now let us consider our constraint matrix for multidimensional PML (footnote 11: our matrix A is sparse, and a matrix-vector product with it can be computed in time ),
[TABLE]
Lemma F.8.
The eigenvalues of matrix are at least .
Proof.
Direct calculation shows that if , are the e- and b-dimensional all-ones vectors, respectively, and is the e-dimensional identity matrix, then for all and we have
[TABLE]
Consequently is an eigenvector of with eigenvalue if and only if
[TABLE]
Now if , then we see that is an eigenvector if and only if , in which case the eigenvalue is b. On the other hand, if , then we see that is an eigenvector of eigenvalue if and only if
[TABLE]
When this happens we either have or in the case of the following holds,
[TABLE]
To simplify the expression above, let the following be the SVD for ,
[TABLE]
where are singular values and . In this notation the eigenvalue decomposition of matrix is equal to:
[TABLE]
Further, we can write a closed-form expression for in terms of the singular values and left singular vectors of matrix .
[TABLE]
We use to denote the expression on the left hand side,
[TABLE]
We know that because is PSD. For , and is strictly increasing in . Further Equation 73 has a unique solution (if a solution exists) in the interval .
To give a lower bound of on , it suffices to find a , such that and we get . For , we have , which, combined with , gives and therefore . Combining all cases together, we have that . Combined with Corollary F.7, we have our result.
∎
References
- [Ach18] Jayadev Acharya. Profile maximum likelihood is optimal for estimating KL divergence. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1400–1404, 2018.
- [ADM+10] J. Acharya, H. Das, H. Mohimani, A. Orlitsky, and S. Pan. Exact calculation of pattern probabilities. In 2010 IEEE International Symposium on Information Theory, pages 1498–1502, June 2010.
- [ADOS16] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for optimal distribution property estimation. CoRR, abs/1611.02960, 2016.
- [AOST14] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. The complexity of estimating Rényi entropy. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, 2014.
- [BPA97] D. P. Bhatia, M. A. Prasad, and D. Arora. Asymptotic results for the number of multidimensional partitions of an integer and directed compact lattice animals. Journal of Physics A: Mathematical and General, 30:2281–2285, April 1997.
- [BZLV16] Y. Bu, S. Zou, Y. Liang, and V. V. Veeravalli. Estimation of KL divergence between large-alphabet distributions. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 1118–1122, July 2016.
- [Das] Hirakendu Das. Competitive tests and estimators for properties of distributions. Ph.D. dissertation, UCSD, 2012. https://pqdtopen.proquest.com/doc/1009080587.html?FMT=ABS
- [ET76] Bradley Efron and Ronald Thisted. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3):435–447, 1976.
