Efficient Profile Maximum Likelihood for Universal Symmetric Property Estimation

Moses Charikar; Kirankumar Shiragur; Aaron Sidford

arXiv:1905.08448·cs.DS·May 22, 2019

TL;DR

This paper introduces a nearly linear time algorithm for approximating the profile maximum likelihood (PML) distribution, enabling efficient universal estimation of symmetric properties of distributions with broad applications.

Contribution

It provides the first polynomial-time algorithm for approximate PML computation, facilitating universal symmetric property estimation in nearly linear time.

Findings

01

Computes an approximate PML distribution up to a multiplicative error of $\exp(n^{2/3}\mathrm{poly}\log(n))$ in time nearly linear in $n$.

02

Yields a nearly linear time universal plug-in estimator for all symmetric functions up to accuracy $\epsilon=\Omega(n^{-0.166})$.

03

Extends to polynomial-time algorithms for a $d$-dimensional generalization of PML (for constant $d$), allowing universal plug-in estimation of symmetric relationships between distributions.

Abstract

Estimating symmetric properties of a distribution, e.g. support size, coverage, entropy, distance to uniformity, are among the most fundamental problems in algorithmic statistics. While each of these properties have been studied extensively and separate optimal estimators are known for each, in striking recent work, Acharya et al. 2016 showed that there is a single estimator that is competitive for all symmetric properties. This work proved that computing the distribution that approximately maximizes profile likelihood (PML), i.e. the probability of observed frequency of frequencies, and returning the value of the property on this distribution is sample competitive with respect to a broad class of estimators of symmetric properties. Further, they showed that even computing an approximation of the PML suffices to achieve such a universal plug-in estimator. Unfortunately, prior to…

Equations

$$\mathbb{P}(\textbf{p},y^{n})\stackrel{\mathrm{def}}{=}\prod_{x\in\mathcal{D}}\textbf{p}_{x}^{\textbf{f}(y^{n},x)}$$

$$\mathbb{P}(\textbf{p},\psi)\stackrel{\mathrm{def}}{=}\sum_{\{y^{n}\in\mathcal{D}^{n}~|~\Psi(y^{n})=\psi\}}\mathbb{P}(\textbf{p},y^{n})=\binom{n}{\psi}\prod_{x\in\mathcal{D}}\textbf{p}_{x}^{\psi_{x}},$$

$$\mathbb{P}(\textbf{p},\phi)\stackrel{\mathrm{def}}{=}\sum_{\{y^{n}\in\mathcal{D}^{n}~|~\Phi(y^{n})=\phi\}}\mathbb{P}(\textbf{p},y^{n})$$

$$C_{\phi}\stackrel{\mathrm{def}}{=}\frac{n!}{\prod_{j=1}^{|\textbf{D}|}(d_{j}!)^{\phi_{j}}},\quad\text{where } n=\sum_{j}d_{j}\cdot\phi_{j}$$

$$\mathbb{P}(\textbf{p},\phi)=\sum_{\{y^{n}\in\mathcal{D}^{n}~|~\Phi(y^{n})=\phi\}}\mathbb{P}(\textbf{p},y^{n})=\sum_{\{\psi\in\Psi^{n}~|~\Phi(\psi)=\phi\}}\mathbb{P}(\textbf{p},\psi)=C_{\phi}\sum_{\{\psi\in\Psi^{n}~|~\Phi(\psi)=\phi\}}\prod_{x\in\mathcal{D}}\textbf{p}_{x}^{\psi_{x}}$$

$$\textbf{p}_{pml,\phi}\in\operatorname*{arg\,max}_{\textbf{p}\in\Delta^{\mathcal{D}}}\mathbb{P}(\textbf{p},\phi)$$

$$\mathbb{P}(\textbf{p}^{\beta}_{pml,\phi},\phi)\geq\beta\cdot\mathbb{P}(\textbf{p}_{pml,\phi},\phi)$$

$$\mathbb{P}(\textbf{p}_{approx},\phi)\geq\exp\left(-O\left(\epsilon_{1}n+\epsilon_{2}n\log n+\frac{\log^{3}n}{\epsilon_{1}\epsilon_{2}}\right)\right)\mathbb{P}(\textbf{p}_{pml,\phi},\phi)$$

$$\mathbb{P}(\textbf{p},\textbf{y}^{\textbf{n}})\stackrel{\mathrm{def}}{=}\prod_{k=1}^{d}\prod_{x\in\mathcal{D}}(\textbf{p}_{x}(k))^{\textbf{f}(\textbf{y}^{\textbf{n}(k)},x)}.$$

$$\mathbb{P}(\textbf{p},\phi)\stackrel{\mathrm{def}}{=}\sum_{\{\textbf{y}^{\textbf{n}}\in\mathcal{D}^{\textbf{n}}~|~\Phi(\textbf{y}^{\textbf{n}})=\phi\}}\mathbb{P}(\textbf{p},\textbf{y}^{\textbf{n}}).$$

$$\textbf{p}_{pml,\phi}\in\operatorname*{arg\,max}_{\textbf{p}\in\Delta^{\mathcal{D},d}}\mathbb{P}(\textbf{p},\phi)$$

$$\lfloor c\rfloor_{S}\stackrel{\mathrm{def}}{=}\max_{s\in S:\,s\leq c}s\quad\text{and}\quad\lceil c\rceil_{S}\stackrel{\mathrm{def}}{=}\min_{s\in S:\,s\geq c}s$$

$$\textbf{q}_{x}\stackrel{\mathrm{def}}{=}\lfloor\textbf{p}_{x}\rfloor_{\textbf{P}}\quad\forall x\in\mathcal{D}$$

$$\mathbb{P}(\textbf{p},\phi)\geq\mathbb{P}(\textbf{q},\phi)\geq\exp(-\epsilon_{1}n)\,\mathbb{P}(\textbf{p},\phi)$$

$$\mathbb{P}(\textbf{q},y^{n})\geq\exp(-\epsilon_{1}n)\,\mathbb{P}(\textbf{p},y^{n})$$

$$\mathbb{P}(\textbf{q},\phi)=\sum_{\{y^{n}\in\mathcal{D}^{n}:\,\Phi(y^{n})=\phi\}}\mathbb{P}(\textbf{q},y^{n})\geq\sum_{\{y^{n}\in\mathcal{D}^{n}:\,\Phi(y^{n})=\phi\}}\exp(-\epsilon_{1}n)\,\mathbb{P}(\textbf{p},y^{n})=\exp(-\epsilon_{1}n)\,\mathbb{P}(\textbf{p},\phi)$$

$$\mathbb{P}(\textbf{p},\phi^{\prime})\stackrel{\mathrm{def}}{=}\sum_{\{y^{n^{\prime}}\in\mathcal{D}^{n^{\prime}}~|~\Phi(y^{n^{\prime}})=\phi^{\prime}\}}\mathbb{P}(\textbf{p},y^{n^{\prime}})$$

$$\exp(-7\epsilon_{2}n\log n)\,\mathbb{P}(\textbf{p},\phi)\leq\mathbb{P}(\textbf{p},\phi^{\prime})\leq\exp(7\epsilon_{2}n\log n)\,\mathbb{P}(\textbf{p},\phi)$$

$$\exp(-(\epsilon_{1}n+7\epsilon_{2}n\log n))\,\mathbb{P}(\textbf{p},\phi)\leq\mathbb{P}(\textbf{q},\phi^{\prime})\leq\exp(\epsilon_{1}n+7\epsilon_{2}n\log n)\,\mathbb{P}(\textbf{p},\phi)$$

$$\textbf{q}_{dpml,\phi^{\prime}}\stackrel{\mathrm{def}}{=}\operatorname*{arg\,max}_{\textbf{q}\in\Delta^{\mathcal{D}}_{discrete}}\mathbb{P}(\textbf{q},\phi^{\prime}),$$

$$\mathbb{P}(\textbf{q}_{dpml,\phi^{\prime}},\phi^{\prime})\geq\exp(-(\epsilon_{1}n+7\epsilon_{2}n\log n))\,\mathbb{P}(\textbf{p}_{pml,\phi},\phi)$$

$$\mathbb{P}(\textbf{q},\phi^{\prime})=C_{\phi^{\prime}}\sum_{\textbf{X}\in K_{\textbf{q},\phi^{\prime}}}\prod_{i=1}^{b_{1}}\left(\zeta_{i}^{(\textbf{X}\textbf{m})_{i}}\frac{(\textbf{X}\textbf{1})_{i}!}{\prod_{j=0}^{b_{2}}\textbf{X}_{ij}!}\right)$$

$$\mathbb{P}(\textbf{q},\phi^{\prime})=C_{\phi^{\prime}}\sum_{\{\psi~|~\Phi(\psi)=\phi^{\prime}\}}\prod_{x\in X}\textbf{q}_{x}^{\psi_{x}}.$$

$$\prod_{x\in X}\textbf{q}_{x}^{\psi_{x}}=\prod_{i=1}^{b_{1}}\prod_{j=0}^{b_{2}}\zeta_{i}^{\textbf{X}_{ij}\textbf{m}_{j}}=\prod_{i=1}^{b_{1}}\zeta_{i}^{(\textbf{X}\textbf{m})_{i}}$$

$$\frac{(\textbf{X}\textbf{1})_{i}!}{\prod_{j=0}^{b_{2}}\textbf{X}_{ij}!}.$$

$$|S_{\textbf{X}}|=\prod_{i=1}^{b_{1}}\frac{(\textbf{X}\textbf{1})_{i}!}{\prod_{j=0}^{b_{2}}\textbf{X}_{ij}!}.$$

$$\mathbb{P}(\textbf{q},\phi^{\prime})=C_{\phi^{\prime}}\sum_{\textbf{X}\in K_{\textbf{q},\phi^{\prime}}}\sum_{\psi\in S_{\textbf{X}}}\prod_{i=1}^{b_{1}}\zeta_{i}^{(\textbf{X}\textbf{m})_{i}}=C_{\phi^{\prime}}\sum_{\textbf{X}\in K_{\textbf{q},\phi^{\prime}}}|S_{\textbf{X}}|\prod_{i=1}^{b_{1}}\zeta_{i}^{(\textbf{X}\textbf{m})_{i}}=C_{\phi^{\prime}}\sum_{\textbf{X}\in K_{\textbf{q},\phi^{\prime}}}\prod_{i=1}^{b_{1}}\left(\zeta_{i}^{(\textbf{X}\textbf{m})_{i}}\frac{(\textbf{X}\textbf{1})_{i}!}{\prod_{j=0}^{b_{2}}\textbf{X}_{ij}!}\right)$$

$$\mathbb{P}(\textbf{q},\phi^{\prime})\leq C_{\phi^{\prime}}\sum_{\textbf{X}\in K_{\phi^{\prime}}}\prod_{i=1}^{b_{1}}\left(\zeta_{i}^{(\textbf{X}\textbf{m})_{i}}\frac{(\textbf{X}\textbf{1})_{i}!}{\prod_{j=0}^{b_{2}}\textbf{X}_{ij}!}\right)$$

Full text

Efficient Profile Maximum Likelihood for Universal Symmetric Property Estimation

Moses Charikar

Stanford University

Supported by NSF grant CCF-1617577, a Simons Investigator Award and a Google Faculty Research Award.

Kirankumar Shiragur

Stanford University

Aaron Sidford

Stanford University

Supported by NSF CAREER Award CCF-1844855.

Abstract

Estimating symmetric properties of a distribution, e.g. support size, coverage, entropy, distance to uniformity, is among the most fundamental problems in algorithmic statistics. While each of these properties has been studied extensively and separate optimal estimators are known for each, in striking recent work, Acharya et al. [ADOS16] showed that there is a single estimator that is competitive for all symmetric properties. This work proved that computing the distribution that approximately maximizes profile likelihood (PML), i.e. the probability of the observed frequency of frequencies, and returning the value of the property on this distribution is sample competitive with respect to a broad class of estimators of symmetric properties. Further, they showed that even computing an approximation of the PML suffices to achieve such a universal plug-in estimator. Unfortunately, prior to this work there was no known polynomial time algorithm to compute an approximate PML and it was open to obtain a polynomial time universal plug-in estimator through the use of approximate PML.

In this paper we provide a polynomial time algorithm (in number of samples) that, given $n$ samples from a distribution, computes an approximate PML distribution up to a multiplicative error of $\exp(n^{2/3}\mathrm{poly}\log(n))$ in time nearly linear in $n$. Generalizing work of [ADOS16] on the utility of approximate PML we show that our algorithm provides a nearly linear time universal plug-in estimator for all symmetric functions up to accuracy $\epsilon=\Omega(n^{-0.166})$. Further, we show how to extend our work to provide efficient polynomial-time algorithms for computing a $d$-dimensional generalization of PML (for constant $d$) that allows for universal plug-in estimation of symmetric relationships between distributions.

1 Introduction

Estimating a symmetric property of a distribution given a small number of samples is a fundamental problem in algorithmic statistics. Formally, a property is symmetric if it is invariant to permutation of the labels, i.e. it is a function only of the multiset of probabilities and does not depend on the symbol labels. For many natural properties, including support size, coverage, distance from uniform and entropy, there has been extensive work that has led to designing efficient estimators both with respect to computational time and sample complexity [HJWW17, HJM17, AOST14, RVZ17, ZVV+16, WY16b, RRSS07, WY15, OSW16, VV11b, WY16a, JVHW15, JHW16, VV11a]. In many cases these estimators are tailored to the particular property of interest. This paper is motivated by the goals of unifying the development of efficient estimators of symmetric properties of distributions and designing a single efficient universal algorithm for estimating arbitrary symmetric properties of distributions.

Our approach stems from the observation that a sufficient statistic for the problem of estimating a symmetric property from a sequence of samples is the profile of the sequence, i.e. the multiset of the frequencies (i.e. multiplicities) of symbols in the sequence, e.g. the profile of $ababc$ is $\{2,2,1\}$. Profiles are also called histograms of histograms, histogram order statistics, or fingerprints. Our approach to obtaining a universal estimator is based on the elegant problem of profile maximum likelihood (PML) introduced by Orlitsky et al. [OSS+04]: Given a sequence of $n$ samples, find the distribution that maximizes the probability of the observed profile. This problem has been studied in several papers since, applying heuristic approaches such as Bethe approximation [Von12, Von14], the EM algorithm [OSS+04], and some algebraic approaches [ADM+10] to calculate the PML. Recently Pavlichin, Jiao and Weissman [PJW17] introduced an efficient dynamic programming heuristic for PML that can be computed in linear time. While there are no approximation guarantees for the solution they produce, their approach was the initial impetus for our work.

A recent paper of Acharya et al. [ADOS16] showed that a distribution that optimizes the PML objective can be used to obtain a plug-in estimator for various symmetric properties of distributions. In fact it suffices to compute a distribution that approximates the PML objective to within a factor $\exp(n^{1-\delta})$ for constant $\delta>0$ where $n$ is the size of the sample. Unfortunately, no polynomial time computable PML estimator with such an approximation guarantee was known previously. In this paper, we provide an estimator with an approximation factor of $\exp(n^{2/3}\mathrm{poly}\log(n))$ , leading to a universal estimator for a host of symmetric properties. Moreover, our estimator is computable in time nearly linear in $n$ . Our techniques extend to computing a $d$ -dimensional generalization of PML, where we have access to samples from multiple distributions on a common domain. This allows for universal plug-in estimation of various symmetric relationships between multiple distributions.

1.1 Overview of approach

The bulk of our work is dedicated to finding a distribution that approximately maximizes the PML objective within an $\exp(n^{1-\delta})$ factor for a constant $\delta>0$. We call such a distribution an approximate PML distribution. Given a sequence $y^{n}$ and its corresponding profile $\phi$, the PML optimization problem is a maximization problem over all distributions $\textbf{p}\in\Delta^{\mathcal{D}}$. The objective function of the PML optimization problem is the probability of observing profile $\phi$ with respect to a distribution $\textbf{p}\in\Delta^{\mathcal{D}}$, which in turn is equal to the summation of probabilities of sequences (with respect to p) that have $\phi$ as their corresponding profile. The distribution that maximizes this objective is called a profile maximum likelihood (PML) distribution. (See Section 2 for formal definitions.)

To efficiently compute an approximate PML distribution, we first restrict ourselves to maximizing the PML objective for a discretized version of the profile over a class of distributions we call discrete pseudo-distributions (See Section 4). Here, the probability values of the distribution are restricted to belong to a small set P of permissible values (See Section 4.1), and the frequencies in the profile are similarly restricted to belong to a small set M (See Section 4.2). We call the resulting maximizing distribution a discrete PML (DPML) distribution and the corresponding optimization problem DPML optimization (See Section 4.3).

There are two main features of the DPML optimization problem. Firstly, the maximizing distribution DPML is an approximate PML distribution with an approximation guarantee that we can control (as a function of the sizes of P and M). Secondly, the DPML optimization problem has a simpler equivalent formulation, in which sequences that have the same associated probability value with respect to a discrete pseudo-distribution are combined together into subgroups and the whole summation is written as a summation over a small number of subgroups. The number of these subgroups is a function of the sizes of P and M which we control (See Section 4.3 for both these results).

As an illustration of DPML, consider the profile $\{2,1,1\}$ and a probability distribution on 5 elements: two with a value of $\frac{1}{4}$ and three with a value of $\frac{1}{6}$ . Note that the probability values come from the set $\textbf{P}=\{1/4,1/6\}$ . One way to get the profile $\{2,1,1\}$ is to have an element of probability $1/4$ appear twice and two elements of probability $1/6$ appear once. There are ${2\choose 1}{3\choose 2}$ choices of such elements and for each such choice, $\frac{4!}{2!\cdot 1!\cdot 1!}$ sequences of length 4 with these elements. The probability of any such sequence is the same: $\left(\frac{1}{4}\right)^{2}\left(\frac{1}{6}\right)\left(\frac{1}{6}\right)$ . We consider the set of all these sequences as one subgroup. Different subgroups are identified by specifying, for each permissible probability value, the frequencies with which elements of that probability value are seen in the sample. The DPML objective then sums up the contributions of each such subgroup.
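The illustration above can be checked numerically. The following is a minimal sketch, not the paper's algorithm: it computes the probability of the profile $\{2,1,1\}$ for the 5-element example once by brute force over all $5^{4}$ sequences and once by grouping types into subgroups keyed by the frequencies seen for each probability value. The names `brute_force`, `by_subgroups`, `PROBS` and `PROFILE` are our own illustrative choices.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations, product
from math import factorial

PROBS = [Fraction(1, 4)] * 2 + [Fraction(1, 6)] * 3  # the 5-element example
PROFILE = (2, 1, 1)  # nonzero frequencies, sorted in decreasing order

def brute_force(probs, target):
    """Sum the probability of every length-n sequence whose profile is target."""
    n = sum(target)
    total = Fraction(0)
    for seq in product(range(len(probs)), repeat=n):
        if tuple(sorted(Counter(seq).values(), reverse=True)) == target:
            prob = Fraction(1)
            for x in seq:
                prob *= probs[x]
            total += prob
    return total

def by_subgroups(probs, target):
    """Group types by the frequencies observed for each probability value."""
    n = sum(target)
    c_phi = factorial(n)
    for d in target:
        c_phi //= factorial(d)  # sequences per type: n! / prod_x psi_x!
    padded = tuple(target) + (0,) * (len(probs) - len(target))
    groups = {}  # subgroup key -> [number of types, probability of one sequence]
    for psi in set(permutations(padded)):  # every type with the target profile
        key = tuple(sorted(
            (p, tuple(sorted(f for f, q in zip(psi, probs) if q == p)))
            for p in set(probs)
        ))
        weight = Fraction(1)
        for f, q in zip(psi, probs):
            weight *= q ** f
        groups.setdefault(key, [0, weight])[0] += 1
    total = c_phi * sum(count * weight for count, weight in groups.values())
    return total, groups

total, groups = by_subgroups(PROBS, PROFILE)
print(total == brute_force(PROBS, PROFILE), len(groups))
```

In this sketch the subgroup described in the text (one element of probability $1/4$ seen twice, two elements of probability $1/6$ seen once) appears as a single dictionary entry with 6 types, each contributing 12 sequences of probability $1/576$.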

Reformulating the problem in terms of summation over a small number of subgroups is crucial to our approach. It allows us to focus on the subgroup that gives the largest contribution to the objective instead of summing over all the subgroups. We call the optimization problem that optimizes the contribution of a single subgroup (instead of summing over all terms) single discrete PML (SDPML). We show that the SDPML optimization problem has a convex relaxation and can be solved efficiently. Since there is only a small number of subgroups in the summation, the discrete pseudo-distribution that optimizes over just one subgroup has an objective function value that is lower by at most a factor equal to the number of subgroups. Hence the maximizing discrete pseudo-distribution for this new objective function approximately optimizes the earlier objectives (PML and DPML) with bounded loss (See Section 4.3).

Ultimately, our algorithm first solves this convex relaxation to the SDPML optimization problem to obtain a fractional solution (in some representation space of these discrete pseudo-distributions) (See Section 4.4). Then we apply a rounding algorithm that finds a distribution which maintains the approximation guarantee needed to obtain an approximate PML distribution (See Section 4.5).
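To spell out the counting step behind this loss bound (the symbols $t_{i}$ and $k$ are our own shorthand, not the paper's notation): if the objective is a sum of $k$ nonnegative subgroup contributions $t_{1},\dots,t_{k}$, then

$$\max_{i\in[k]}t_{i}\;\leq\;\sum_{i=1}^{k}t_{i}\;\leq\;k\cdot\max_{i\in[k]}t_{i},$$

so replacing the sum by its largest term changes the objective by at most a multiplicative factor of $k$.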

1.2 Related work

As discussed in the introduction, PML was introduced by Orlitsky et al. [OSS+04] in 2004. Many heuristic approaches such as Bethe approximation [Von12, Von14], the EM algorithm [OSS+04], algebraic approaches [ADM+10] and a dynamic programming approach [PJW17] have been proposed to calculate the approximate PML.

The connection between PML and universal estimators was first studied in [ADOS16]. There have been several other approaches for designing universal estimators for symmetric properties. Valiant and Valiant [VV11b] adopted and rigorously analyzed a linear programming based approach for universal estimators proposed by [ET76] and showed that it is sample complexity optimal in the constant error regime for estimating certain symmetric properties (namely, entropy, support size, support coverage, and distance to uniformity). Recent work of Han, Jiao and Weissman [HJW18] applied a local moment matching based approach in designing efficient universal symmetric property estimators for a single distribution. [HJW18] achieves the optimal sample complexity in all error regimes for estimating the power sum function, support and entropy.

Estimating symmetric properties of a distribution is a rich field and extensive work has been dedicated to studying the optimal sample complexity of estimating each of these properties. Optimal sample complexities for estimating many symmetric properties were resolved in the past few years, including all the properties studied here: support [VV11b, WY15], support coverage [OSW16, ZVV+16], entropy [VV11b, WY16a] and distance from uniform [VV11a, JHW16].

Symmetric properties for distribution pairs have been studied in the literature as well. For instance, the optimal sample complexity for estimation of KL divergence between two distributions was given by [BZLV16, HJW16].

1.3 Paper organization

The rest of the paper is structured as follows. In Section 2, we provide definitions and notation. In Section 3, we state the main results of the paper. Our main contribution is an algorithm that efficiently computes an approximate PML distribution, and in Section 4 we prove this result. In that section, we also present an almost linear time algorithm based on cutting plane methods for solving our convex relaxation to SDPML; however we defer all of its analysis to the appendix. Finally, in Section 5, we provide the connection between an approximate PML distribution and a universal estimator for symmetric property estimation. The proof presented in [ADOS16] showed this connection for an $\exp(\sqrt{n})$-approximate PML estimator and we show it for an $\exp(n^{2/3})$-approximate PML estimator. However it is easy to see the proof presented in [ADOS16] works for any $\exp(n^{1-\delta})$-approximate PML estimator for constant $\delta>0$. In Appendix E we show that the techniques presented here generalize to a higher dimensional version of PML.

2 Preliminaries

Let $[a,b]$ and $[a,b]_{\mathbb{R}}$ denote the interval of integers and reals $\geq a$ and $\leq b$ respectively and let $[a]\stackrel{{\scriptstyle\mathrm{def}}}{{=}}[1,a]$ . Let $\Delta^{\mathcal{D}}\subset[0,1]_{\mathbb{R}}^{\mathcal{D}}$ be the set of all distributions supported on domain $\mathcal{D}$ and let $N$ be the size of the domain. We use the word distribution to refer to discrete distributions. Throughout this paper we assume that we receive a sequence of $n$ independent samples from an underlying distribution $\textbf{p}\in\Delta^{\mathcal{D}}$ . Let $\mathcal{D}^{n}$ be the set of all length $n$ sequences and $y^{n}\in\mathcal{D}^{n}$ be one such sequence with $y^{n}_{i}$ denoting its $i$ th element. The probability of observing sequence $y^{n}$ is:

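$$\mathbb{P}(\textbf{p},y^{n})\stackrel{\mathrm{def}}{=}\prod_{x\in\mathcal{D}}\textbf{p}_{x}^{\textbf{f}(y^{n},x)}$$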

where $\textbf{f}(y^{n},x)=|\{i\in[n]~{}|~{}y^{n}_{i}=x\}|$ is the frequency (multiplicity) of symbol $x$ in sequence $y^{n}$ and $\textbf{p}_{x}$ is the probability of domain element $x\in\mathcal{D}$ .
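As a small illustration of this definition, the sketch below evaluates the probability of a sequence as the product of $\textbf{p}_{x}$ raised to the frequency of $x$. The function name `sequence_probability` and the three-symbol distribution are hypothetical choices made only for this example.

```python
from collections import Counter
from fractions import Fraction

def sequence_probability(p, seq):
    """P(p, y^n) = prod_x p_x^{f(y^n, x)} for n independent samples."""
    freq = Counter(seq)  # f(y^n, x) for each observed symbol x
    prob = Fraction(1)
    for x, f in freq.items():
        prob *= Fraction(p.get(x, 0)) ** f  # symbols outside the support get 0
    return prob

# Hypothetical distribution on the domain {a, b, c}.
p = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
print(sequence_probability(p, "ababc"))  # 1/216
```

Exact rationals are used here only to keep the arithmetic transparent; floating point would work equally well for a numerical check.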

We extend and use the definition for $\mathbb{P}(\textbf{v},y^{n})$ to any vector $\textbf{v}\in\mathbb{R}^{\mathcal{D}}$ by letting $\mathbb{P}(\textbf{v},y^{n})\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\prod_{x\in\mathcal{D}}\textbf{v}_{x}^{\textbf{f}(y^{n},x)}$ . Further, for functions of probability distributions p, we assume those expressions are also defined for any vector $\textbf{v}\in\mathbb{R}^{\mathcal{D}}$ just by replacing $\textbf{p}_{x}$ by $\textbf{v}_{x}$ everywhere.

For any given sequence one could define its type (histogram) and profile (histogram of a histogram or fingerprint) that are sufficient statistics for symmetric property estimation. The histogram of histogram perspective comes from viewing type as a histogram and profile as histogram of type.

Definition 2.1 (Type).

A type $\psi=\Psi(y^{n})\in\mathbb{Z}_{+}^{\mathcal{D}}$ of a sequence $y^{n}\in\mathcal{D}^{n}$ is the vector of frequencies $\psi_{x}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\textbf{f}(y^{n},x)$ of domain elements in $y^{n}$ . We call $n$ the length of type $\psi$ and use $\Psi^{n}$ to represent the set of all types of length $n$ .

To simplify notation we use just $\psi$ to denote type and the associated sequence will be clear from context. For a distribution $\textbf{p}\in\Delta^{\mathcal{D}}$ , the probability of a type $\psi\in\Psi^{n}$ is:

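$$\mathbb{P}(\textbf{p},\psi)\stackrel{\mathrm{def}}{=}\sum_{\{y^{n}\in\mathcal{D}^{n}~|~\Psi(y^{n})=\psi\}}\mathbb{P}(\textbf{p},y^{n})=\binom{n}{\psi}\prod_{x\in\mathcal{D}}\textbf{p}_{x}^{\psi_{x}},$$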

where $\binom{n}{\psi}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\frac{n!}{\prod_{x\in\mathcal{D}}\psi_{x}!}$ and $0!\stackrel{{\scriptstyle\mathrm{def}}}{{=}}1$ .
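The probability of a type is the multinomial coefficient $\binom{n}{\psi}$ times the probability of any one sequence with that type. A minimal sketch, assuming a hypothetical three-symbol distribution and our own helper names `multinomial` and `type_probability`:

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def multinomial(n, counts):
    """n! / prod_x psi_x!, with 0! = 1."""
    out = factorial(n)
    for c in counts:
        out //= factorial(c)
    return out

def type_probability(p, psi):
    """P(p, psi) = binom(n, psi) * prod_x p_x^{psi_x}."""
    n = sum(psi.values())
    prob = Fraction(multinomial(n, psi.values()))
    for x, c in psi.items():
        prob *= Fraction(p[x]) ** c
    return prob

# Hypothetical distribution; the type of "ababc" is {a: 2, b: 2, c: 1}.
p = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
print(type_probability(p, Counter("ababc")))  # 5/36
```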

Definition 2.2 (Profile).

For any sequence $y^{n}\in\mathcal{D}^{n}$ , let $\textbf{D}=\{\textbf{f}(y^{n},x)\}_{x\in\mathcal{D}}$ be the set of all its distinct frequencies and $d_{1},d_{2},\dots,d_{|\textbf{D}|}$ be elements of the set D. The profile of a sequence $y^{n}\in\mathcal{D}^{n}$ denoted $\phi=\Phi(y^{n})\in\mathbb{Z}_{+}^{|\textbf{D}|}$ is $\phi\stackrel{{\scriptstyle\mathrm{def}}}{{=}}(\phi_{j})_{j=1\dots|\textbf{D}|}$ where $\phi_{j}=\phi_{j}(y^{n})\stackrel{{\scriptstyle\mathrm{def}}}{{=}}|\{x\in\mathcal{D}~{}|~{}\textbf{f}(y^{n},x)=d_{j}\}|$ is the number of domain elements with frequency $d_{j}$ in $y^{n}$ . We call $n$ the length of profile $\phi$ and as a function of profile $\phi$ , $n=\sum_{j}d_{j}\cdot\phi_{j}$ . We let $\Phi^{n}$ denote the set of all profiles of length $n$ . (Footnote 1: The number of unseen domain elements is not part of the profile, because the domain size is unknown.)
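The definition can be read as "count the symbols, then count the counts". The sketch below does exactly that with two passes of `collections.Counter`; the function names `profile` and `as_tuples` are our own, and unseen symbols are ignored, in line with the footnote.

```python
from collections import Counter

def profile(seq):
    """Return the profile as a dict {frequency d_j: count phi_j}."""
    freq = Counter(seq)                  # type: symbol -> frequency f(y^n, x)
    return dict(Counter(freq.values()))  # frequency -> number of symbols with it

def as_tuples(phi):
    """(frequency, count) tuples, sorted by frequency."""
    return sorted(phi.items())

phi = profile("ababc")
print(as_tuples(phi))                      # [(1, 1), (2, 2)]
print(sum(d * c for d, c in phi.items()))  # 5, the length n
print(profile("xyxyz") == phi)             # True: relabeling symbols changes nothing
```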

For any distribution $\textbf{p}\in\Delta^{\mathcal{D}}$ , the probability of a profile $\phi\in\Phi^{n}$ is defined as:

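$$\mathbb{P}(\textbf{p},\phi)\stackrel{\mathrm{def}}{=}\sum_{\{y^{n}\in\mathcal{D}^{n}~|~\Phi(y^{n})=\phi\}}\mathbb{P}(\textbf{p},y^{n})$$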

One can also define the profile of a type $\psi$ . We overload notation and use $\phi=\Phi(\psi)$ to denote the profile associated with type $\psi$ and $\phi_{j}=\phi_{j}(\psi)\stackrel{{\scriptstyle\mathrm{def}}}{{=}}|\{x\in\mathcal{D}~{}|~{}\psi_{x}=d_{j}\}|$ .

For future use, we also write the probability of a profile $\phi\in\Phi^{n}$ in terms of its types. All types $\psi$ with $\Phi(\psi)=\phi$ have the same $\binom{n}{\psi}$ value and we use notation $C_{\phi}$ to represent this quantity. The explicit expression for $C_{\phi}$ is written below:

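$$C_{\phi}\stackrel{\mathrm{def}}{=}\frac{n!}{\prod_{j=1}^{|\textbf{D}|}(d_{j}!)^{\phi_{j}}},\quad\text{where } n=\sum_{j}d_{j}\cdot\phi_{j}$$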
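As a worked instance (our own, using the profile $\{2,1,1\}$ from the illustration in Section 1.1 and the arbitrary ordering $d_{1}=2$, $d_{2}=1$), we have $\phi_{1}=1$, $\phi_{2}=2$ and $n=2\cdot 1+1\cdot 2=4$, so

$$C_{\phi}=\frac{n!}{\prod_{j}(d_{j}!)^{\phi_{j}}}=\frac{4!}{(2!)^{1}\cdot(1!)^{2}}=12,$$

which agrees with the count $\frac{4!}{2!\cdot 1!\cdot 1!}$ of sequences per choice of elements used there.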

We next derive an expression for the probability of a profile in terms of its types:

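$$\mathbb{P}(\textbf{p},\phi)=\sum_{\{y^{n}\in\mathcal{D}^{n}~|~\Phi(y^{n})=\phi\}}\mathbb{P}(\textbf{p},y^{n})=\sum_{\{\psi\in\Psi^{n}~|~\Phi(\psi)=\phi\}}\mathbb{P}(\textbf{p},\psi)=C_{\phi}\sum_{\{\psi\in\Psi^{n}~|~\Phi(\psi)=\phi\}}\prod_{x\in\mathcal{D}}\textbf{p}_{x}^{\psi_{x}}$$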

The distribution which maximizes the probability of a profile $\phi$ is called a profile maximum likelihood distribution.

Definition 2.3 (Profile maximum likelihood).

For any profile $\phi\in\Phi^{n}$ , a profile maximum likelihood (PML) distribution $\textbf{p}_{pml,\phi}\in\Delta^{\mathcal{D}}$ is:

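$$\textbf{p}_{pml,\phi}\in\operatorname*{arg\,max}_{\textbf{p}\in\Delta^{\mathcal{D}}}\mathbb{P}(\textbf{p},\phi)$$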

and $\mathbb{P}(\textbf{p}_{pml,\phi},\phi)$ is the maximum PML objective value.

The central goal of this paper is to define efficient algorithms for computing approximate PML distributions defined as follows.

Definition 2.4 (Approximate PML).

For any profile $\phi\in\Phi^{n}$ , a distribution $\textbf{p}^{\beta}_{pml,\phi}\in\Delta^{\mathcal{D}}$ is a $\beta$ -approximate PML distribution if

$$\mathbb{P}(\textbf{p}^{\beta}_{pml,\phi},\phi)\geq\beta\cdot\mathbb{P}(\textbf{p}_{pml,\phi},\phi)$$

Throughout this paper we use the phrase approximate PML to denote a $\beta$ -approximate PML distribution for some non-trivial $\beta$ .

2.1 Representation of a profile

For any profile $\phi\in\Phi^{n}$ , we represent $\phi$ using the set of $(frequency,count)$ tuples, where a tuple $(a,b)$ denotes that $b$ domain elements have frequency $a$ in the sequence. We use $\phi_{size}$ to denote the size of profile $\phi$ in this representation, i.e. the number of such tuples. It is not hard to see that for any length $n$ profile $\phi_{size}\in O(\sqrt{n})$ . Further it takes $O(n)$ time to write the profile in this representation.
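One short way to see the $O(\sqrt{n})$ bound (our own sketch; $k$ denotes the number of distinct frequencies, i.e. the number of tuples): the $k$ frequencies are distinct positive integers, each attained by at least one domain element, so

$$n=\sum_{j=1}^{k}d_{j}\cdot\phi_{j}\;\geq\;\sum_{j=1}^{k}j=\frac{k(k+1)}{2},$$

which gives $k\leq\sqrt{2n}$.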

For all our algorithmic results, when we are given a profile, we assume the above representation. We will explicitly state running times when we start with a sequence instead of a profile.

3 Results

Here we state the main results of this paper. Our first main theorem provides an algorithm to efficiently compute an approximate PML distribution. The approximation guarantee in this result trades off against the running time: we can achieve sub-linear running times (in the size of the sample) if we allow for weaker approximation guarantees.

Theorem 3.1 (Efficient and approximate PML distribution).

Given a profile $\phi\in\Phi^{n}$ , let $\textbf{p}_{pml}$ be its corresponding PML distribution. There is an algorithm that for any $\frac{1}{\mathrm{poly}(n)}<\epsilon_{1},\epsilon_{2}<1$ , computes an $\exp(-O(\epsilon_{1}n+\epsilon_{2}n\log n+\frac{\log^{3}n}{\epsilon_{1}\epsilon_{2}}))$ -approximate PML distribution $\textbf{p}_{approx}$ , i.e.

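$$\mathbb{P}(\textbf{p}_{approx},\phi)\geq\exp\left(-O\left(\epsilon_{1}n+\epsilon_{2}n\log n+\frac{\log^{3}n}{\epsilon_{1}\epsilon_{2}}\right)\right)\mathbb{P}(\textbf{p}_{pml,\phi},\phi)$$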

in $O\left(\phi_{size}+\frac{1}{\epsilon_{2}^{2}\epsilon_{1}}\log^{O(1)}(\frac{1}{\epsilon_{1}\epsilon_{2}})+\frac{1}{\epsilon_{2}^{3}}\log^{O(1)}(\frac{1}{\epsilon_{1}\epsilon_{2}})\right)$ time. Using $\phi_{size}\in O(\sqrt{n})$ this running time simplifies to $O\left(\sqrt{n}+\frac{1}{\epsilon_{2}^{2}\epsilon_{1}}\log^{O(1)}(\frac{1}{\epsilon_{1}\epsilon_{2}})+\frac{1}{\epsilon_{2}^{3}}\log^{O(1)}(\frac{1}{\epsilon_{1}\epsilon_{2}})\right)$ .

In the above result, the best approximation is achieved for $\epsilon_{1},\epsilon_{2}=n^{-1/3}$ and we get an $\exp(-O(n^{2/3}\log^{3}n))$ -approximate PML distribution in nearly linear time (in the number of samples). This result is summarized below.
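As a check of this parameter choice (a worked calculation of ours, using only the bounds stated in Theorem 3.1), substituting $\epsilon_{1}=\epsilon_{2}=n^{-1/3}$ into the three terms of the exponent gives

$$\epsilon_{1}n=n^{2/3},\qquad\epsilon_{2}n\log n=n^{2/3}\log n,\qquad\frac{\log^{3}n}{\epsilon_{1}\epsilon_{2}}=n^{2/3}\log^{3}n,$$

so the exponent is $O(n^{2/3}\log^{3}n)$. For the running time, the same substitution gives

$$\frac{1}{\epsilon_{2}^{2}\epsilon_{1}}=n,\qquad\frac{1}{\epsilon_{2}^{3}}=n,\qquad\log^{O(1)}\Big(\frac{1}{\epsilon_{1}\epsilon_{2}}\Big)=\log^{O(1)}n,$$

which together with $\phi_{size}\in O(\sqrt{n})$ yields a total of $\widetilde{O}(n)$.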

Corollary 3.2 (Nearly linear time $\exp(-O(n^{2/3}\log^{3}n))$-approximate PML distribution).

Let $y^{n}\in\mathcal{D}^{n}$ be a sequence and $\phi=\Phi(y^{n})$ be its corresponding profile. There is an algorithm that computes an $\exp(-O(n^{2/3}\log^{3}n))$ -approximate PML distribution in time $\widetilde{O}(n)$ .

This result constitutes the first polynomial time algorithm to compute an $\exp(-n^{1-\delta})$-approximate PML for any constant $\delta>0$. In the corollary above we start with a sequence instead of a profile; in this case our algorithm still runs in $\widetilde{O}(n)$ because we only need $O(n)$ time to compute the profile of a sequence in the representation discussed in Section 2.1.

Our next result relates an approximate PML distribution to a universal plug-in estimator that is sample complexity optimal for support size, coverage, entropy and distance from uniform. In Section 5, we prove this result. However it is easy to see the proof presented in Section 5 proves a more general result that approximate PML is sample complexity optimal for a broad class of symmetric properties $\textbf{f}(\cdot)$ satisfying certain conditions. One such set of conditions (informally) is the existence of an estimator $\widehat{\textbf{f}}$ for $\textbf{f}(\cdot)$ with the following properties: $(1)$ the estimator $\widehat{\textbf{f}}$ is sample complexity optimal, $(2)$ the estimator $\widehat{\textbf{f}}$ has low bias, and $(3)$ the output of the estimator is not changed by much when we change any individual sample. This result was already shown in [ADOS16] for an $\exp(-n^{0.5})$-approximate PML distribution. Using the same proof with slight modifications we get the following result.

Theorem 3.3 (Universal estimator using approximate PML).

Let $n$ be the optimal sample complexity of estimating entropy, support, support coverage and distance to uniformity and $c$ be a large positive constant. Let $\epsilon\geq\frac{3c}{n^{1/6-\eta}}$ for any constant $\eta>0$ , then for any $\beta>\exp(-O(n^{2/3}\log^{3}n))$ , the $\beta$ -approximate PML estimator estimates entropy, support, support coverage, and distance to uniformity to an accuracy of $4\epsilon$ with probability at least $1-\exp(-n^{2/3})$ .

Setting $\eta=1/6-0.166$ in the theorem above, so that $n^{1/6-\eta}=n^{0.166}$, and combining it with Corollary 3.2, we obtain the following result.

Theorem 3.4 (Efficient universal estimator using approximate PML).

Let $n$ be the optimal sample complexity of estimating entropy, support, support coverage and distance to uniformity. If $\epsilon\geq\frac{3c}{n^{0.166}}$ , then there exists a PML based universal plug-in estimator that runs in time $\widetilde{O}(n)$ and is sample complexity optimal for estimating entropy, support, support coverage and distance to uniformity to accuracy $4\epsilon$ .

Our techniques for PML are general and can be extended to a generalization of PML to multiple dimensions (multidimensional PML). We provide a polynomial time (in number of samples) algorithm to compute approximate PML in multiple dimensions when the number of dimensions is constant. This allows for universal plug-in estimation of various symmetric relationships between multiple distributions. We next formally define and state our main results for multidimensional PML.

3.1 Results for multidimensional PML

First we describe the multidimensional setting, then we define multidimensional PML, and then state our main results. Throughout this paper we assume the number of dimensions is constant.

Multidimensional setup:

For each $k\in[1,d]$ , we receive a sequence $\textbf{y}^{\textbf{n}(k)}$ that consists of $\textbf{n}(k)$ independent samples drawn from an underlying distribution $\textbf{p}(k)$ supported on the same domain $\mathcal{D}$ ( $N\stackrel{{\scriptstyle\mathrm{def}}}{{=}}|\mathcal{D}|$ ). Further, $\textbf{y}^{\textbf{n}(k)}$ is independent of the other sequences $\textbf{y}^{\textbf{n}(k^{\prime})}$ for $k^{\prime}\in[1,d]$ and $k^{\prime}\neq k$ . We call $\textbf{y}^{\textbf{n}}=(\textbf{y}^{\textbf{n}(1)},\dots,\textbf{y}^{\textbf{n}(d)})$ a $d$ -sequence and $\textbf{n}=(\textbf{n}(1),\dots,\textbf{n}(d))$ its $d$ -length. Let $\mathcal{D}^{\textbf{n}}$ be the set of all $d$ -sequences of $d$ -length equal to n. We use $\textbf{p}_{x}(k)$ to denote the probability of domain element $x$ in distribution $\textbf{p}(k)$ . We also refer to $\textbf{p}=(\textbf{p}(1),\dots,\textbf{p}(d))$ as a $d$ -distribution and let $\Delta^{\mathcal{D},d}$ denote the set of all $d$ -distributions.

For any $d$ -distribution $\textbf{p}\in\Delta^{\mathcal{D},d}$ , the probability of a $d$ -sequence $\textbf{y}^{\textbf{n}}$ is defined as:

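$\mathbb{P}(\textbf{p},\textbf{y}^{\textbf{n}})\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\prod_{k=1}^{d}\prod_{x\in\mathcal{D}}\textbf{p}_{x}(k)^{\textbf{f}(\textbf{y}^{\textbf{n}(k)},x)}$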

Recall that for each $k\in[1,d]$ , $\textbf{f}(\textbf{y}^{\textbf{n}(k)},x)$ is the frequency of domain element $x$ in sequence $\textbf{y}^{\textbf{n}(k)}$ . For any $d$ -sequence $\textbf{y}^{\textbf{n}}$ , we call $\textbf{f}(\textbf{y}^{\textbf{n}},x)=(\textbf{f}(\textbf{y}^{\textbf{n}(1)},x),\dots,\textbf{f}(\textbf{y}^{\textbf{n}(d)},x))$ the $d$ -frequency of domain element $x$ in $\textbf{y}^{\textbf{n}}$ . Let $\textbf{F}^{\textbf{n}}$ be the set of all $d$ -frequencies generated by different domain elements in all possible $d$ -sequences in $\mathcal{D}^{\textbf{n}}$ and we let $\textbf{e}_{j}\in\textbf{F}^{\textbf{n}}$ denote its $j$ th element. We next define multidimensional generalizations of profile, PML, and approximate PML.

$d$ -Profile:

For any $d$ -sequence $\textbf{y}^{\textbf{n}}\in\mathcal{D}^{\textbf{n}}$ , we call $\phi=\Phi(\textbf{y}^{\textbf{n}})$ a $d$ -profile if $\phi=(\phi_{j})_{j=1\dots|\textbf{F}^{\textbf{n}}|}$ and $\phi_{j}=|\{x\in\mathcal{D}~{}|~{}\textbf{f}(\textbf{y}^{\textbf{n}},x)=\textbf{e}_{j}\}|$ is the number of domain elements with $d$ -frequency $\textbf{e}_{j}$ . We call $\textbf{n}$ the $d$ -length of $\phi$ and use $\Phi^{n}$ to denote the set of all $d$ -profiles of $d$ -length equal to $\textbf{n}$ . For any $d$ -distribution $\textbf{p}\in\Delta^{\mathcal{D},d}$ , the probability of a $d$ -profile $\phi\in\Phi^{n}$ is defined as:

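$\mathbb{P}(\textbf{p},\phi)\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\sum_{\{\textbf{y}^{\textbf{n}}\in\mathcal{D}^{\textbf{n}}~{}|~{}\Phi(\textbf{y}^{\textbf{n}})=\phi\}}\mathbb{P}(\textbf{p},\textbf{y}^{\textbf{n}})$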
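As an illustration of the definitions above, the following minimal sketch computes $d$ -frequencies and the resulting $d$ -profile from $d$ sample sequences. The function names `d_frequencies` and `d_profile` are hypothetical, and the sketch stores only elements observed in at least one sequence (unseen elements, whose $d$ -frequency is all zeros, are omitted).

```python
from collections import Counter

def d_frequencies(sequences):
    """Map each observed domain element to its d-frequency tuple."""
    counts = [Counter(seq) for seq in sequences]        # one Counter per dimension k
    observed = set().union(*(c.keys() for c in counts))
    # Counter returns 0 for elements missing from a given sequence.
    return {x: tuple(c[x] for c in counts) for x in observed}

def d_profile(sequences):
    """Count how many domain elements share each d-frequency."""
    return Counter(d_frequencies(sequences).values())

y1 = ["a", "b", "a", "c", "b"]   # n(1) = 5 samples from p(1)
y2 = ["c", "d", "d"]             # n(2) = 3 samples from p(2)
freqs = d_frequencies([y1, y2])
profile = d_profile([y1, y2])
```

Here the elements `a` and `b` share the $d$ -frequency $(2,0)$ , so the corresponding profile entry equals 2; the labels themselves are discarded, which is what makes the profile a sufficient summary for symmetric properties.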

Profile maximum likelihood:

For any $d$ -profile $\phi\in\Phi^{n}$ , a Profile Maximum Likelihood $d$ -distribution $\textbf{p}_{pml,\phi}\in\Delta^{\mathcal{D},d}$ is:

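$\textbf{p}_{pml,\phi}\in\operatorname*{arg\,max}_{\textbf{p}\in\Delta^{\mathcal{D},d}}\mathbb{P}(\textbf{p},\phi)$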

and $\mathbb{P}(\textbf{p}_{pml,\phi},\phi)$ is the maximum PML objective value.

Approximate profile maximum likelihood:

For any $d$ -profile $\phi\in\Phi^{n}$ , a $d$ -distribution $\textbf{p}^{\beta}_{pml,\phi}\in\Delta^{\mathcal{D},d}$ is a $\beta$ -approximate PML $d$ -distribution if

$\mathbb{P}(\textbf{p}^{\beta}_{pml,\phi},\phi)\geq\beta\cdot\mathbb{P}(\textbf{p}_{pml,\phi},\phi)$ .

We next state our results for approximate PML $d$ -distributions. In Theorem 3.5, we give an algorithm to efficiently compute an approximate PML $d$ -distribution. Then, we substitute $d=2$ in this result to get Corollary 3.6.

Theorem 3.5 (Efficient and approximate multidimensional PML).

Let $\textbf{y}^{\textbf{n}}$ be a $d$ -sequence of $d$ -length $\textbf{n}=(\textbf{n}(1),\dots,\textbf{n}(d))$ . There is an algorithm that computes an $\exp\left(-\widetilde{O}\left(\sum_{k=1}^{d}\textbf{n}(k)^{1-1/(2d+1)}\right)\right)$ -approximate PML $d$ -distribution $\textbf{p}_{\mathrm{approx}}$ in $\widetilde{O}(\sum_{k=1}^{d}\textbf{n}(k)+\prod_{k=1}^{d}\textbf{n}(k)^{3/(2d+1)})$ time. [Footnote 2: Here the $\widetilde{O}$ notation hides all $\prod_{k=1}^{d}\log^{O(1)}\textbf{n}(k)$ terms and therefore the $O(d)$ term as well.]

Corollary 3.6 (Efficient and approximate PML for two dimensions).

For $d=2$ , let $\textbf{y}^{\textbf{n}}$ be a $d$ -sequence of $d$ -length $\textbf{n}=(\textbf{n}(1),\textbf{n}(2))$ . There is an algorithm that computes an $\exp(-\widetilde{O}\left(\textbf{n}(1)^{4/5}+\textbf{n}(2)^{4/5}\right))$ -approximate PML $d$ -distribution $\textbf{p}_{\mathrm{approx}}$ in $\widetilde{O}(\textbf{n}(1)+\textbf{n}(2)+\textbf{n}(1)^{3/5}\textbf{n}(2)^{3/5})$ time.
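The exponents in Corollary 3.6 follow from Theorem 3.5 by direct substitution. The short sketch below carries out that arithmetic exactly; the helper name `pml_exponents` is hypothetical and only evaluates the two expressions $1-1/(2d+1)$ and $3/(2d+1)$ that appear in the theorem statement.

```python
from fractions import Fraction

def pml_exponents(d):
    """Return the two exponents appearing in Theorem 3.5 for dimension d."""
    approx_exp = 1 - Fraction(1, 2 * d + 1)   # exponent in n(k)^{1 - 1/(2d+1)}
    time_exp = Fraction(3, 2 * d + 1)         # exponent in n(k)^{3/(2d+1)}
    return approx_exp, time_exp

approx_exp, time_exp = pml_exponents(2)
one_dim = pml_exponents(1)
```

For $d=2$ this gives $4/5$ and $3/5$ , matching the corollary. For $d=1$ the same expressions give $2/3$ and $1$ .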

As mentioned before, one of the important applications of approximate multidimensional PML is in estimating symmetric properties for $d$ -distributions. A symmetric property is a function of $d$ -distributions that is invariant to a permutation of the labels. Here we study one such symmetric property for $d=2$ called KL divergence that is studied in the context of PML. Estimation of KL divergence between two distributions is well studied and estimators that achieve optimal sample complexity were given by [BZLV16, HJW16]. In Theorem 3.7, we show that approximate PML is sample complexity optimal for estimating KL divergence. A similar result was already shown in [Ach18] (Theorem 6) for exact PML and we use the same proof with slight modification to prove our result. In Corollary 3.8, we give an efficient version of Theorem 3.7 by combining it with Corollary 3.6.

Theorem 3.7 (Optimal sample complexity for KL divergence).

Let $B$ be such that, $\forall x\in\mathcal{D}$ , $\frac{\textbf{p}(1)_{x}}{\textbf{p}(2)_{x}}\leq B$ and let $\textbf{n}=(\textbf{n}(1),\textbf{n}(2))$ be the optimal sample complexity for estimating KL divergence between $\textbf{p}(1)$ and $\textbf{p}(2)$ to an accuracy $\epsilon$ . If $\epsilon>\frac{\log^{3}N}{N}$ and $B\leq\epsilon^{2.24}N^{0.24}$ , then $\beta$ -approximate PML $d$ -distribution (for $d=2$ ) with $\beta>\exp(-\widetilde{O}\left(\textbf{n}(1)^{4/5}+\textbf{n}(2)^{4/5}\right))$ is sample complexity optimal for estimating KL divergence to an accuracy $4\epsilon$ . [Footnote 3: Recall $N$ here is the size of domain $\mathcal{D}$ .]

Theorem 6 in [Ach18] also requires $\epsilon>\frac{\log^{3}N}{N}$ and a slightly weaker version of the other condition ( $B^{3/2}\leq\epsilon^{0.99}N^{0.49}$ ).

Corollary 3.8 (Efficient estimator for KL divergence).

Let $B$ be such that, $\forall x\in\mathcal{D}$ , $\frac{\textbf{p}(1)_{x}}{\textbf{p}(2)_{x}}\leq B$ and let $\textbf{n}=(\textbf{n}(1),\textbf{n}(2))$ be the optimal sample complexity for estimating KL divergence between $\textbf{p}(1)$ and $\textbf{p}(2)$ to an accuracy $\epsilon$ . If $\epsilon>\frac{\log^{3}N}{N}$ and $B\leq\epsilon^{2.24}N^{0.24}$ , then there exists a PML based universal plug-in estimator that runs in $\widetilde{O}(\textbf{n}(1)+\textbf{n}(2)+\textbf{n}(1)^{3/5}\textbf{n}(2)^{3/5})$ time and is sample complexity optimal for estimating KL divergence to an accuracy $4\epsilon$ .

4 Existence of Structured Approximate PML for One Dimension

Here we provide the proof for Theorem 3.1. First, we show the existence of an approximate PML distribution with a nice structure in Sections 4.1, 4.2 and 4.3. Then, we exploit this structure in Section 4.4 to give an algorithm that returns a fractional solution with running time ranging from nearly linear to sublinear depending on the desired approximation factor. Finally, in Section 4.5 we present a rounding algorithm that takes the fractional solution from the previous step as input and returns an approximate PML distribution within the desired approximation factor.

First, we show the existence of a distribution with minimum non-zero probability value $\Omega(\frac{1}{n^{2}})$ that is an $\exp\left(-6\right)$ -approximate PML distribution.

Lemma 4.1 (Minimum probability lemma).

For any profile $\phi\in\Phi^{n}$ , there exists a distribution $\textbf{p}^{\prime\prime}\in\Delta^{\mathcal{D}}$ such that $\textbf{p}^{\prime\prime}$ is a $\exp\left(-6\right)$ -approximate PML distribution and $\min_{x\in\mathcal{D}:\textbf{p}^{\prime\prime}_{x}\neq 0}\textbf{p}^{\prime\prime}_{x}\geq\frac{1}{2n^{2}}$ .

Proof.

See Appendix A. ∎

This lemma allows us to define a region in which our approximate PML takes all its probability values and we use this fact throughout the paper. In Section 4.1 and Section 4.2 we show how we can further simplify the problem of computing an approximate PML by discretizing the probability and the frequency spaces respectively.

4.1 Probability discretization

Let $\textbf{P}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{(1+\epsilon_{1})^{1-i}:i=1,\dots,b_{1}\}$ for some $\epsilon_{1}\in(0,1)$ , where $b_{1}=O(\frac{\log n}{\epsilon_{1}})$ is such that $(1+\epsilon_{1})^{1-b_{1}}\leq\frac{1}{2n^{2}}$ . The set $\textbf{P}$ represents the discretization of the probability space. Discretization introduces the technicality that probability values need not sum to one, and we define pseudo-distributions and discrete pseudo-distributions to handle it.

Definition 4.2 (Pseudo-distribution).

$\textbf{q}\in[0,1]^{\mathcal{D}}_{\mathbb{R}}$ is a pseudo-distribution if $\|\textbf{q}\|_{1}\leq 1$ and a discrete pseudo-distribution if all its entries are in $\textbf{P}$ as well. We use $\Delta_{\mathrm{pseudo}}^{\mathcal{D}}$ and $\Delta_{\mathrm{discrete}}^{\mathcal{D}}$ to denote the sets of all pseudo-distributions and discrete pseudo-distributions respectively. [Footnote 4: As discussed in Section 2 we extend all functions of distributions as functions defined for any general vector in $\mathbb{R}^{\mathcal{D}}$ and therefore to pseudo-distributions as well.] For convenience we refer to $\mathbb{P}(\textbf{q},\phi)$ for any pseudo-distribution $\textbf{q}$ as the “probability” of profile $\phi$ or PML objective value with respect to $\textbf{q}$ .

One of the important structural properties we prove here is the following: there exists a discrete pseudo-distribution $\textbf{q}$ that, when converted to a distribution by dividing all its entries by its $\ell_{1}$ norm ( $\frac{\textbf{q}}{\|\textbf{q}\|_{1}}$ ), is an approximate PML distribution. Even stronger, the discrete pseudo-distribution $\textbf{q}$ itself has a $\mathbb{P}(\textbf{q},\phi)$ value that approximates $\mathbb{P}(\textbf{p}_{pml,\phi},\phi)$ within a good factor, and normalizing $\textbf{q}$ by its $\ell_{1}$ norm can only increase this probability because $\|\textbf{q}\|_{1}\leq 1$ . In the rest of the paper we refer to such a discrete pseudo-distribution as an approximate PML pseudo-distribution, and for the reason above we focus on finding an approximate PML pseudo-distribution.
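The claim that normalization only helps can be checked directly from the product form of the sequence probability, $\mathbb{P}(\textbf{q},y^{n})=\prod_{x\in\mathcal{D}}\textbf{q}_{x}^{\textbf{f}(y^{n},x)}$ . The short derivation below is a sketch, assuming $\textbf{q}\neq 0$ and using $\sum_{x\in\mathcal{D}}\textbf{f}(y^{n},x)=n$ :

```latex
\mathbb{P}\!\left(\frac{\textbf{q}}{\|\textbf{q}\|_{1}},y^{n}\right)
  =\prod_{x\in\mathcal{D}}\left(\frac{\textbf{q}_{x}}{\|\textbf{q}\|_{1}}\right)^{\textbf{f}(y^{n},x)}
  =\|\textbf{q}\|_{1}^{-n}\,\mathbb{P}(\textbf{q},y^{n})
  \;\geq\;\mathbb{P}(\textbf{q},y^{n})
  \qquad\text{since }\|\textbf{q}\|_{1}\leq 1 .
```

Summing both sides over all sequences $y^{n}$ with profile $\phi$ gives $\mathbb{P}(\textbf{q}/\|\textbf{q}\|_{1},\phi)\geq\mathbb{P}(\textbf{q},\phi)$ .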

The way we show the existence of such a discrete pseudo-distribution that is an approximate PML pseudo-distribution is by taking the PML distribution and converting it into a discrete pseudo-distribution while still preserving the PML objective value to a desired approximation factor. Our next lemma formally proves a general version of this statement. In the remainder of this paper, for notational convenience, for a scalar $c$ and set S we use the notation $\lfloor c\rfloor_{\textbf{S}}$ and $\lceil c\rceil_{\textbf{S}}$ to denote:

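$\lfloor c\rfloor_{\textbf{S}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\sup_{s\in\textbf{S}:s\leq c}s\quad\text{ and }\quad\lceil c\rceil_{\textbf{S}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\inf_{s\in\textbf{S}:s\geq c}s$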

Definition 4.3 (Discrete pseudo-distribution).

For any distribution $\textbf{p}\in\Delta^{\mathcal{D}}$ , its discrete pseudo-distribution $\textbf{q}=\mathrm{disc}(\textbf{p})\in\Delta_{\mathrm{discrete}}^{\mathcal{D}}$ is defined as:

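$\textbf{q}_{x}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\lfloor\textbf{p}_{x}\rfloor_{\textbf{P}}\quad\forall x\in\mathcal{D}$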

Note that $\lfloor\textbf{p}_{x}\rfloor_{\textbf{P}}\geq\frac{\textbf{p}_{x}}{1+\epsilon_{1}}$ . Further, for $\textbf{p}\in\Delta^{\mathcal{D}}$ , $\frac{1}{1+\epsilon_{1}}\leq||\mathrm{disc}(\textbf{p})||_{1}\leq 1$ . We next state a result that captures the impact of discretizing the probability space.
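The probability discretization can be illustrated with a small numerical sketch. The helper names `probability_grid`, `floor_to_grid` and `disc` are hypothetical; the sketch assumes every non-zero probability is at least the smallest grid point (as guaranteed by Lemma 4.1 for the distributions of interest), keeps zero entries at zero, and uses floating point, so comparisons in practice need a small tolerance.

```python
import bisect
import math

def probability_grid(n, eps1):
    """Return P = {(1+eps1)^(1-i) : i = 1..b1} in ascending order."""
    # b1 is chosen so that the smallest grid point is at most 1/(2 n^2).
    b1 = 1 + math.ceil(math.log(2 * n * n) / math.log1p(eps1))
    return sorted((1 + eps1) ** (1 - i) for i in range(1, b1 + 1))

def floor_to_grid(c, grid):
    """Largest grid value that is <= c (rounding down with respect to P)."""
    idx = bisect.bisect_right(grid, c) - 1
    if idx < 0:
        raise ValueError("value lies below the smallest grid point")
    return grid[idx]

def disc(p, grid):
    """Round every non-zero probability down to the grid; zeros stay zero."""
    return [floor_to_grid(px, grid) if px > 0 else 0.0 for px in p]

n, eps1 = 50, 0.1
grid = probability_grid(n, eps1)
p = [0.5, 0.3, 0.15, 0.05]
q = disc(p, grid)
```

Each entry shrinks by a factor of at most $1+\epsilon_{1}$ , so the $\ell_{1}$ norm of `q` stays between $\frac{1}{1+\epsilon_{1}}$ and $1$ , in line with the note above.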

Lemma 4.4 (Probability discretization lemma).

For any profile $\phi\in\Phi^{n}$ and distribution $\textbf{p}\in\Delta^{\mathcal{D}}$ , its discrete pseudo-distribution $\textbf{q}=\mathrm{disc}(\textbf{p})\in\Delta_{\mathrm{discrete}}^{\mathcal{D}}$ satisfies:

[TABLE]

Proof.

The first inequality is immediate because $\textbf{q}_{x}=\lfloor\textbf{p}_{x}\rfloor_{\textbf{P}}\leq\textbf{p}_{x}$ for all $x\in\mathcal{D}$ . To show the second inequality, consider any sequence $y^{n}\in\mathcal{D}^{n}$ ,

[TABLE]

In the inequality above we use $\sum_{x\in\mathcal{D}}\textbf{f}(y^{n},x)=n$ . Now,

[TABLE]

∎

4.2 Multiplicity discretization

Let $\textbf{M}=\{\lceil(1+\epsilon_{2}/2)^{1}\rceil,\lceil(1+\epsilon_{2}/2)^{2}\rceil,\dots,\lceil(1+\epsilon_{2}/2)^{k-1}\rceil,n\}\cup\{1,2,3,\dots,\lceil\frac{1}{\epsilon_{2}}\rceil\}$ be the set representing discretization of multiplicities where $k=O(\frac{\log n}{\epsilon_{2}})$ is such that $\lceil(1+\epsilon_{2}/2)^{k}\rceil\geq n$ , $\lceil(1+\epsilon_{2}/2)^{k-1}\rceil<n$ and as before $\epsilon_{2}\in(0,1)$ will be carefully chosen later. Let $b_{2}=|\textbf{M}|=O(\frac{\log n}{\epsilon_{2}})$ and note the definition of $\textbf{M}$ keeps all positive integers $\leq\lceil\frac{1}{\epsilon_{2}}\rceil$ . We use $\mathrm{m}_{j}$ to denote elements of set $\textbf{M}$ and using this set $\textbf{M}$ we define an analogous quantity to profile called discrete profile.

Definition 4.5 (Discrete profile).

For a sequence $y^{n}\in\mathcal{D}^{n}$ , its discrete profile $\phi^{\prime}=\Phi^{\prime}(y^{n})\in\mathbb{Z}_{+}^{b_{2}}$ is a profile and is defined as: $\phi^{\prime}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}(\phi^{\prime}_{j})_{j=1\dots b_{2}}$ , where $\phi^{\prime}_{j}=\phi^{\prime}_{j}(y^{n})\stackrel{{\scriptstyle\mathrm{def}}}{{=}}|\{x\in\mathcal{D}~{}|~{}\lceil\textbf{f}(y^{n},x)\rceil_{\textbf{M}}=\mathrm{m}_{j}\}|$ and $n^{\prime}=\sum_{x\in\mathcal{D}}\lceil\textbf{f}(y^{n},x)\rceil_{\textbf{M}}=\sum_{j=1}^{b_{2}}\mathrm{m}_{j}\phi^{\prime}_{j}$ is the length of discrete profile $\phi^{\prime}$ with $n^{\prime}\leq(1+\epsilon_{2})n$ . We use $\Phi_{\mathrm{discrete}}^{n}$ to denote the set of all such discrete profiles.

Note:

As mentioned in the definition, a discrete profile is also a profile. Note that in the representation of a discrete profile we might have indices $i$ with $\phi^{\prime}_{i}=0$ , whereas we have defined profiles so that there are no such zero entries. We keep these zero entries in our discrete profile $\phi^{\prime}$ for notational convenience and proof simplification. Further, it only takes $O(\phi_{size})$ time to write a discrete profile from access to a profile $\phi$ in the representation discussed in Section 2.1.
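The multiplicity discretization and the discrete profile of Definition 4.5 can be sketched as follows. The function names `multiplicity_grid`, `ceil_to_grid` and `discrete_profile` are hypothetical, and the sketch works from a raw sequence rather than from the compact profile representation of Section 2.1.

```python
import bisect
import math
from collections import Counter

def multiplicity_grid(n, eps2):
    """Return the sorted set M of allowed multiplicities for sample size n."""
    base = 1 + eps2 / 2
    grid = set(range(1, math.ceil(1 / eps2) + 1))   # keep every small integer
    power = base
    while math.ceil(power) < n:                     # geometric part, stops below n
        grid.add(math.ceil(power))
        power *= base
    grid.add(n)
    return sorted(grid)

def ceil_to_grid(f, grid):
    """Smallest element of M that is >= f (assumes 1 <= f <= n)."""
    return grid[bisect.bisect_left(grid, f)]

def discrete_profile(sequence, eps2):
    """Return (M, phi') where phi'[j] counts elements rounded up to M[j]."""
    n = len(sequence)
    grid = multiplicity_grid(n, eps2)
    rounded = [ceil_to_grid(f, grid) for f in Counter(sequence).values()]
    counts = Counter(rounded)
    phi_prime = [counts[m] for m in grid]           # zero entries are kept
    return grid, phi_prime

seq = ["a"] * 7 + ["b"] * 7 + ["c"] * 3 + ["d"] * 3
grid, phi_prime = discrete_profile(seq, 0.5)
n_prime = sum(m * c for m, c in zip(grid, phi_prime))
```

In this toy run with $n=20$ and $\epsilon_{2}=0.5$ , the frequency 7 is not in $\textbf{M}$ and is rounded up to 8, so the discrete profile has length $n^{\prime}=22\leq(1+\epsilon_{2})n$ .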

A discrete profile $\phi^{\prime}$ is a profile of length $n^{\prime}$ and it corresponds to the profile of some sequence of length $n^{\prime}$ . One such sequence can be obtained by appending, for each symbol $x$ , $\lceil\textbf{f}(y^{n},x)\rceil_{\textbf{M}}-\textbf{f}(y^{n},x)$ copies of $x$ to the sequence $y^{n}$ itself. The probability of $\phi^{\prime}$ with respect to a distribution $\textbf{p}$ is straightforward:

[TABLE]

We next state a result that captures the impact of discretizing the multiplicity space. It is important to note that probability terms ( $\mathbb{P}(\textbf{p},\phi)$ and $\mathbb{P}(\textbf{p},\phi^{\prime})$ ) have different summation terms and yet we show their values approximate each other.

Lemma 4.6 (Profile discretization lemma).

For any distribution $\textbf{p}\in\Delta^{\mathcal{D}}$ , and a sequence $y^{n}\in\mathcal{D}^{n}$ :

[TABLE]

where $\phi=\Phi(y^{n})$ and $\phi^{\prime}=\Phi^{\prime}(y^{n})$ are the profile and discrete profile of $y^{n}$ respectively.

Proof.

See Appendix B. ∎

Combining both Lemma 4.4 and Lemma 4.6 we bound the impact of discretizing both probabilities and multiplicities.

Corollary 4.7 (Discretization lemma).

For any distribution $\textbf{p}\in\Delta^{\mathcal{D}}$ and sequence $y^{n}\in\mathcal{D}^{n}$ , if $\textbf{q}=\mathrm{disc}(\textbf{p})$ is the discrete pseudo-distribution of $\textbf{p}$ , then

[TABLE]

where $\phi=\Phi(y^{n})$ and $\phi^{\prime}=\Phi^{\prime}(y^{n})$ are the profile and discrete profile of $y^{n}$ respectively.

The discretization lemma above suggests that optimizing over discrete pseudo-distributions with $\phi^{\prime}$ as input is approximately as good as optimizing over distributions with $\phi$ as input. This result motivates the definition of a new objective function which we introduce and study next.

4.3 Discrete PML Optimization

Here we define a new optimization problem that admits convex relaxations and further returns an approximate PML pseudo-distribution. [Footnote 5: Note we call a pseudo-distribution $\textbf{q}$ an approximate PML pseudo-distribution if it satisfies $\mathbb{P}(\textbf{q},\phi^{\prime})\geq\beta\mathbb{P}(\textbf{p}_{pml,\phi},\phi)$ , for some non-trivial $\beta$ .] First, we define a discrete profile maximum likelihood (DPML) which is just the PML objective maximized over discrete pseudo-distributions with discrete profile as input. In Corollary 4.9 we show the optimal discrete pseudo-distribution of this new objective is an approximate PML pseudo-distribution. In Lemma 4.10, we rephrase the DPML optimization problem. Finally, using this DPML reformulation, we define a new optimization problem that we call a single discrete PML (SDPML) and in Lemma 4.14, we show the maximizing discrete pseudo-distribution for the SDPML objective is an approximate PML pseudo-distribution.

Definition 4.8 (Discrete profile maximum likelihood).

Let $y^{n}\in\mathcal{D}^{n}$ be any sequence, $\phi=\Phi(y^{n})$ and $\phi^{\prime}=\Phi^{\prime}(y^{n})$ be its profile and discrete profile respectively, a discrete profile maximum likelihood (DPML) pseudo-distribution $\textbf{q}_{dpml,\phi^{\prime}}\in\Delta_{\mathrm{discrete}}^{\mathcal{D}}$ is:

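$\textbf{q}_{dpml,\phi^{\prime}}\in\operatorname*{arg\,max}_{\textbf{q}\in\Delta_{\mathrm{discrete}}^{\mathcal{D}}}\mathbb{P}(\textbf{q},\phi^{\prime})$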

and $\mathbb{P}(\textbf{q}_{dpml,\phi^{\prime}},\phi^{\prime})$ is the maximum objective value.

Corollary 4.9 (DPML is an approximate PML).

For any sequence $y^{n}\in\mathcal{D}^{n}$ if $\phi=\Phi(y^{n})$ and $\phi^{\prime}=\Phi^{\prime}(y^{n})$ are its profile and discrete profile respectively, then

[TABLE]

Proof.

Note that $\textbf{q}_{pml,\phi}=\mathrm{disc}(\textbf{p}_{pml,\phi})$ is a discrete pseudo-distribution. The result follows from Corollary 4.7 applied to $\textbf{p}_{pml,\phi}$ . ∎

In an approximate sense, Corollary 4.7 suggests that working with the discrete profile and discrete pseudo-distributions is no different from working with the original profile and distributions.

In the next two lemmas we rephrase the DPML optimization problem in forms that are amenable to convex relaxation. To do this, we introduce some new notation.

• As before let $\textbf{P}$ and $\textbf{M}$ be sets representing discretization of probabilities and frequencies respectively. Recall that we used $1=\mathrm{m}_{1}<\dots<\mathrm{m}_{j}\dots<\mathrm{m}_{b_{2}}$ to denote the elements of set $\textbf{M}$ and we use $\zeta_{1}<\dots<\zeta_{i}\dots<\zeta_{b_{1}}$ to denote the elements of set $\textbf{P}$ . Let $\zeta\in\mathbb{R}^{b_{1}}$ be the vector with elements indexed from $1$ to $b_{1}$ and $i$ th element equal to $\zeta_{i}$ . Also let $\mathrm{m}\in\mathbb{R}^{(b_{2}+1)}$ be the vector with elements indexed from $0$ to $b_{2}$ . Its zeroth entry (denoted by $\mathrm{m}_{0}$ ) is equal to $0$ and its $j$ th entry is equal to $\mathrm{m}_{j}\in\textbf{M}$ .

• Let $X\in\mathbb{Z}_{+}^{b_{1}\times(b_{2}+1)}$ be a variable matrix with entries $X_{ij}$ for $i\in[1,b_{1}],j\in[0,b_{2}]$ . As in the case for vector $\mathrm{m}$ , our second index $j$ of variable matrix $X$ starts at $0$ and not at $1$ . Here the variable $X_{ij}$ counts the number of domain symbols $x\in\mathcal{D}$ with probability value $\zeta_{i}$ and frequency $\mathrm{m}_{j}$ . Further, $X_{i,0}$ counts the number of unseen domain symbols $x\in\mathcal{D}$ with probability value $\zeta_{i}$ .

• For any vector $\textbf{v}$ and set $S$ , we use $\textbf{v}_{S}$ to denote the $|S|$ length vector corresponding to the portion of vector $\textbf{v}$ associated with index set $S$ .

• For a discrete profile $\phi^{\prime}=(\phi^{\prime}_{j})_{j=1\dots b_{2}}$ (corresponding to sequence $y^{n}$ ), define

$\textbf{K}_{\phi^{\prime}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{X\in\mathbb{Z}_{+}^{b_{1}\times(b_{2}+1)}~{}\Big{|}~{}(X^{T}\mathrm{1})_{[1,b_{2}]}=\phi^{\prime},\text{ and }\zeta^{T}X\mathrm{1}\leq 1\}$

Note the constraint $(X^{T}\mathrm{1})_{[1,b_{2}]}=\phi^{\prime}$ does not involve the $X_{i,0}$ variables that correspond to unseen elements. These variables only appear in the constraint $\zeta^{T}X\mathrm{1}\leq 1$ which ensures our output is always a pseudo-distribution.

• For a discrete profile $\phi^{\prime}=(\phi^{\prime}_{j})_{j=1\dots b_{2}}$ (of $y^{n}$ ) and a discrete pseudo-distribution $\textbf{q}$ , also define

$\textbf{K}_{\textbf{q},\phi^{\prime}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{X\in\mathbb{Z}_{+}^{b_{1}\times(b_{2}+1)}~{}\Big{|}~{}(X^{T}\mathrm{1})_{[1,b_{2}]}=\phi^{\prime},\text{ and }X\mathrm{1}=\ell^{\textbf{q}}\}$ where $\ell^{\textbf{q}}\in\mathbb{R}^{b_{1}}$ and $\ell^{\textbf{q}}_{i}$ denotes the number of domain elements with probability value $\zeta_{i}\in\textbf{P}$ in pseudo-distribution $\textbf{q}$ . It will be clear from our next lemma why we define these constraint sets.

The advantage of probability and profile discretization we described earlier is that many types in the set $\{\psi~{}|~{}\Phi(\psi)=\phi^{\prime}\}$ share the same probability value of being observed and our goal is to group them using these $X_{ij}$ variables. Exploiting this idea, we next give a different formulation for the DPML objective.
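To make the grouping concrete, the sketch below builds the matrix $X$ from a toy discrete pseudo-distribution and a toy assignment of discretized frequencies, and then checks the two constraints that define $\textbf{K}_{\phi^{\prime}}$ . The function names `assignment_matrix` and `in_K` and the toy values are hypothetical; rows are indexed from 0 in the code, whereas the text indexes probability levels from 1.

```python
def assignment_matrix(prob_index, freq_index, b1, b2):
    """Build X with X[i][j] = number of elements at probability level i
    and frequency level j; column 0 stands for unseen elements."""
    X = [[0] * (b2 + 1) for _ in range(b1)]
    for i, j in zip(prob_index, freq_index):
        X[i][j] += 1
    return X

def in_K(X, zeta, phi_prime):
    """Check (X^T 1)_{[1,b2]} = phi' and zeta^T X 1 <= 1."""
    b2 = len(phi_prime)
    col_sums = [sum(row[j] for row in X) for j in range(1, b2 + 1)]
    mass = sum(z * sum(row) for z, row in zip(zeta, X))   # zeta^T X 1
    return col_sums == list(phi_prime) and mass <= 1

zeta = [0.125, 0.25, 0.5]      # toy probability levels
prob_index = [2, 1, 0, 0]      # q = (0.5, 0.25, 0.125, 0.125)
freq_index = [2, 1, 1, 0]      # the last element is unseen
X = assignment_matrix(prob_index, freq_index, b1=3, b2=2)
feasible = in_K(X, zeta, [2, 1])
```

Only the column sums for $j\geq 1$ are tied to the discrete profile; the unseen column $j=0$ enters solely through the total mass.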

Lemma 4.10 (DPML objective reformulation).

For any discrete pseudo-distribution $\textbf{q}\in\Delta^{\mathcal{D}}$ and discrete profile $\phi^{\prime}\in\Phi_{\mathrm{discrete}}^{n}$ :

[TABLE]

Proof.

Recall from Equation 3,

[TABLE]

For convenience, we call a type $\psi$ valid if it belongs to set $\{\psi~{}|~{}\Phi(\psi)=\phi^{\prime}\}$ . Recall that variable $X_{ij}$ represents the number of domain elements with probability value $\zeta_{i}$ and frequency $\mathrm{m}_{j}$ . In this representation and for the discrete pseudo-distribution q, each valid type $\psi$ corresponds to the following unique variable assignment $X\in\textbf{K}_{\textbf{q},\phi^{\prime}}$ : $X_{ij}=|\{x\in\mathcal{D}~{}|~{}\textbf{q}_{x}=\zeta_{i}\text{ and }\psi_{x}=\mathrm{m}_{j}\}|$ . Using the previous expression it is not hard to write the exact expression for the probability term associated with the valid type $\psi$ ,

[TABLE]

The previous discussion showed that every valid type corresponds to a unique variable assignment. However, this uniqueness property no longer holds in the reverse direction: multiple valid types might share the same variable assignment. This is where our grouping occurs, and it is the case that we study next.

For any variable assignment $X$ , it is clear from the middle term in Equation 7 that all valid types $\psi$ associated with $X$ share the same probability value of being observed. With this observation, it is now enough to argue about the number of valid types associated with a variable assignment $X$ to prove our lemma. We make this argument next by constructing all valid types associated with $X$ .

First consider all domain elements with a fixed probability value $\zeta_{i}$ ; the number of these elements is equal to $\sum_{j=0}^{b_{2}}X_{ij}$ . We can generate part of a valid type corresponding to probability value $\zeta_{i}$ by picking any partition of these $\sum_{j=0}^{b_{2}}X_{ij}$ domain elements into groups of sizes $\{X_{ij}\}_{j\in[0,b_{2}]}$ . This corresponds to a multinomial coefficient, and the number of partial types associated with $X$ for probability value $\zeta_{i}$ is just,

[TABLE]

Here we only generated partial valid types corresponding to probability value $\zeta_{i}$ . To generate a full valid type we just need to combine these partial valid types generated for each probability value $\zeta_{i}$ . Let $S_{X}$ denote all such full valid types associated with a variable assignment $X$ and generating a full valid type corresponds to groups (for each probability value $\zeta_{i}$ ) of independent possibilities considered conjointly. Further the cardinality of set $S_{X}$ is just the multiplication of cardinalities of each of these groups and is explicitly written below,

[TABLE]
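The counting argument can be sanity-checked by brute force on a tiny instance. The sketch below compares the product of multinomial coefficients with an explicit enumeration of all frequency assignments that induce a given $X$ ; the function names and the toy instance are hypothetical.

```python
from itertools import product
from math import factorial

def multinomial_count(X):
    """Product over rows i of (sum_j X_ij)! / prod_j X_ij!."""
    total = 1
    for row in X:
        ways = factorial(sum(row))
        for entry in row:
            ways //= factorial(entry)   # stays integral at every step
        total *= ways
    return total

def brute_force_count(levels, X):
    """Enumerate every assignment of a frequency column to each element
    and count the assignments whose induced matrix equals X."""
    b1, cols = len(X), len(X[0])
    matches = 0
    for psi in product(range(cols), repeat=len(levels)):
        induced = [[0] * cols for _ in range(b1)]
        for i, j in zip(levels, psi):
            induced[i][j] += 1
        matches += induced == X
    return matches

levels = [0, 0, 0, 1, 1]       # probability level of each domain element
X = [[1, 2, 0], [0, 1, 1]]     # row sums match the level counts (3 and 2)
closed_form = multinomial_count(X)
enumerated = brute_force_count(levels, X)
```

Both counts agree on this instance, as expected, since the choices made for different probability levels are independent.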

We are almost done with the proof and all we do next is formally derive the expression in our lemma statement to complete the proof. From Equation 3,

[TABLE]

∎

In the lemma above we wrote the $\mathbb{P}(\textbf{q},\phi^{\prime})$ in terms of constraint set $\textbf{K}_{\textbf{q},\phi^{\prime}}$ and to use this definition we need access to pseudo-distribution q. We overcome this difficulty in our next lemma by giving an inequality that relates $\mathbb{P}(\textbf{q},\phi^{\prime})$ with constraint set $\textbf{K}_{\phi^{\prime}}$ that only depends on $\phi^{\prime}$ and not q itself.

Lemma 4.11 (DPML objective relaxed).

For any sequence $y^{n}\in\mathcal{D}^{n}$ , and a discrete pseudo-distribution $\textbf{q}\in\Delta^{\mathcal{D}}$ the DPML objective can be upper bounded by:

[TABLE]

where $\phi^{\prime}=\Phi^{\prime}(y^{n})\in\Phi_{\mathrm{discrete}}^{n}$ is discrete profile of $y^{n}$ .

Proof.

The proof follows because $\textbf{K}_{\textbf{q},\phi^{\prime}}\subseteq\textbf{K}_{\phi^{\prime}}$ and invoking Lemma 4.10. ∎

In the above lemma we only showed one side of the inequality and it is not clear how working with the RHS relates to the LHS. In Section 4.5 we present an algorithm to achieve the other side of the inequality. The cardinality of set $\textbf{K}_{\phi^{\prime}}$ in the above formulation is small and we formalize this next.

Lemma 4.12 (Cardinality of $\textbf{K}_{\phi^{\prime}}$ ).

For any sequence $y^{n}\in\mathcal{D}^{n}$ and its associated discrete profile $\phi^{\prime}=\Phi^{\prime}(y^{n})$ :

[TABLE]

Proof.

$\textbf{K}_{\phi^{\prime}}$ is a set of vectors in $\mathbb{Z}_{+}^{b_{1}\times(b_{2}+1)}$ and each coordinate takes an integer value in $[0,2n^{2}]$ (Lemma 4.1 combined with the constraint $\zeta^{T}X\mathrm{1}\leq 1$ ensures this fact). The lemma statement follows because $|\textbf{K}_{\phi^{\prime}}|\leq(2n^{2}+1)^{b_{1}(b_{2}+1)}\in\exp\left((b_{1}\times b_{2})O(\log n)\right)$ . ∎

In our final optimization problem we just optimize over one term in the set $\textbf{K}_{\phi^{\prime}}$ instead of working with summation over all the terms. Focusing on the largest of these terms, gives a $1/|\textbf{K}_{\phi^{\prime}}|$ approximation of the sum. Combining this with Lemma 4.12 motivates us to consider the following objective, define:

[TABLE]

It is important to note that there is a discrete pseudo-distribution $\textbf{q}_{X}$ that corresponds to each variable assignment $X\in\textbf{K}_{\phi^{\prime}}$ . The description of this pseudo-distribution is as follows: for each $i\in[1,b_{1}]$ , the number of domain elements with probability value $\zeta_{i}$ in $\textbf{q}_{X}$ is equal to $(X\mathrm{1})_{i}$ . [Footnote 6: This description only provides non zero probability values and also does not provide any labels, however it is sufficient for estimating all symmetric properties mentioned in this paper.] We now go ahead and define the optimization problem involving $\textbf{w}_{\mathrm{sdpml}}(X)$ that also helps us compute the term that is largest in the summation of terms in Equation 8. After this definition, we provide a lemma relating the PML objective with this new optimization problem.

Definition 4.13 (Single discrete profile maximum likelihood).

For any sequence $y^{n}\in\mathcal{D}^{n}$ and its associated discrete profile $\phi^{\prime}=\Phi^{\prime}(y^{n})$ , a single discrete profile maximum likelihood (SDPML) distribution $\textbf{q}_{sdpml,\phi^{\prime}}$ is:

[TABLE]

and $\textbf{q}_{sdpml,\phi^{\prime}}$ is the pseudo-distribution corresponding to $X_{sdpml,\phi^{\prime}}$ .

Lemma 4.14 (SDPML relation to PML).

For any sequence $y^{n}\in\mathcal{D}^{n}$ ,

[TABLE]

where $\phi=\Phi(y^{n})$ and $\phi^{\prime}=\Phi^{\prime}(y^{n})$ are the profile and discrete profile associated with $y^{n}$ .

Proof.

$\binom{n^{\prime}}{\phi^{\prime}}\textbf{w}_{\mathrm{sdpml}}(X_{sdpml,\phi^{\prime}})\geq\binom{n^{\prime}}{\phi^{\prime}}\textbf{w}_{\mathrm{sdpml}}(X_{dpml,\phi^{\prime}})\geq\exp\left(-(b_{1}\times b_{2})\log n\right)\mathbb{P}(\textbf{q}_{dpml,\phi^{\prime}},\phi^{\prime})$

[TABLE]

The second inequality follows from Lemmas 4.12 and 4.11, and the last follows from Corollary 4.9. ∎

To simplify and better understand the expression in Lemma 4.14 just substitute $\epsilon_{1}=\epsilon_{2}=\frac{1}{n^{1/3}}$ and note that $X_{sdpml,\phi^{\prime}}\in\textbf{K}_{\textbf{q}_{sdpml},\phi^{\prime}}$ , and $\textbf{w}_{\mathrm{sdpml}}(X_{sdpml,\phi^{\prime}})$ is just one term in the summation of terms in Equation 6. Using Lemma 4.10 we know that $\binom{n^{\prime}}{\phi^{\prime}}\textbf{w}_{\mathrm{sdpml}}(X_{sdpml,\phi^{\prime}})\leq\mathbb{P}(\textbf{q}_{sdpml,\phi^{\prime}},\phi^{\prime})$ and combining this with previous lemma we get that the discrete pseudo-distribution $\textbf{q}_{sdpml,\phi^{\prime}}$ is an $\exp(-\widetilde{O}(n^{2/3}))$ -approximate PML pseudo-distribution. All we do next is provide a convex relaxation for function $\textbf{w}_{\mathrm{sdpml}}(X)$ to arrive at our final optimization problem. This relaxation produces a real valued $X$ and later we give a rounding algorithm to get an integral solution.
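The $\widetilde{O}(n^{2/3})$ exponent above can be traced through the parameter choice. The following back-of-the-envelope calculation is a sketch, assuming (as the proof of Lemma 4.4 suggests for $\epsilon_{1}$ ) that the discretization losses in the exponent are of order $(\epsilon_{1}+\epsilon_{2})n$ and that the cardinality loss is of order $b_{1}b_{2}\log n$ :

```latex
\epsilon_{1}=\epsilon_{2}=n^{-1/3}
\;\Longrightarrow\;
b_{1}=O\!\left(\frac{\log n}{\epsilon_{1}}\right)=O\!\left(n^{1/3}\log n\right),
\qquad
b_{2}=O\!\left(\frac{\log n}{\epsilon_{2}}\right)=O\!\left(n^{1/3}\log n\right),
\\[4pt]
(\epsilon_{1}+\epsilon_{2})\,n=2n^{2/3},
\qquad
b_{1}b_{2}\log n=O\!\left(n^{2/3}\log^{3}n\right)=\widetilde{O}\!\left(n^{2/3}\right).
```

Both contributions are therefore $\widetilde{O}(n^{2/3})$ ; a smaller $\epsilon$ would reduce the discretization loss but enlarge $b_{1}b_{2}$ , and $n^{-1/3}$ balances the two.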

4.4 Convex relaxation of SDPML

In the previous subsection we showed that the SDPML objective is a good approximation to the PML objective. However the objective function of SDPML is defined only over the integers and in this subsection we present a convex relaxation of SDPML.

First, we consider the feasible set $\textbf{K}_{\phi^{\prime}}$ of SDPML and relax the integer constraint on variables $X_{ij}$ to get the following new constraint set:

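$\textbf{K}^{f}_{\phi^{\prime}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{X\in\mathbb{R}^{b_{1}\times(b_{2}+1)}~{}\big{|}~{}(X^{T}\mathrm{1})_{[1,b_{2}]}=\phi^{\prime},\text{ and }\zeta^{T}X\mathrm{1}\leq 1\}$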

In the later subsections, we show how we deal with these fractional solutions by presenting a rounding algorithm with a good approximation ratio.

Secondly, we relax the objective function of SDPML itself. The objective of SDPML is defined only on the integral set. We next define a continuous relaxation of this objective function which is also log-concave.

[TABLE]

The lemma below states that continuous version is not far from the actual SDPML objective.

Lemma 4.15 ( $\textbf{g}(\cdot)$ approximates SDPML objective).

For any sequence $y^{n}\in\mathcal{D}^{n}$ and its associated discrete profile $\phi^{\prime}=\Phi^{\prime}(y^{n})$ . If $X\in\textbf{K}_{\phi^{\prime}}$ , then

[TABLE]

Proof.

See Appendix C. ∎

A key fact about function $\textbf{g}(X)$ is that it is log-concave, so we can apply optimization machinery from convex optimization to optimize it.

Lemma 4.16.

Function $\textbf{g}(X)$ is log-concave in $X$ .

Proof.

See Appendix C. ∎

Maximizing the log-concave objective function $\textbf{g}(\cdot)$ over the relaxed convex set $\textbf{K}^{f}_{\phi^{\prime}}$ easily reduces to a convex optimization problem and can be solved efficiently. Below is the convex relaxation of our SDPML objective,

[TABLE]

The formulation above is in the form of the general optimization problem $(11.14)$ in [LSW15a], which solves it using a cutting plane method. The algorithm in [LSW15a] requires implementing a $\delta$ -2nd-order-optimization oracle (defined later in the appendix) and we provide an algorithm to implement this $\delta$ -2nd-order-optimization oracle for our convex program. Further, to upper bound the number of calls to such an oracle we need to bound the singular values of our constraint matrix. Putting everything together, we get the following theorem.

Theorem 4.17 (Solver for convex relaxation to SDPML).

There exists a cutting plane method based algorithm that outputs a feasible solution $X^{\prime}$ to optimization problem 12, i.e. $X^{\prime}\in\textbf{K}^{f}_{\phi^{\prime}}$ and satisfies:

[TABLE]

in $O\left(b_{2}^{2}b_{1}\log^{O(1)}(\frac{b_{1}b_{2}}{\delta})+b_{2}^{3}\log^{O(1)}(\frac{b_{1}b_{2}}{\delta})\right)$ time.

Proof.

See Appendix D. ∎

4.5 Algorithm and runtime analysis

Here we give the complete description of our final algorithm to find an approximate PML distribution. The analysis in previous sections suggests that it suffices to find a discrete pseudo-distribution that approximates SDPML objective, which we replaced by a convex relaxation. First, we give the complete algorithm. Then, we present the algorithm that takes an optimal solution to the convex proxy for SDPML and produces an approximate PML distribution. Recall that $\textbf{K}^{f}_{\phi^{\prime}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{X\in\mathbb{R}^{b_{1}\times(b_{2}+1)}~{}\big{|}~{}(X^{T}\mathrm{1})_{[1,b_{2}]}=\phi^{\prime},\text{ and }\zeta^{T}X\mathrm{1}\leq 1\}$ .

In the algorithm we first maximize over the set of fractional solutions $\textbf{K}^{f}_{\phi^{\prime}}$ instead of $\textbf{K}_{\phi^{\prime}}$ and we round our solution $X^{\prime}$ to an integral solution $X$ that belongs to an extended version of the set $\textbf{K}_{\phi^{\prime}}$ . The rounding algorithm is presented next.

The solution $X$ returned by the rounding procedure is defined on an extended discretized probability space $\textbf{P}^{\prime}$ , where $\textbf{P}^{\prime}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\textbf{P}\cup\{\zeta_{b_{1}+j}\}_{j\in[1,b_{2}]}$ . To derive the relation between solution $X$ and PML objective value we need to extend some definitions studied earlier. First, we define $\zeta_{ext}$ as the vector whose entries are exactly the elements of $\textbf{P}^{\prime}$ . Note we still use $\zeta_{i}$ for all $i\in[1,b_{1}+b_{2}]$ to refer to elements of $\zeta_{ext}$ . Further, for any pseudo-distribution q with all its probability values in set $\textbf{P}^{\prime}$ (we call it an extended discrete pseudo-distribution) and discrete profile $\phi^{\prime}$ , we first define following extensions of sets $\textbf{K}_{\textbf{q},\phi^{\prime}}$ and $\textbf{K}_{\phi^{\prime}}$ ,

[TABLE]

where $\ell^{\textbf{q}}\in\mathbb{R}^{b_{1}+b_{2}}$ and $\ell^{\textbf{q}}_{i}$ denotes the number of domain elements with probability value $\zeta_{i}\in\textbf{P}^{\prime}$ .

Further by Lemma 4.10, for any extended discrete pseudo-distribution q and a discrete profile $\phi^{\prime}$ , the following equality holds,

[TABLE]

Similarly, for any $X\in\textbf{K}^{ext}_{\textbf{q},\phi^{\prime}}$ , below are the natural extensions of the definitions of the functions $\textbf{w}_{\mathrm{sdpml}}(\cdot)$ and $\textbf{g}(\cdot)$ ,

[TABLE]

We are ready to analyze our rounding algorithm. First, we provide some properties of the solution $X$ returned by our rounding procedure.

Claim 4.18.

The solution $X\in\mathbb{Z}_{+}^{(b_{1}+b_{2})\times(b_{2}+1)}$ returned by rounding procedure (2) above satisfies:

1. $(X^{\prime}\mathrm{1})_{i}-(b_{2}+1)\leq(X\mathrm{1})_{i}\leq(X^{\prime}\mathrm{1})_{i}\quad\forall i\in[1,b_{1}]$ .

2. $X\in\textbf{K}^{ext}_{\phi^{\prime}}$ .

Proof.

Claim (1) follows because $X^{\prime}_{ij}-1\leq X_{ij}\leq X^{\prime}_{ij}$ for all $i\in[1,b_{1}],j\in[0,b_{2}]$ . Now note $\sum_{i=1}^{b_{1}+b_{2}}X_{ij}=\sum_{i=1}^{b_{1}}X^{\prime}_{ij}=\phi^{\prime}_{\mathrm{m}_{j}}\quad\forall j\in[1,b_{2}]$ because of the adjustments made by new level sets. Further,

[TABLE]

The final inequality follows because $X^{\prime}\in\textbf{K}^{f}_{\phi^{\prime}}$ and therefore $X\in\textbf{K}^{ext}_{\phi^{\prime}}$ and Claim (2) follows. ∎
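The two properties in Claim 4.18 can be illustrated on a toy instance. The Python sketch below is only an illustration under stated assumptions: Algorithm 2 is not reproduced in this section, so we assume that entries are rounded down and that, for each column $j\geq 1$ , the mass lost by rounding is placed in a new row $b_{1}+j$ that is nonzero only in column $j$ (consistent with the facts used in the proofs). The function name `round_with_new_levels` and the input matrix are hypothetical.

```python
from fractions import Fraction
from math import floor

def round_with_new_levels(x_frac):
    """Round a fractional b1 x (b2+1) solution to integers, adding b2 new rows.

    Assumption: entries are floored, and for each column j >= 1 the mass lost
    by flooring goes to a new row b1 + j that is nonzero only in column j.
    """
    b1, ncols = len(x_frac), len(x_frac[0])
    x_int = [[floor(v) for v in row] for row in x_frac]
    for j in range(1, ncols):
        deficit = sum(x_frac[i][j] - x_int[i][j] for i in range(b1))
        # Column sums of x_frac are integers (profile entries), so is the deficit.
        assert deficit.denominator == 1, "column sums must be integers"
        new_row = [0] * ncols
        new_row[j] = int(deficit)
        x_int.append(new_row)
    return x_int

# Toy fractional solution with b1 = 2 rows and columns 0..2 (b2 = 2).
x_frac = [[Fraction(1, 2), Fraction(3, 2), Fraction(1, 3)],
          [Fraction(1, 4), Fraction(1, 2), Fraction(5, 3)]]
x_int = round_with_new_levels(x_frac)
```

Exact rational arithmetic is used so that the integrality of the column sums is not disturbed by floating-point error.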

We next show that for any solution $X$ returned by our rounding algorithm (2), the values $\textbf{w}_{\mathrm{sdpml}}(X)$ and $\textbf{g}(X)$ are close to each other. The next lemma summarizes this.

Lemma 4.19.

Any $X\in\textbf{K}^{ext}_{\phi^{\prime}}$ returned by the rounding procedure above satisfies:

[TABLE]

Proof.

See Appendix C. ∎

Further using Equation 13, for any $X\in\textbf{K}^{ext}_{\phi^{\prime}}$ , if $\textbf{q}_{X}$ is its corresponding extended discrete pseudo-distribution, then

[TABLE]

In our next lemma, we show that the solution $X\in\textbf{K}^{ext}_{\phi^{\prime}}$ returned by the rounding procedure approximates $\textbf{w}_{\mathrm{sdpml}}(X_{sdpml})$ . Note from Lemma 4.14, we know that $\textbf{w}_{\mathrm{sdpml}}(X_{sdpml})$ is a good approximation to the PML objective.

Lemma 4.20.

The solution $X\in\textbf{K}^{ext}_{\phi^{\prime}}$ returned by the rounding procedure above satisfies:

[TABLE]

Proof.

For any $X^{\prime}\in\textbf{K}^{f}_{\phi^{\prime}}$ and the solution $X\in\textbf{K}^{ext}_{\phi^{\prime}}$ returned by our rounding procedure, below are the explicit expressions for $\textbf{g}(X)$ and $\textbf{g}(X^{\prime})$ :

[TABLE]

We first bound the probability term:

[TABLE]

The first inequality follows because $\mathrm{m}_{0}=0$ . The fourth inequality follows from the AM-GM inequality. The final expression above is the probability term associated with $X$ , so the derivation shows that our rounding procedure only increases the probability term. It remains to bound the counting term, which we do next.

[TABLE]

In the derivation above we used (1) in Claim 4.18. It remains now to lower bound $\textbf{w}_{\mathrm{sdpml}}(X)$ :

[TABLE]

The first and second inequality follow from Lemma 4.19 and Equation 17 respectively. In the third inequality we used $\textbf{g}(X^{\prime})\geq\textbf{g}(X_{sdpml})$ because $X^{\prime}$ is the optimal solution over the relaxed constraint set $\textbf{K}^{f}_{\phi^{\prime}}$ and finally invoked Lemma 4.15 to relate $\textbf{w}_{\mathrm{sdpml}}$ and g. ∎

Now construct the extended discrete pseudo-distribution $\textbf{q}_{X}$ corresponding to the solution $X$ returned by Algorithm 2 by assigning $(X\mathrm{1})_{i}$ elements a probability value of $\zeta_{i}$ $(\forall i\in[b_{1}+b_{2}])$ . We next prove our main theorem, which shows that the distribution $\frac{\textbf{q}_{X}}{\|\textbf{q}_{X}\|_{1}}$ is an approximate PML distribution.
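The construction of $\textbf{q}_{X}$ from $X$ and its normalization can be sketched in a few lines of Python. This is a minimal illustration on made-up inputs: the function names `pseudo_distribution` and `normalize`, the matrix `x_int`, and the level values `zeta_ext` are hypothetical and not taken from the paper.

```python
from fractions import Fraction

def pseudo_distribution(x_int, zeta_ext):
    """List the probability values of q_X: (X1)_i copies of zeta_i for each row i."""
    q = []
    for row, zeta_i in zip(x_int, zeta_ext):
        q.extend([zeta_i] * sum(row))
    return q

def normalize(q):
    """Scale a pseudo-distribution so that its entries sum to one."""
    total = sum(q)
    return [value / total for value in q]

# Toy input (made up): two probability levels, three frequency columns.
x_int = [[1, 2, 0], [0, 0, 1]]
zeta_ext = [Fraction(1, 8), Fraction(1, 4)]
q_x = pseudo_distribution(x_int, zeta_ext)   # total mass 5/8, at most 1
p_approx = normalize(q_x)                    # a genuine distribution
```

In this toy case the pseudo-distribution has total mass $5/8\leq 1$ , in line with the constraint $\zeta_{ext}^{T}X\mathrm{1}\leq 1$ , and dividing by $\|\textbf{q}_{X}\|_{1}$ turns it into a distribution.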

See Theorem 3.1.

Proof.

Let $\textbf{q}_{X}$ be the pseudo-distribution corresponding to solution $X$ returned by Algorithm 2. Set $\textbf{p}_{approx}=\frac{\textbf{q}_{X}}{\|\textbf{q}_{X}\|_{1}}$ , then:

[TABLE]

The first inequality follows because $\|\textbf{q}_{X}\|_{1}\leq 1$ , and the second follows from 4.7. The third inequality follows because $X\in\textbf{K}^{ext}_{\textbf{q}_{X},\phi^{\prime}}$ (we constructed $\textbf{q}_{X}$ from $X$ ) and $\textbf{w}_{\mathrm{sdpml}}(X)$ is just one term in the summation over $\textbf{K}^{ext}_{\textbf{q}_{X},\phi^{\prime}}$ (see the representation of $\mathbb{P}(\textbf{q}_{X},\phi^{\prime})$ as a summation over $\textbf{K}^{ext}_{\textbf{q}_{X},\phi^{\prime}}$ in Equation 15). The fourth inequality comes from Lemma 4.20 and the last inequality follows from Lemma 4.14.

We bound the total running time as follows. Given a profile $\phi$ , it takes $O(\phi_{size})$ time to write down the discrete profile $\phi^{\prime}$ . We then need to solve the convex optimization problem 12, which takes $O\left(\frac{1}{\epsilon_{2}^{2}\times\epsilon_{1}}\log^{O(1)}(\frac{1}{\epsilon_{1}\epsilon_{2}})+\frac{1}{\epsilon_{2}^{3}}\log^{O(1)}(\frac{1}{\epsilon_{1}\epsilon_{2}})\right)$ time, and our final rounding algorithm can be implemented in time $O(\frac{\log^{2}n}{\epsilon_{1}\epsilon_{2}})$ ( $=O(b_{1}b_{2})$ ). The claimed running time follows by combining these bounds. ∎

5 Unified optimal sample complexity for symmetric properties

Here we study the connection between a universal estimator and approximate PML. We first recall the following theorem from [ADOS16].

Theorem 5.1 (Theorem 4 of [ADOS16]).

For a symmetric property f, suppose there is an estimator $\hat{\textbf{f}}:\Phi^{n}\rightarrow\mathbb{R}$ , such that for any p and observed profile $\phi$ ,

[TABLE]

any $\beta$ -approximate PML distribution satisfies:

[TABLE]

Our goal here is to prove Theorem 3.3 that shows the following: computing an $\exp(\widetilde{O}(n^{2/3}))$ -approximate PML distribution is sufficient to get a plug-in universal estimator that is sample competitive for estimating support size, coverage, entropy and distance from uniform. The proof presented in [ADOS16] showed this connection for an $\exp(\sqrt{n})$ -approximate PML estimator and it is easy to see the proof presented in [ADOS16] works for any $\exp(n^{1-\delta})$ -approximate PML estimator for constant $\delta>0$ . We will need the following two lemmas from [ADOS16, HR18].

Lemma 5.2 (Lemma 2 of [ADOS16]).

Let $\alpha>0$ be a fixed constant. For entropy, support, support coverage, and distance to uniformity there exist profile-based estimators that use the optimal number of samples, have bias $\epsilon$ , and change by at most $c\cdot\frac{n^{\alpha}}{n}$ if any one of the samples is changed, where $c$ is a positive constant.

Lemma 5.3 ([HR18]).

$|\Phi^{n}|\leq\exp\left(3\sqrt{n}\right)$ .
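Lemma 5.3 can be sanity-checked numerically. A profile of a length- $n$ sequence is determined by the multiset of nonzero symbol frequencies, which sum to $n$ , so $|\Phi^{n}|$ equals the number of integer partitions of $n$ . The Python sketch below counts partitions with a standard dynamic program and compares the count with $\exp(3\sqrt{n})$ for a few values of $n$ ; the function name `partition_count` is ours, and the check is an illustration rather than a proof.

```python
import math

def partition_count(n):
    """Number of integer partitions of n, i.e. of profiles of length-n sequences."""
    ways = [1] + [0] * n            # ways[t]: partitions of t using the parts seen so far
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]

# Compare the exact count with the bound of Lemma 5.3 for a few small n.
for n in (1, 10, 100, 400):
    assert partition_count(n) <= math.exp(3 * math.sqrt(n))
```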

See Theorem 3.3.

Proof.

Let f be the property we wish to estimate, p be the underlying distribution, and $x^{n},\phi$ be the observed sequence and profile. Set $\alpha=\eta$ ( $\eta$ is a constant and so is $\alpha$ ) and let $\hat{\textbf{f}}$ be the estimator returned by Lemma 5.2. The bias of estimator $\hat{\textbf{f}}$ is

[TABLE]

By McDiarmid’s inequality we get:

[TABLE]

where $c_{*}$ is the change in $\hat{\textbf{f}}$ when one of the samples is changed. Using these inequalities we get:

[TABLE]

In the derivation above we used $c_{*}\leq c\cdot\frac{n^{\alpha}}{n}$ (Lemma 5.2). Invoking Theorem 5.1 with $\delta=\exp\left(-\frac{2\epsilon^{2}}{c^{2}}n^{1-2\alpha}\right)$ we get:

[TABLE]

In the first inequality we used Lemma 5.3. ∎
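For reference, the exponent in the choice of $\delta$ in the proof above can be traced as follows. The display uses the standard form of McDiarmid's inequality for a function whose value changes by at most $c_{*}$ when a single sample is changed; it is our reconstruction under that standard form, and the constants in front of the exponential may differ from those in the displayed equations of the proof.

$$\mathbb{P}\left(\left|\hat{\textbf{f}}(\phi)-\mathbb{E}[\hat{\textbf{f}}(\phi)]\right|\geq\epsilon\right)\leq 2\exp\left(-\frac{2\epsilon^{2}}{n\,c_{*}^{2}}\right),\qquad c_{*}\leq c\cdot\frac{n^{\alpha}}{n}\;\Longrightarrow\;n\,c_{*}^{2}\leq c^{2}n^{2\alpha-1}.$$

$$\text{Hence}\quad\exp\left(-\frac{2\epsilon^{2}}{n\,c_{*}^{2}}\right)\leq\exp\left(-\frac{2\epsilon^{2}}{c^{2}}n^{1-2\alpha}\right)=\delta.$$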

Appendix A Minimum Probability

Here we provide the proof for our first technical lemma that gives a lower bound of $\Omega(\frac{1}{n^{2}})$ for the minimum non-zero probability value of an $\exp\left(-6\right)$ -approximate PML distribution. To show such a result we use an independent rounding algorithm that is described in the lemma below. We need the following simple claim for the proof of our next lemma.

Claim A.1.

For any non-negative and non-zero vector v and a profile $\phi\in\Phi^{n}$ ,

[TABLE]

Proof.

[TABLE]

∎

See Lemma 4.1.

Proof.

We do independent rounding to show the existence of such a distribution. For notational convenience we use $\textbf{p}_{pml,\phi}(x)$ to denote the probability of symbol $x$ in the PML distribution $\textbf{p}_{pml,\phi}$ . Let $\textbf{S}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{x\in\mathcal{D}~{}|~{}\textbf{p}_{pml,\phi}(x)<\frac{1}{n^{2}}\}$ and for all $x\in\textbf{S}$ we define a random variable $Y_{x}$ as follows:

[TABLE]

Clearly $\forall x\in\textbf{S}$ ,

[TABLE]

and in general for any integer power $i$ of random variable $Y_{x}$ we have:

[TABLE]

For the remaining $x\in\bar{\textbf{S}}$ ( $\bar{\textbf{S}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\mathcal{D}\backslash\textbf{S}$ ) with $\textbf{p}_{pml,\phi}(x)\geq\frac{1}{n^{2}}$ we define:

[TABLE]

Define $\textbf{Y}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}(Y_{x})_{x\in\textbf{S}}$ and $\textbf{Z}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}(Z_{x})_{x\in\bar{\textbf{S}}}$ .

[TABLE]

Define $\textbf{p}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}(\textbf{Y},\textbf{Z})$ to be the concatenation of random vectors Y and Z. All random variables $Y_{x},Z_{x}$ are mutually independent and we have:

[TABLE]

(From Equation 67,68 and the fact that $Z_{x}$ is a constant random variable).

When we generate a random sample p from this distribution, we have a lower bound on the expected value of $\mathbb{P}(\textbf{p},\phi)$ , but this is misleading since p may not be a distribution. Scaling p to have unit $\ell_{1}$ norm could significantly reduce the value of $\mathbb{P}(\textbf{p},\phi)$ if $\|\textbf{p}\|_{1}$ is large. However, we show that a constant fraction of the expectation of $\mathbb{P}(\textbf{p},\phi)$ comes from the part of the sample space with bounded $\|\textbf{p}\|_{1}\leq 1+\frac{c}{n}$ . Here $c$ is a constant, and we assume $c\geq 3$ . Note that:

[TABLE]

The last inequality follows because Z is a constant random vector.

[TABLE]

To argue that a constant fraction of the expectation comes from the sample space with small $\|\textbf{p}\|_{1}$ we need a tight upper bound for:

[TABLE]

For $t\geq c$ , we first upper bound the probability term:

[TABLE]

We will use Chernoff bounds here and to apply them, we convert the $Y_{x}$ random variables into $\{0,1\}$ Bernoulli random variables. Define $\forall x\in\textbf{S}$ ,

[TABLE]

Equivalently:

[TABLE]

Define $\textbf{Y}^{\prime}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}(Y^{\prime}_{x})_{x\in\textbf{S}}$ and $\mu^{\prime}_{\textbf{S}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\mathbb{E}\left[\|\textbf{Y}^{\prime}\|_{1}\right]=n^{2}\mu_{\textbf{S}}\leq n^{2}$ . For any $t>0$ ,

[TABLE]

Since $\|\textbf{Y}^{\prime}\|_{1}$ is a sum of Bernoulli random variables, by Chernoff bounds:

[TABLE]

Note from Claim A.1 that:

[TABLE]

Substituting back in Equation 69 we have (for $c\geq 3$ ),

[TABLE]

The above inequality implies the existence of a $\textbf{p}^{\prime}$ with $\mathbb{P}(\textbf{p}^{\prime},\phi)\geq\frac{1}{4}\mathbb{P}(\textbf{p}_{pml,\phi},\phi)$ and $\|\textbf{p}^{\prime}\|_{1}\leq 1+\frac{c}{n}$ . Define $\textbf{p}^{\prime\prime}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\textbf{p}^{\prime}/\|\textbf{p}^{\prime}\|_{1}$ ,

[TABLE]

In the final inequality we substitute $c=3$ and observe that $\frac{\exp\left(-c\right)}{4}\geq 1/100$ . Also, our rounding procedure always ensures that the minimum non-zero entry of $\textbf{p}^{\prime}$ is $\geq\frac{1}{n^{2}}$ , which further implies that the minimum non-zero probability value of $\textbf{p}^{\prime\prime}$ is at least $\frac{1}{n^{2}}\frac{1}{\|\textbf{p}^{\prime}\|_{1}}\geq\frac{1}{n^{2}}\frac{1}{1+c/n}\geq\frac{1}{2n^{2}}$ (the last step uses $n\geq c$ ). Hence $\textbf{p}^{\prime\prime}$ is our final distribution satisfying the conditions of the lemma. ∎
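The independent rounding used in the proof above can be sketched in Python. The exact definition of $Y_{x}$ is given in a displayed equation; the sketch assumes the natural reading suggested by the Bernoulli variables $Y^{\prime}_{x}$ , namely that $Y_{x}=\frac{1}{n^{2}}$ with probability $n^{2}\,\textbf{p}_{pml,\phi}(x)$ and $Y_{x}=0$ otherwise, while symbols with probability at least $\frac{1}{n^{2}}$ are kept unchanged. The function name `independent_rounding` and the example vector are hypothetical.

```python
import random

def independent_rounding(p, n, rng):
    """Round probabilities below 1/n^2 to either 0 or 1/n^2, preserving the mean.

    Assumption: Y_x = 1/n^2 with probability n^2 * p_x and 0 otherwise; entries
    that are already at least 1/n^2 are left unchanged (the constant Z_x).
    """
    threshold = 1.0 / n**2
    rounded = []
    for px in p:
        if px >= threshold:
            rounded.append(px)            # constant random variable Z_x
        elif rng.random() < px * n**2:
            rounded.append(threshold)     # rounded up with probability n^2 * p_x
        else:
            rounded.append(0.0)           # rounded down otherwise
    return rounded

# Toy example: n = 10, so the threshold is 1/100; ten symbols fall below it.
n = 10
p = [0.5, 0.3, 0.19] + [0.001] * 10
rounded = independent_rounding(p, n, random.Random(0))
```

The output need not sum to one, which is exactly why the proof has to control $\|\textbf{p}\|_{1}$ before rescaling.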

Appendix B Profile Discretization Lemma

Here we prove our profile discretization lemma. We first introduce a new definition called discrete type and then provide new formulations which help us in our proof.

Definition B.1 (Discrete type).

For a sequence $y^{n}\in\mathcal{D}^{n}$ , its discrete type $\psi^{\prime}=\Psi^{\prime}(y^{n})\in\textbf{M}^{\mathcal{D}}$ is:

[TABLE]

For a sequence $y^{n}\in\mathcal{D}^{n}$ let $\textbf{D}=\{\textbf{f}(y^{n},x)\}_{x\in\mathcal{D}}\cup\{1,\dots\lceil\frac{1}{\epsilon_{2}}\rceil\}$ be the set of all its distinct frequencies together with all integers from 1 to $\lceil\frac{1}{\epsilon_{2}}\rceil$ , and let $d_{1}<d_{2}<\dots<d_{|\textbf{D}|}$ be the elements of the set D. For this extended set D, the definition of the profile $\phi=(\phi_{j})_{j=1\dots|\textbf{D}|}$ is still the same, namely $\phi_{j}=|\{x\in\mathcal{D}~{}|~{}\textbf{f}(y^{n},x)=d_{j}\}|$ . In this extended definition there might be indices $j\in[1,|\textbf{D}|]$ with $\phi_{j}=0$ ; the extension helps us write a cleaner proof of the next lemma. We first state an equivalent formulation for the probability of the profile $\phi=\Phi(y^{n})$ (from Equation 20 in [OSZ03], Equation 15 in [PJW17]) in terms of the type $\psi=\Psi(y^{n})$ :

[TABLE]

where $S_{\mathcal{D}}$ is the set of all permutations of domain set $\mathcal{D}$ and $\phi_{0}$ is the number of unseen domain elements. The difference between Equation 24 and Equation 3 is the index set over which they are summed.

See Lemma 4.6.

Proof.

Let $\psi=\Psi(y^{n})$ and $\psi^{\prime}=\Psi^{\prime}(y^{n})$ be the type and discrete type of sequence $y^{n}$ respectively. By Equation 24:

[TABLE]

Similarly:

[TABLE]

where $\phi^{\prime}_{0}$ is the number of unseen domain elements in profile $\phi^{\prime}$ . Note $\phi^{\prime}_{0}=\phi_{0}$ because our discretization procedure does not change the number of unseen domain elements. We now analyze both objectives term by term. For any permutation $\sigma\in S_{\mathcal{D}}$

[TABLE]

The first inequality above follows because $\psi^{\prime}_{\sigma(x)}\leq\psi_{\sigma(x)}(1+\epsilon_{2})$ and using $\psi_{\sigma(x)}\leq\psi^{\prime}_{\sigma(x)}$ we get the following inequality.

[TABLE]

Let us next consider the terms $C_{\phi}$ and $C_{\phi^{\prime}}$ :

[TABLE]

Next we lower bound the same quantity:

[TABLE]

Combining both we get:

[TABLE]

To bound our final term we use the extended definition of D. In this definition of D we included all integers less than $\lceil\frac{1}{\epsilon_{2}}\rceil$ and we have $d_{j}=j$ for all $j\leq\lceil\frac{1}{\epsilon_{2}}\rceil$ . Similarly recall all integers less than $\lceil\frac{1}{\epsilon_{2}}\rceil$ also belong to set M and therefore $\mathrm{m}_{j}=j$ for all $j\leq\lceil\frac{1}{\epsilon_{2}}\rceil$ . Now observe that any frequency strictly less than $\lceil\frac{1}{\epsilon_{2}}\rceil$ ( $d_{j}<\lceil\frac{1}{\epsilon_{2}}\rceil$ ) is not discretized and,

[TABLE]

The number of domain symbols $x\in\mathcal{D}$ with $\textbf{f}(y^{n},x)\geq\lceil\frac{1}{\epsilon_{2}}\rceil$ is at most $\epsilon_{2}n$ because the frequencies sum to $n$ ; that is, $\sum_{j\geq\lceil\frac{1}{\epsilon_{2}}\rceil}{\phi_{j}}\leq\epsilon_{2}n$ . This further implies $\sum_{j\geq\lceil\frac{1}{\epsilon_{2}}\rceil}{\phi^{\prime}_{j}}\leq\epsilon_{2}n$ . Hence the ratio evaluates to:

[TABLE]

Rewriting the final inequality:

[TABLE]

Combining Equations 25, 26 and 27 we have our result. ∎
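The two properties of the discretization used in the proof above, namely that small frequencies are left unchanged and that $\psi_{x}\leq\psi^{\prime}_{x}\leq\psi_{x}(1+\epsilon_{2})$ , can be illustrated with a Python sketch. The set M is defined earlier in the paper and is not reproduced here; the grid below (all integers up to $\lceil\frac{1}{\epsilon_{2}}\rceil$ followed by a geometric grid with ratio $1+\epsilon_{2}$ ) is a hypothetical stand-in chosen only because it has these two properties. The names `discretization_grid` and `discretize` are ours.

```python
import math

def discretization_grid(n, eps):
    """Hypothetical grid: integers 1..ceil(1/eps), then a (1+eps)-geometric grid up to n."""
    base = math.ceil(1 / eps)
    grid = [float(m) for m in range(1, base + 1)]
    while grid[-1] < n:
        grid.append(grid[-1] * (1 + eps))
    return grid

def discretize(freq, grid):
    """Round a frequency up to the smallest grid value that is at least freq."""
    return next(m for m in grid if m >= freq)

# Toy parameters: n = 200 samples and eps = 1/4, so frequencies 1..4 are untouched.
grid = discretization_grid(200, 0.25)
examples = {freq: discretize(freq, grid) for freq in (3, 6, 50)}
```

Rounding up to the next grid point never increases a frequency by more than a factor $1+\epsilon_{2}$ , because consecutive grid points beyond $\lceil\frac{1}{\epsilon_{2}}\rceil$ differ by exactly that factor.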

Appendix C Remaining proofs for Section 4

Here we prove multiple lemmas associated with our functions $\textbf{w}_{\mathrm{sdpml}}(\cdot)$ and $\textbf{g}(\cdot)$ . Our first lemma shows that the functions $\textbf{w}_{\mathrm{sdpml}}(\cdot)$ and $\textbf{g}(\cdot)$ approximate each other in their values, and later we also show that the function $\textbf{g}(X)$ is log-concave in $X$ . To help the readability of this section, let us recall the definitions of the functions $\textbf{w}_{\mathrm{sdpml}}(\cdot)$ and $\textbf{g}(\cdot)$ . For any $X\in\textbf{K}^{f}_{\phi^{\prime}}$ ,

[TABLE]

Also for any $X\in\textbf{K}_{\phi^{\prime}}$ ,

[TABLE]

See Lemma 4.15.

Proof.

By Stirling’s approximation for all integers $n\geq 1$ :

[TABLE]

We use a slightly weaker version of this inequality that holds for all integers $n\geq 0$ ,

[TABLE]

In the above expression we used the fact that $(X\mathrm{1})_{i}\leq 2n^{2}$ for each $i\in[1,b_{1}]$ (Lemma 4.1 combined with the constraint $\zeta^{T}X\mathrm{1}\leq 1$ ensures this fact). Also,

[TABLE]

∎
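The Stirling bounds invoked in the proof above can be checked numerically. The sketch below uses the standard two-sided form $\sqrt{2\pi}\,n^{n+\frac{1}{2}}e^{-n}\leq n!\leq e\,n^{n+\frac{1}{2}}e^{-n}$ for integers $n\geq 1$ ; this is an assumption about which form is meant, and the constants in the paper's displayed version may differ. The helper name `stirling_bounds` is ours, and the comparison is done in log space to avoid overflow.

```python
import math

def stirling_bounds(n):
    """Return (lower, log n!, upper) for the standard Stirling bounds, in log space."""
    core = (n + 0.5) * math.log(n) - n
    lower = 0.5 * math.log(2 * math.pi) + core   # log of sqrt(2*pi) * n^(n+1/2) * e^(-n)
    upper = 1.0 + core                           # log of e * n^(n+1/2) * e^(-n)
    return lower, math.lgamma(n + 1), upper

# Check the sandwich for a range of n, allowing for floating-point rounding.
for n in range(1, 501):
    lower, log_factorial, upper = stirling_bounds(n)
    assert lower <= log_factorial + 1e-9
    assert log_factorial <= upper + 1e-9
```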

Next we show that function $\textbf{g}(X)$ is log-concave in $X$ and we need the following lemma to prove it.

Lemma C.1.

The function $h:\mathbb{R}^{l}_{\geq 0}\rightarrow\mathbb{R}$ defined for all $\textbf{a}\in\mathbb{R}^{l}_{\geq 0}$ by

[TABLE]

is convex.

Proof.

Let $\textbf{A}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\textbf{a}^{T}\mathrm{1}$ . Direct calculation reveals that for all $i\in[l]$ ,

[TABLE]

The Hessian matrix H is:

[TABLE]

Let $\textbf{D}_{\textbf{a}}=diag(\textbf{a})$ and also $\textbf{a}^{\frac{1}{2}}$ be the entry wise square root vector,

[TABLE]

The last inequality holds because $\frac{1}{\textbf{A}}\textbf{a}^{\frac{1}{2}}\textbf{a}^{\frac{1}{2}T}$ is a rank one matrix and its spectral norm is equal to 1:

[TABLE]

∎
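The key step of the proof above, that $I-\frac{1}{\textbf{A}}\textbf{a}^{\frac{1}{2}}\textbf{a}^{\frac{1}{2}T}$ is positive semidefinite because the rank one matrix has spectral norm 1, can be checked numerically. The Python sketch below evaluates the quadratic form $v^{T}(I-\frac{1}{\textbf{A}}\textbf{a}^{\frac{1}{2}}\textbf{a}^{\frac{1}{2}T})v$ on random vectors; the function name `quadratic_form` and the random test data are ours.

```python
import math
import random

def quadratic_form(a, v):
    """Evaluate v^T (I - (1/A) sqrt(a) sqrt(a)^T) v, where A = sum(a) and a > 0."""
    total = sum(a)
    inner = sum(math.sqrt(ai) * vi for ai, vi in zip(a, v))
    return sum(vi * vi for vi in v) - inner * inner / total

rng = random.Random(1)
a = [rng.uniform(0.1, 5.0) for _ in range(6)]

# The form is nonnegative on random directions (Cauchy-Schwarz) ...
values = [quadratic_form(a, [rng.uniform(-1, 1) for _ in range(6)]) for _ in range(100)]
# ... and vanishes on sqrt(a), the eigenvector of the rank one matrix with eigenvalue 1.
at_sqrt_a = quadratic_form(a, [math.sqrt(ai) for ai in a])
```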

See 4.16

Proof.

Recall the definition of $\textbf{g}(X)$ :

[TABLE]

Taking $\log$ on both sides:

[TABLE]

The first term is linear in $X$ and we consider the negative of second and third term and show it is convex.

[TABLE]

In the above expression $X_{i}\in\mathbb{R}^{b_{2}}$ is the $i$ ’th row of matrix $X$ . By Lemma C.1 each of the functions $\textbf{h}_{i}(X_{i})$ is convex, so $\textbf{h}(X)=\sum_{i=1}^{b_{1}}\textbf{h}_{i}(X_{i})$ is also convex ( $-\textbf{h}(X)$ is concave). Hence $\log\textbf{g}(X)$ is the sum of a linear and a concave function and is concave, that is, $\textbf{g}(X)$ is log-concave. ∎

In the remaining part of this section, we prove the final result of this section, which is used to bound the approximation guarantee of our rounding procedure. Recall that our rounding procedure introduces new probability values resulting in an extended discretized probability space $\textbf{P}^{\prime}$ , where $\textbf{P}^{\prime}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\textbf{P}\cup\{\zeta_{b_{1}+j}\}_{j\in[1,b_{2}]}$ . To derive the relation between the solution $X$ and the PML objective value we defined the extended sets $\textbf{K}^{ext}_{\textbf{q},\phi^{\prime}}$ and $\textbf{K}^{ext}_{\phi^{\prime}}$ . Further, for any $X\in\textbf{K}^{ext}_{\textbf{q},\phi^{\prime}}$ , recall that the functions $\textbf{w}_{\mathrm{sdpml}}(\cdot)$ and $\textbf{g}(\cdot)$ are defined as follows,

[TABLE]

In the following lemma we show that for any $X\in\textbf{K}^{ext}_{\textbf{q},\phi^{\prime}}$ returned by our rounding procedure the functions $\textbf{w}_{\mathrm{sdpml}}(X)$ and $\textbf{g}(X)$ approximate each other in their values.

See Lemma 4.19.

Proof.

For all integers $n\geq 0$ , recall the weaker version of Stirling’s approximation we used earlier,

[TABLE]

Now,

[TABLE]

and

[TABLE]

Now $\textbf{P}^{\prime}=\textbf{P}\cup\{\zeta_{b_{1}+j}\}_{j\in[1,b_{2}]}$ and for any $j\in[1,b_{2}]$ , $\zeta_{b_{1}+j}$ is a convex combination of elements in P and therefore $\zeta_{b_{1}+j}\geq 1/2n^{2}$ . In the above expression we used the fact that $(X\mathrm{1})_{i}\leq 2n^{2}$ for each $i\in[1,b_{1}]$ : for any $i\in[1,b_{1}+b_{2}]$ we have $\zeta_{i}\geq 1/2n^{2}$ , which combined with the constraint $\zeta_{ext}^{T}X\mathrm{1}\leq 1$ (because $X\in\textbf{K}^{ext}_{\phi^{\prime}}$ ) ensures this fact. Also,

[TABLE]

In the second inequality we used the fact that solution $X$ returned by our rounding procedure always satisfies $X_{b_{1}+j,k}=0$ for all $j\in[1,b_{2}]$ , $k\in[0,b_{2}]$ and $k\neq j$ . ∎

Appendix D Algorithm for solving our convex program

To make this section self-contained we start by recalling our original SDPML objective.

[TABLE]

We relaxed it to:

[TABLE]

where function $\textbf{g}(X)$ is defined as:

[TABLE]

For $\textbf{f}(X)\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\log\textbf{g}(X)$ the optimization problem can be formulated equivalently as:

[TABLE]

where the constraint set $\textbf{K}^{f}_{\phi^{\prime}}$ is given by

[TABLE]

and function $\textbf{f}(X)$ is:

[TABLE]

Our constraint set $\textbf{K}^{f}_{\phi^{\prime}}$ is bounded and for any $X\in\textbf{K}^{f}_{\phi^{\prime}}$ ,

[TABLE]

However, our function $\textbf{f}(X)$ is not well behaved, as the boundedness of f doesn’t imply any good polynomial bound on $\|X\|_{F}^{2}$ . We leverage the fact that our feasible set is bounded to define a new function which is close to our original function f inside the feasible region and is also well behaved outside it. Define:

[TABLE]

where $\textbf{C}=\mathrm{m}\log(\zeta)^{T}$ and for any $X\in\textbf{K}^{f}_{\phi}$ : $|\textbf{f}(X)-\hat{\textbf{f}}(X)|\leq o(\gamma)$ . Hence optimizing $\textbf{f}(X)$ is equivalent to optimizing $\hat{\textbf{f}}(X)$ in an approximate sense:

[TABLE]

Let $X_{pml,\phi}$ be the matrix $X\in\textbf{K}^{f}_{\phi^{\prime}}$ which corresponds to distribution $p_{pml,\phi}$ . Recall the maximum PML objective $\textbf{w}_{1}(p_{pml,\phi},\phi)$ is a probability term and it is not hard to see that it always lies in $[\exp\left(-n\log n\right),1]$ (the lower bound comes from the uniform distribution on $[n]$ ), and that $\textbf{w}_{2}(X_{pml,\phi})$ , $\textbf{g}(X_{pml,\phi})\in[\exp\left(-2n\log n\right),\exp\left(n\log n\right)]$ (using a crude approximation) because they approximate the value of $\textbf{w}_{1}(p_{pml,\phi},\phi)$ . Combining these bounds, the optimum values of both optimization problems in Equation 33 are always greater than $-n^{2}$ .

In the rest of the section we show how to solve the optimization problem:

[TABLE]

which can be equivalently written as:

[TABLE]

where the convex set $\textbf{K}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{(X,\textbf{t})\in\left(\mathbb{R}^{b_{1}\times b_{2}},\mathbb{R}\right)~{}|~{}\hat{\textbf{f}}(X)\geq\textbf{t}\text{ and }\textbf{t}\geq-n^{2}\}$ .

First we show how to solve a simple optimization problem which in turn will act as an oracle to solve our main optimization problem 46 using the cutting plane method from [LSW15a]. The simple optimization problem, which we refer to as the oracle from here on, is stated next:

[TABLE]

where $D\in\mathbb{R}^{b_{1}\times b_{2}}$ , $c\in\mathbb{R}$ , K is the same convex set and $\hat{\textbf{f}}(\cdot)$ is the same concave function defined above.

We implement the oracle, that is, solve optimization problem 35, by solving a sequence of unconstrained problems that penalize leaving the set K. Formally, for all $\alpha\in\mathbb{R}_{\geq 0}$ we define:

[TABLE]

To implement our oracle we will show how to solve the following to high precision

[TABLE]

Our result will then follow by performing binary search on $\alpha$ and invoking this subroutine.

For any $\alpha$ let $(X^{(\alpha)},\textbf{t}^{\left(\alpha\right)})$ be the optimal solution for optimization problem 37 and also let $(X^{*},\textbf{t}^{*})$ be the optimal solution to 35. It is clear that:

[TABLE]

The second to last inequality follows because $\hat{\textbf{f}}(X^{*})\geq\textbf{t}^{*}$ . Hence we have:

[TABLE]

The higher the value of $\alpha$ , the greater the incentive to satisfy the constraint.

Lemma D.1.

For all $\alpha>0$ the following holds

[TABLE]

where $(X^{(\alpha)},\textbf{t}^{\left(\alpha\right)})$ is the optimum solution pair for optimization problem 37.

Proof.

Direct calculation shows that the following derivatives for $\textbf{h}^{\left(\alpha\right)}$ hold for all inputs:

[TABLE]

By the optimality of $X^{(\alpha)}$ and $\textbf{t}^{\left(\alpha\right)}$ we know these derivatives are $0$ at $(X^{(\alpha)},\textbf{t}^{\left(\alpha\right)})$ and therefore:

[TABLE]

Consequently,

[TABLE]

and substituting this and the value of $\textbf{t}^{\left(\alpha\right)}$ into the formula for H yields

[TABLE]

Combining this equality with the following upper bound for $\textbf{H}(\alpha)$ yields the result:

[TABLE]

∎

Corollary D.2.

For any $\delta>0$ and $\alpha>\textbf{B}_{\alpha,\delta}$ , where $\textbf{B}_{\alpha,\delta}=\max(\sqrt{\frac{\|d\|^{2}+c^{2}}{\lambda}+\delta\lambda+|c|}~{},~{}\frac{1}{4\lambda^{2}n^{2}}(\|d\|^{2}+c^{2})+\lambda n^{2}+|c|+\frac{\delta}{n^{2}}~{},~{}1)$

[TABLE]

Proof.

Suppose $\hat{\textbf{f}}(X^{(\alpha)})<\textbf{t}^{\left(\alpha\right)}+\epsilon$ , then by Lemma D.1, it holds that:

[TABLE]

The final inequality follows from the conditions of the corollary. ∎

Next we show that $X^{(\alpha)}$ is differentiable with respect to $\alpha$ and therefore, H, $\hat{\textbf{f}}(X^{(\alpha)})-\textbf{t}^{\left(\alpha\right)}$ , and $\|X^{(\alpha)}\|_{F}^{2}$ are continuous with respect to $\alpha$ . The crux is the following, simple, possibly well known fact whose proof is a slight modification of that in (cite geometric median).

Lemma D.3.

Let $\hat{\textbf{f}}:\mathbb{R}^{n+1}\rightarrow\mathbb{R}$ be a twice differentiable function and for all $x\in\mathbb{R}^{n}$ and $\alpha\in\mathbb{R}$ define the function $\hat{\textbf{f}}_{\alpha}:\mathbb{R}^{n}\rightarrow\mathbb{R}$ by $\hat{\textbf{f}}_{\alpha}(x)=\hat{\textbf{f}}(x,\alpha)$ and let $x_{\alpha}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\operatorname*{arg\,max}_{x\in\mathbb{R}^{n}}\hat{\textbf{f}}_{\alpha}(x)$ . If $\hat{\textbf{f}}_{\alpha}$ is strictly concave for all $\alpha\in\mathbb{R}$ then $x_{\alpha}$ is differentiable as a function of $\alpha$ .

Proof.

By the optimality conditions for $x_{\alpha}$ we know that $\nabla\hat{\textbf{f}}_{\alpha}(x_{\alpha})=\vec{0}$ . Consequently, since $\hat{\textbf{f}}$ is differentiable, differentiating with respect to $\alpha$ yields by chain rule that

[TABLE]

However, since $\hat{\textbf{f}}$ is strictly concave, all eigenvalues of this matrix are negative and this matrix is invertible yielding the desired result. ∎
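The argument above is an instance of implicit differentiation. Written out in our own notation, where $\nabla_{x}$ and $\nabla^{2}_{xx}$ denote the gradient and Hessian in the first $n$ coordinates and invertibility of the Hessian is taken from the proof, differentiating the optimality condition gives:

$$\nabla_{x}\hat{\textbf{f}}(x_{\alpha},\alpha)=\vec{0}\;\Longrightarrow\;\nabla^{2}_{xx}\hat{\textbf{f}}(x_{\alpha},\alpha)\,\frac{dx_{\alpha}}{d\alpha}+\frac{\partial}{\partial\alpha}\nabla_{x}\hat{\textbf{f}}(x_{\alpha},\alpha)=\vec{0}\;\Longrightarrow\;\frac{dx_{\alpha}}{d\alpha}=-\left[\nabla^{2}_{xx}\hat{\textbf{f}}(x_{\alpha},\alpha)\right]^{-1}\frac{\partial}{\partial\alpha}\nabla_{x}\hat{\textbf{f}}(x_{\alpha},\alpha).$$

As a one-dimensional sanity check on a made-up function, take $\hat{\textbf{f}}(x,\alpha)=-x^{2}+\alpha x$ : the maximizer is $x_{\alpha}=\alpha/2$ , the second derivative in $x$ is $-2$ , the mixed derivative is $1$ , and the formula gives $\frac{dx_{\alpha}}{d\alpha}=-(-2)^{-1}\cdot 1=\frac{1}{2}$ , as expected.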

Lemma D.4.

Functions $\textbf{H}(\alpha)$ , $\hat{\textbf{f}}(X^{(\alpha)})-\textbf{t}^{\left(\alpha\right)}$ and $\|X^{(\alpha)}\|_{F}^{2}$ are continuous in $\alpha$ .

Proof.

Since H is twice differentiable and H is strictly concave, Lemma D.3 implies that $X^{(\alpha)}$ is differentiable and therefore continuous as a function of $\alpha$ . Since $\hat{\textbf{f}}$ and $\|X\|_{F}^{2}$ are continuous functions the result follows. ∎

Lemma D.5.

Let $X^{(1)},X^{(2)}$ be the optimum solutions to Optimization problem 37 with respect to $\alpha^{(1)}$ and $\alpha^{(2)}$ respectively. For any $\alpha^{(1)}<\alpha^{(2)}$ :

[TABLE]

Proof.

Suppose that $\hat{\textbf{f}}(X^{(1)})-\textbf{t}^{(1)}>0$ as the proof for when $\hat{\textbf{f}}(X^{(2)})-\textbf{t}^{(2)}<0$ is analogous. Then since $\alpha^{(1)}<\alpha^{(2)}$ we have $\alpha^{(1)}(\hat{\textbf{f}}(X^{(1)})-\textbf{t}^{(2)})<\alpha^{(2)}(\hat{\textbf{f}}(X^{(1)})-\textbf{t}^{(2)})$ and

[TABLE]

The result follows as $\textbf{H}(\alpha^{(1)})=\textbf{h}^{\left(1\right)}(X^{(1)},\textbf{t}^{(1)})$ and $\textbf{h}^{\left(2\right)}(X^{(1)},\textbf{t}^{(1)})\leq\textbf{h}^{\left(2\right)}(X^{(2)},\textbf{t}^{(2)})=\textbf{H}(\alpha^{(2)})$ . ∎

Corollary D.6.

Let $X^{(1)},X^{(2)}$ be the optimum solutions to Optimization problem 37 with respect to $\alpha^{(1)}$ and $\alpha^{(2)}$ respectively. For any $\alpha^{(1)}<\alpha^{(2)}$ :

[TABLE]

Proof.

We are given $\alpha^{(1)}<\alpha^{(2)}$ and $\hat{\textbf{f}}(X^{(1)})-\textbf{t}^{(1)}>0$ . By the first part of Lemma D.5, $\textbf{H}(\alpha^{(1)})<\textbf{H}(\alpha^{(2)})$ . Suppose $\hat{\textbf{f}}(X^{(2)})-\textbf{t}^{(2)}<0$ ; then by the second part of Lemma D.5 we have $\textbf{H}(\alpha^{(1)})>\textbf{H}(\alpha^{(2)})$ , a contradiction. ∎

Lemma D.7.

For any $\alpha>0$ ,

[TABLE]

where $\textbf{B}_{X}=\max(\frac{\|D\|_{2}^{2}}{4\lambda^{2}}+\frac{n^{9}}{\gamma^{3}},1)$

Proof.

Observe that we can optimize problem 37 with respect to $X$ and t independently. Let us look at the behaviour of the function $\textbf{H}(\alpha)$ with respect to $X$ . From Equations 40-42 we have:

[TABLE]

Also note that $\hat{\textbf{f}}(X)<0$ for $\|X\|_{F}^{2}\geq\frac{n^{6}}{\gamma}$ because the term $\frac{\gamma}{n^{3}}\|X\|_{F}^{2}$ dominates, and there is a trivial solution with $\hat{\textbf{f}}(0)=0$ . Combining these facts we get $\max_{X}\hat{\textbf{f}}(X)=\max_{\{X~{}|~{}\|X\|_{F}^{2}\leq\frac{n^{6}}{\gamma}\}}\hat{\textbf{f}}(X)$ , and $\hat{\textbf{f}}(X)\leq O(\frac{n^{6}}{\gamma^{2}})$ because all $|C_{i,j}|\leq O(n\log n)$ are bounded.

[TABLE]

∎

Lemma D.8.

For any $\alpha>0$ , we can find a solution $(X^{(\epsilon)},\textbf{t}^{(\epsilon)})$ such that $\|X^{(\epsilon)}-X^{(\alpha)}\|_{1}\leq\epsilon\text{ and }\textbf{t}^{(\epsilon)}=\textbf{t}^{\left(\alpha\right)}$ in time $O(b_{1}\cdot b_{2}\log\left(\frac{\textbf{B}_{X}}{\epsilon}\right))$ .

Proof.

Let us recall the objective of optimization problem 37:

[TABLE]

Let us recall the optimality conditions from Equation 39:

[TABLE]

Rearranging terms and taking exponential on Equation 43 yields:

[TABLE]

where $\textbf{a}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\exp\left(\frac{2\lambda}{\alpha}+\frac{2\gamma}{n^{3}}\right)>1$ and $\textbf{b}_{ij}=\exp\left(\frac{D_{ij}-\textbf{C}_{ij}}{\alpha}\right)$ . Let $\textbf{K}^{\alpha,i}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}(X^{(\alpha)}\mathrm{1})_{i}$ and we define new variables $Y^{(\alpha)}_{ij}$ which satisfy the following conditions,

[TABLE]

and we know that $(Y^{(\alpha)}\mathrm{1})_{i}\textbf{K}^{\alpha,i}=(X^{(\alpha)}\mathrm{1})_{i}=\textbf{K}^{\alpha,i}$ , so $Y^{(\alpha)}_{ij}$ should satisfy $(Y^{(\alpha)}\mathrm{1})_{i}=1$ . Let us rewrite Equation 44 in terms of the $Y^{(\alpha)}_{ij}$ variables:

[TABLE]

This can be written equivalently as:

[TABLE]

From Lemma D.7, we can do binary search in $[0,\textbf{B}_{X}]$ to guess $\textbf{K}^{\alpha,i}$ . Let $\ell$ and $u$ be the current lower and upper bounds for the value of $\textbf{K}^{\alpha,i}$ : Assign $\textbf{K}=\frac{\ell+u}{2}$ and we can do binary search to find $Y^{(K)}_{ij}$ such that $\left(\textbf{a}^{\textbf{K}}\right)^{Y^{(K)}_{ij}}Y^{(K)}_{ij}=\textbf{b}_{ij}$ because for fixed a and K function $\left(\textbf{a}^{\textbf{K}}\right)^{Y}Y$ is monotone (increasing) in $Y$ ( $\because\textbf{a}^{\textbf{K}}>1$ ).

1. If $(Y^{(K)}\mathrm{1})_{i}=1$ , assign $X^{(\alpha)}_{ij}=Y^{(K)}_{ij}\textbf{K}$ ; Equation 44 is then satisfied and we are done.

2. If $(Y^{(K)}\mathrm{1})_{i}<1$ , update $u=\textbf{K}$ , that is, decrease our guess for $\textbf{K}^{\alpha,i}$ to $\frac{\ell+\textbf{K}}{2}$ , and observe that in the next iteration the values of all $Y^{(K)}_{ij}$ increase as $\textbf{b}_{ij}$ is fixed.

3. Else if $(Y^{(K)}\mathrm{1})_{i}>1$ , update $\ell=\textbf{K}$ , by an analysis similar to the case above.

4. Assign $\textbf{K}=\frac{\ell+u}{2}$ and repeat.

Note we never have to work with $Y_{ij}$ variables, we introduced them to better understand our binary search procedure. From Lemma D.7 we have a good bound on $\|X^{(\alpha)}\|_{F}^{2}$ and the above procedure finds a solution $(X^{(\epsilon)},\textbf{t}^{(\epsilon)})$ such that $\|X^{(\epsilon)}-X^{(\alpha)}\|_{1}\leq\epsilon$ and $\textbf{t}^{(\epsilon)}=\textbf{t}^{\left(\alpha\right)}$ (because we have closed form expression for $\textbf{t}^{\left(\alpha\right)}$ ) in time $O(b_{1}\cdot b_{2}\log\left(\frac{\textbf{B}_{X}}{\epsilon}\right))$ . ∎
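The nested binary search described in the proof above can be sketched in Python for a single row $i$ . The sketch solves $\left(\textbf{a}^{\textbf{K}}\right)^{Y_{j}}Y_{j}=\textbf{b}_{j}$ for each $j$ by an inner bisection and adjusts $\textbf{K}$ by an outer bisection until the $Y_{j}$ sum to 1. The function names `solve_inner` and `solve_row`, the fixed iteration counts, and the numerical values of `a`, `b_row` and `k_max` are illustrative assumptions of ours; in particular `k_max` stands in for the bound $\textbf{B}_{X}$ , and the example assumes $\sum_{j}\textbf{b}_{j}>1$ so that a positive solution exists.

```python
def solve_inner(c, b, iters=100):
    """Bisection for y >= 0 with y * c**y == b, assuming c >= 1 and b > 0."""
    lo, hi = 0.0, b                  # y * c**y >= y when c >= 1, so the root is at most b
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid * c ** mid < b:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def solve_row(a, b_row, k_max, iters=100):
    """Outer bisection on the row sum K; returns (K, row of X) with X_j = Y_j * K."""
    lo, hi = 0.0, k_max
    for _ in range(iters):
        k = (lo + hi) / 2
        total = sum(solve_inner(a ** k, b) for b in b_row)
        if total < 1:
            hi = k                   # the Y_j sum to less than 1: lower K so every Y_j grows
        else:
            lo = k                   # the Y_j sum to more than 1: raise K
    k = (lo + hi) / 2
    return k, [solve_inner(a ** k, b) * k for b in b_row]

# Made-up instance: a > 1 and three coefficients b_j whose sum exceeds 1.
k, x_row = solve_row(a=1.5, b_row=[0.6, 0.8, 0.9], k_max=50.0)
```

Both bisections rely on monotonicity: $y\mapsto\left(\textbf{a}^{\textbf{K}}\right)^{y}y$ is increasing in $y$ , and for fixed $\textbf{b}_{j}$ each solution $Y_{j}$ decreases as $\textbf{K}$ grows.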

Lemma D.9.

Optimization problem 35 can be solved to $\delta$ accuracy in time $O(b_{1}\cdot b_{2}\log(\frac{\textbf{B}_{X}}{\epsilon_{2}})\log(\textbf{B}_{\alpha,\delta}))$ .

Proof.

First we show that solving optimization problem 37 for $\alpha^{*}$ for which the solution pair $(X^{*},\textbf{t}^{*})$ satisfies $\epsilon_{1}<f(X^{*})-\textbf{t}^{*}<2\epsilon_{1}$ for $\epsilon_{1}=\frac{\delta}{4\alpha}$ solves our main problem 35. Observe that the solution pair $(X^{*},\textbf{t}^{*})$ satisfies our constraint and also our objective value for problem 37 at $(X^{*},\textbf{t}^{*})$ is greater than $\textbf{OPT}-\frac{\delta}{2}$ as shown below:

[TABLE]

The first inequality follows because $\epsilon_{1}<f(X^{*})-\textbf{t}^{*}<2\epsilon_{1}$ and the latter one follows from Equation 38. By similar reasoning we are also done if at $\alpha=0$ the optimal solution pair $(X^{(\alpha)},\textbf{t}^{\left(\alpha\right)})$ (which has a closed form solution) satisfies the constraint $\hat{\textbf{f}}(X^{(\alpha)})>\textbf{t}^{\left(\alpha\right)}$ ; the interesting case is when this constraint is not satisfied at $\alpha=0$ . In such a case the existence of $\alpha^{*}$ such that $\epsilon_{1}<f(X^{*})-\textbf{t}^{*}<2\epsilon_{1}$ follows from continuity (Lemma D.4) and boundedness of $\alpha$ (Corollary D.2). Corollary D.6 and Lemma D.5 suggest that we can find such an $\alpha$ by binary search over the interval $(0,\textbf{B}_{\alpha,\delta}]$ , and Lemma D.8 finds a solution $X^{(\epsilon)}$ such that $\|X^{(\epsilon)}-X^{(\alpha)}\|_{1}\leq\epsilon_{2}$ . Choose $\epsilon_{2}<\frac{\epsilon_{1}}{poly(\textbf{B}_{X},\textbf{B}_{\alpha,\epsilon})n^{10}}$ .

$\bullet$ If $(X^{(\alpha)}\mathrm{1})_{i}\leq\frac{\epsilon_{1}}{n^{5}}$ then $(X^{(\epsilon)}\mathrm{1})_{i}\leq\frac{2\epsilon_{1}}{n^{5}}$ as well, and $|X^{(\alpha)}_{i,j}\log(X^{(\alpha)}\mathrm{1})_{i}-X^{(\epsilon)}_{i,j}\log(X^{(\epsilon)}\mathrm{1})_{i}|\leq|X^{(\alpha)}_{i,j}\log(X^{(\alpha)}\mathrm{1})_{i}|+|X^{(\epsilon)}_{i,j}\log(X^{(\epsilon)}\mathrm{1})_{i}|\leq O(\frac{|\epsilon_{1}\log\epsilon_{1}|}{n^{5}})$ .

$\bullet$ Otherwise $(X^{(\alpha)}\mathrm{1})_{i}>\frac{\epsilon_{1}}{n^{5}}$ , so $(X^{(\epsilon)}\mathrm{1})_{i}>\frac{\epsilon_{1}}{2n^{5}}$ as well, and $|X^{(\alpha)}_{i,j}\log(X^{(\alpha)}\mathrm{1})_{i}-X^{(\epsilon)}_{i,j}\log(X^{(\epsilon)}\mathrm{1})_{i}|\leq|X^{(\alpha)}_{i,j}\log(\frac{(X^{(\alpha)}\mathrm{1})_{i}}{(X^{(\epsilon)}\mathrm{1})_{i}})|+|\epsilon_{2}\log(\frac{(X^{(\alpha)}\mathrm{1})_{i}}{(X^{(\epsilon)}\mathrm{1})_{i}})|\leq|X^{(\alpha)}_{i,j}\log(1\pm\frac{\epsilon_{2}}{(X^{(\epsilon)}\mathrm{1})_{i}})|+|\epsilon_{2}\log(1\pm\frac{\epsilon_{2}}{(X^{(\epsilon)}\mathrm{1})_{i}})|\leq|X^{(\alpha)}_{i,j}\frac{\epsilon_{2}}{(X^{(\epsilon)}\mathrm{1})_{i}}|+|\epsilon_{2}\frac{\epsilon_{2}}{(X^{(\epsilon)}\mathrm{1})_{i}}|\leq O(\epsilon_{2})$ .

A similar analysis applies to the other terms in $\hat{\textbf{f}}(X)$ , and the bound on $|\hat{\textbf{f}}(X^{(\alpha)})-\hat{\textbf{f}}(X^{(\epsilon)})|$ follows because:

[TABLE]

Recall that $\textbf{t}^{(\epsilon)}=\textbf{t}^{\left(\alpha\right)}$ ; combined with the inequality above and $\epsilon_{1}<f(X^{*})-\textbf{t}^{*}<2\epsilon_{1}$ , this implies:

[TABLE]

Now all that remains is to bound the objective value of optimization problem 35 $D\cdot X^{(\epsilon)}+c\textbf{t}^{(\epsilon)}-\lambda\left(\|X^{(\epsilon)}\|_{F}^{2}+(\textbf{t}^{(\epsilon)})^{2}\right)$ .

[TABLE]

The whole procedure can be implemented in time $O(b_{1}\cdot b_{2}\log(\frac{\textbf{B}_{X}}{\epsilon_{2}})\log(\textbf{B}_{\alpha,\delta}))$ . ∎
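The outer binary search over $\alpha$ used in this proof can be illustrated with a minimal Python sketch. The function `gap` is a hypothetical stand-in for $\alpha\mapsto f(X^{(\alpha)})-\textbf{t}^{(\alpha)}$ ; the sketch assumes that it is continuous and non-decreasing in $\alpha$ , at most $\epsilon_{1}$ at $\alpha=0$ and at least $2\epsilon_{1}$ at the upper bound. These assumptions are for illustration only and are not claimed to be the exact procedure of the paper.

```python
from typing import Callable, Tuple


def find_alpha(gap: Callable[[float], float], alpha_max: float,
               eps1: float, max_iter: int = 200) -> Tuple[float, float]:
    """Bisect over (0, alpha_max] for an alpha with eps1 < gap(alpha) < 2*eps1.

    `gap` is a hypothetical oracle, assumed continuous and non-decreasing,
    with gap(0) <= eps1 and gap(alpha_max) >= 2*eps1.
    """
    lo, hi = 0.0, alpha_max
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        g = gap(mid)
        if g <= eps1:          # gap too small: move right
            lo = mid
        elif g >= 2.0 * eps1:  # gap too large: move left
            hi = mid
        else:                  # eps1 < g < 2*eps1: target window reached
            return mid, g
    raise RuntimeError("no alpha found; assumptions on `gap` may not hold")
```

Each iteration halves the search interval, which is where a logarithmic factor in the bound on $\alpha$ comes from in this kind of search.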

Now we are in good shape to solve our main optimization problem 46. First we write our optimization problem in vector form:

[TABLE]

where the convex set $\textbf{K}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{(x,\textbf{t})\in(\mathbb{R}^{b_{1}\cdot(b_{2}+1)},\mathbb{R})~{}|~{}\hat{\textbf{f}}(x)\geq\textbf{t}\text{ and }\textbf{t}\geq-n^{2}\}$ and our matrix $\textbf{A}\in\mathbb{R}^{(b_{2}+1)\times b_{1}\cdot(b_{2}+1)}$ together with the vector $\textbf{b}\in\mathbb{R}^{b_{2}+1}$ represent the linear constraints in the set $\textbf{K}^{f}_{\phi^{\prime}}$ . (Footnote 7: Our matrix A is sparse, and a matrix-vector product with it can be computed in time $O(b_{1}\cdot b_{2})$ .)

[TABLE]

The formulation above is in the form of the general optimization problem $(11.14)$ in [LSW15a]. For convenience we restate the optimization problem $(11.14)$ from [LSW15a]:

[TABLE]

where K is a convex set. To solve this general optimization problem, the algorithm in [LSW15a] requires a $\delta$ -2nd-order-optimization oracle, which is defined below:

Definition D.10.

Given a convex set K and $\delta>0$ , a $\delta$ -2nd-order-optimization oracle for K is a function on $\mathbb{R}^{b_{2}}$ such that for any input $c\in\mathbb{R}^{b_{2}}$ and $\lambda>0$ , it outputs $y$ such that

[TABLE]

We denote by $OO^{(2)}_{\lambda,\delta}(\textbf{K})$ the time complexity of this oracle.

Our simple optimization problem 35 is exactly the $\delta$ -2nd-order-optimization oracle for our main optimization problem 46. Consequently, all that remains to solve optimization problem 46 is to bound the eigenvalues of $\textbf{A}\textbf{A}^{\top}$ and put together the results of this section to obtain our desired running time. We do this in Lemma F.8 and Theorem D.12 respectively.

Lemma D.11.

The eigenvalues of matrix $\textbf{A}\textbf{A}^{\top}$ are either $b_{1}$ or of the form

[TABLE]

and therefore the smallest eigenvalue of $\textbf{A}\textbf{A}^{\top}$ is at least

[TABLE]

Proof.

Direct calculation shows that if $\vec{1}_{b_{2}}\in\mathbb{R}^{b_{2}}$ is the $b_{2}$ -dimensional all-ones vector, $I_{b_{2}}\in\mathbb{R}^{b_{2}\times b_{2}}$ is the $b_{2}$ -dimensional identity matrix and $\zeta\in\mathbb{R}^{b_{1}}$ with $\zeta_{i}=\frac{1}{2n^{2}}(1+\epsilon_{1})^{i}$ then for all $x\in\mathbb{R}^{b_{2}}$ and $\alpha\in\mathbb{R}$ we have

[TABLE]

Consequently $v=(x,\alpha)^{T}$ is an eigenvector of $\textbf{A}\textbf{A}^{\top}$ with eigenvalue $\lambda$ if and only if

[TABLE]

Now if $x\perp\vec{1}_{b_{2}}$ then we see that $v$ is an eigenvector if and only if $\alpha=0$ , in which case the eigenvalue is $b_{1}$ . On the other hand, if $x=\vec{1}_{b_{2}}$ then we see that $v$ is an eigenvector with eigenvalue $\lambda$ if and only if

[TABLE]

When this happens we have $b_{2}\|\zeta\|_{1}+\alpha(b_{2}+1)\|\zeta\|_{2}^{2}=b_{1}\alpha+\alpha^{2}\|\zeta\|_{1}$ and solving for $\alpha$ yields that

[TABLE]

Substituting this into $\lambda=b_{1}+\alpha\|\zeta\|_{1}$ yields the eigenvalues.

[TABLE]

The lower bound follows from the fact that $\sqrt{1-a}\leq\sqrt{(1-a/2)^{2}}=1-a/2$ when $a>0$ and therefore

[TABLE]

The smallest eigenvalue is at least $\frac{2b_{1}(b_{2}+1)\|\zeta\|_{2}^{2}-2b_{2}\|\zeta\|_{1}^{2}}{b_{1}+(b_{2}+1)\|\zeta\|_{2}^{2}}$ . Recall $b_{1}=\theta(\frac{\log n}{\epsilon_{1}})$ and $b_{1}$ is such that $\frac{1}{n^{2}}(1+\epsilon_{1})^{b_{1}}\geq 1$ and $\frac{1}{n^{2}}(1+\epsilon_{1})^{b_{1}-1}<1$ . The lemma statement follows because $\|\zeta\|_{1}=\sum_{i=0}^{b_{1}}\frac{1}{2n^{2}}(1+\epsilon_{1})^{i}=\theta(\frac{1}{\epsilon_{1}})$ and $\|\zeta\|_{2}^{2}=\sum_{i=0}^{b_{1}}\frac{1}{4n^{4}}(1+\epsilon_{1})^{2i}=\theta(\frac{1}{\epsilon_{1}})$ .

∎
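As a numerical sanity check of the eigenvector argument above, the following Python sketch builds a matrix with the block structure that the identity $b_{2}\|\zeta\|_{1}+\alpha(b_{2}+1)\|\zeta\|_{2}^{2}=b_{1}\alpha+\alpha^{2}\|\zeta\|_{1}$ suggests for $\textbf{A}\textbf{A}^{\top}$ : a $b_{1}I_{b_{2}}$ block, an off-diagonal column $\|\zeta\|_{1}\vec{1}_{b_{2}}$ , and a corner entry $(b_{2}+1)\|\zeta\|_{2}^{2}$ . This block form is an assumption inferred from the proof text, and the function name `eigen_check` is hypothetical.

```python
import math
from typing import List, Tuple


def eigen_check(zeta: List[float], b2: int) -> Tuple[List[Tuple[float, float]], float]:
    """Return [(eigenvalue, residual)] for the two x = all-ones eigenvectors,
    plus the residual for one vector with x orthogonal to all-ones (needs b2 >= 2)."""
    b1 = len(zeta)
    z1 = sum(zeta)                           # ||zeta||_1
    c = (b2 + 1) * sum(z * z for z in zeta)  # (b2 + 1) * ||zeta||_2^2
    size = b2 + 1
    # Assumed block form of A A^T (see lead-in).
    M = [[0.0] * size for _ in range(size)]
    for j in range(b2):
        M[j][j] = float(b1)
        M[j][b2] = z1
        M[b2][j] = z1
    M[b2][b2] = c

    def residual(v: List[float], lam: float) -> float:
        Mv = [sum(M[r][k] * v[k] for k in range(size)) for r in range(size)]
        return max(abs(Mv[r] - lam * v[r]) for r in range(size))

    # Roots of z1*a^2 + (b1 - c)*a - b2*z1 = 0, i.e. the quadratic in alpha.
    root = math.sqrt((b1 - c) ** 2 + 4 * b2 * z1 * z1)
    pairs = []
    for sign in (1.0, -1.0):
        alpha = (c - b1 + sign * root) / (2 * z1)
        lam = b1 + alpha * z1                # lambda = b1 + alpha * ||zeta||_1
        pairs.append((lam, residual([1.0] * b2 + [alpha], lam)))
    perp = [1.0, -1.0] + [0.0] * (b2 - 1)    # x orthogonal to all-ones, alpha = 0
    return pairs, residual(perp, float(b1))
```

The residuals should be close to zero whenever the assumed block form holds, independently of the particular values in `zeta`.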

Below is the theorem we invoke to solve the optimization problem.

Theorem D.12 (Theorem 56 from [LSW15b]).

Assume that $\max_{x\in\textbf{K}}\left\|x\right\|_{2}<M$ , $\big{\|}b\big{\|}_{2}<M$ , $\big{\|}c\big{\|}_{2}<M$ , $\big{\|}\textbf{A}\big{\|}_{2}<M$ and $\lambda_{\min}(\textbf{A})>1/M$ . Assume that $\textbf{K}\cap\{\textbf{A}x=b\}\neq\emptyset$ and we have $\epsilon$ -2nd-order-optimization oracle for every $\epsilon>0$ . For $0<\delta<1$ , we can find $z\in\textbf{K}$ such that

[TABLE]

and $\big{\|}\textbf{A}z-b\big{\|}_{2}\leq\delta$ . This algorithm takes time

[TABLE]

where $r$ is the number of rows in A, $\eta=\left(\frac{\delta}{nM}\right)^{\Theta(1)}$ and $\lambda=\left(\frac{\delta}{nM}\right)^{\Theta(1)}$ .

Theorem D.13.

Optimization problem 46 can be solved in time $O\left(b_{2}^{2}b_{1}\log^{O(1)}(b_{1}b_{2})+b_{2}^{3}\log^{O(1)}(b_{1}b_{2})\right)$ .

Proof.

The proof follows by combining Theorem D.12 with Lemmas D.8 and D.9, and noting that the parameters in the running time, $\|d\|_{2},|c|$ and $1/\lambda$ , are all bounded by $O(poly(b_{1},b_{2}))$ and we pay only a logarithmic factor in these terms. ∎

Appendix E Proofs for multidimensional PML

Here we show how the techniques built throughout this paper apply to a more general setting. In particular, we provide an efficient algorithm for computing approximate PML in higher dimensions when the dimension is constant. The proofs and techniques are analogous to one-dimensional PML, but there are a few places, such as the proof of the minimum probability lemma and the singular value lower bound for the constraint matrix (for optimization), where we require more general proofs.

E.1 Preliminaries for $d$ -dimensional objects

$d$ -tuple: c is a $d$ -tuple if $\textbf{c}\in\mathbb{R}^{d}$ . For all $k\in[1,d]$ , we use $\textbf{c}(k)$ to denote its $k$ ’th element.

Arithmetic operations on $d$ -tuples: For any two $d$ -tuples c, $\textbf{c}^{\prime}$ and an arithmetic operator $\mathrm{op}\in\{+,\times,-,/\}$ , the operation $\textbf{c}~{}\mathrm{op}~{}\textbf{c}^{\prime}$ denotes element wise operation, meaning it outputs another $d$ -tuple equal to $(\textbf{c}(1)~{}\mathrm{op}~{}\textbf{c}^{\prime}(1),\dots,\textbf{c}(d)~{}\mathrm{op}~{}\textbf{c}^{\prime}(d))$ . Further for any $d$ -tuple c and scalar $s$ , the operation $\textbf{c}~{}\mathrm{op}~{}s$ denotes element wise scalar operation, meaning it outputs another $d$ -tuple equal to $(\textbf{c}(1)~{}\mathrm{op}~{}s,\dots,\textbf{c}(d)~{}\mathrm{op}~{}s)$ . The only exception is the power operation $\textbf{c}^{\textbf{c}^{\prime}}$ , which returns a scalar value equal to:

[TABLE]

Also for a $d$ -tuple c and scalar $s$ we define:

[TABLE]

Logic operations on $d$ -tuples: For any two $d$ -tuples c and $\textbf{c}^{\prime}$ and a logic operator $\mathrm{op}\in\{\leq,\geq,=\}$ , the operation $\textbf{c}~{}\mathrm{op}~{}\textbf{c}^{\prime}$ is true if and only if $\textbf{c}(k)~{}\mathrm{op}~{}\textbf{c}^{\prime}(k)$ is true for all $k\in[1,d]$ . Further for any $d$ -tuple c and scalar $s$ , the logic operation $\textbf{c}~{}\mathrm{op}~{}s$ is true iff $\textbf{c}(k)~{}\mathrm{op}~{}s$ is true for all $k\in[1,d]$ .
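The element-wise conventions above can be sketched in Python as follows. The helper names are hypothetical. The arithmetic and logic operations follow the text directly; the scalar-valued power is written as the product $\prod_{k}\textbf{c}(k)^{\textbf{c}^{\prime}(k)}$ , which is an assumption about the displayed definition, chosen to be consistent with the product form of the sequence probability.

```python
import operator
from math import prod
from typing import Callable, Tuple, Union

DTuple = Tuple[float, ...]
Operand = Union[DTuple, int, float]


def _as_tuple(c: DTuple, other: Operand) -> DTuple:
    """Broadcast a scalar to a d-tuple; check lengths otherwise."""
    if isinstance(other, (int, float)):
        return (other,) * len(c)
    if len(other) != len(c):
        raise ValueError("d-tuples must have the same length")
    return tuple(other)


def elementwise(op: Callable, c: DTuple, other: Operand) -> DTuple:
    """c op c' (or c op s), applied coordinate by coordinate."""
    return tuple(op(a, b) for a, b in zip(c, _as_tuple(c, other)))


def holds(op: Callable, c: DTuple, other: Operand) -> bool:
    """Logic operation: true iff it is true in every coordinate k."""
    return all(op(a, b) for a, b in zip(c, _as_tuple(c, other)))


def power(c: DTuple, c_prime: DTuple) -> float:
    """Scalar-valued power; the product form is an assumption (see lead-in)."""
    return prod(a ** b for a, b in zip(c, _as_tuple(c, c_prime)))
```

For example, `holds(operator.le, (1, 3), (2, 2))` is false because the second coordinate violates the inequality, even though the first satisfies it.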

Floor and ceil operations on $d$ -tuples: For a $d$ -tuple c and set S of $d$ -tuples we use the notation $\lfloor\textbf{c}\rfloor_{\textbf{S}}$ and $\lceil\textbf{c}\rceil_{\textbf{S}}$ to denote the following $d$ -tuples:

[TABLE]

We next recall (defined in Section 3.1) the setting for higher dimensions.

Setting for higher dimension: For each $k\in[1,d]$ , we receive a sequence $\textbf{y}^{\textbf{n}(k)}$ that consists of $\textbf{n}(k)$ independent samples drawn from an underlying distribution $\textbf{p}(k)$ supported on same domain $\mathcal{D}$ , further $\textbf{y}^{\textbf{n}(k)}$ is independent of other sequences $\textbf{y}^{\textbf{n}(k^{\prime})}$ for $k^{\prime}\in[1,d]$ and $k^{\prime}\neq k$ . We call $\textbf{y}^{\textbf{n}}=(\textbf{y}^{\textbf{n}{(1)}},\dots\textbf{y}^{\textbf{n}{(d)}})$ a $d$ -sequence and $\textbf{n}=(\textbf{n}(1),\dots,\textbf{n}(d))$ its $d$ -length. Let $\mathcal{D}^{\textbf{n}}$ be the set of all $d$ -sequences of $d$ -length equal to n. We use $\textbf{p}_{x}(k)$ to denote the probability of domain element $x$ in distribution $\textbf{p}(k)$ . We also refer $\textbf{p}=(\textbf{p}(1),\dots,\textbf{p}(d))$ as a $d$ -distribution and let $\Delta^{\mathcal{D},d}$ be the set of all $d$ -distributions.

For any $d$ -distribution $\textbf{p}\in\Delta^{\mathcal{D},d}$ , the probability of a $d$ -sequence $\textbf{y}^{\textbf{n}}$ is defined as:

[TABLE]

Recall for each $k\in[1,d]$ , $\textbf{f}(\textbf{y}^{\textbf{n}(k)},x)$ is the frequency of domain element $x$ in sequence $\textbf{y}^{\textbf{n}(k)}$ . For any $d$ -sequence $\textbf{y}^{\textbf{n}}$ , we call $\textbf{f}(\textbf{y}^{\textbf{n}},x)=(\textbf{f}(\textbf{y}^{\textbf{n}{(1)}},x),\dots,\textbf{f}(\textbf{y}^{\textbf{n}{(d)}},x))$ the $d$ -frequency of domain element $x$ in $\textbf{y}^{\textbf{n}}$ . Let $\textbf{F}^{\textbf{n}}$ be the set of all $d$ -frequencies generated by different domain elements in all possible $d$ -sequences in $\mathcal{D}^{\textbf{n}}$ and we use $\textbf{f}_{j}\in\textbf{F}^{\textbf{n}}$ to denote its $j$ th element.
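A small Python sketch of the $d$ -frequency of each observed domain element, together with the probability of a $d$ -sequence, may help fix the notation. Function names are hypothetical. The probability is written as the product over the $d$ independent sequences of the one-dimensional expression $\prod_{x}p_{x}^{f(y^{n},x)}$ , which is assumed here to match the displayed definition.

```python
from collections import Counter
from math import prod
from typing import Dict, Hashable, List, Sequence, Tuple


def d_frequencies(seqs: Sequence[Sequence[Hashable]]) -> Dict[Hashable, Tuple[int, ...]]:
    """Map each observed element x to (f(y^{n(1)}, x), ..., f(y^{n(d)}, x))."""
    counts = [Counter(seq) for seq in seqs]
    observed = set().union(*seqs)
    return {x: tuple(c[x] for c in counts) for x in observed}


def sequence_probability(p: List[Dict[Hashable, float]],
                         seqs: Sequence[Sequence[Hashable]]) -> float:
    """Probability of the d-sequence under the d-distribution p.

    p[k][x] plays the role of p_x(k); the sequences are independent,
    so the probabilities of the individual samples multiply.
    """
    return prod(p[k][x] for k, seq in enumerate(seqs) for x in seq)
```

Only observed elements appear in the output of `d_frequencies`; unseen elements implicitly have $d$ -frequency $(0,\dots,0)$ .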

We next define a few more $d$ -dimensional objects of interest.

$d$ -vector: $\textbf{v}=(\textbf{v}{(1)},\dots,\textbf{v}{(d)})$ is a $d$ -vector if for each element $k\in[1,d]$ , $\textbf{v}{(k)}$ is a vector supported on the same domain $\mathcal{D}$ . We use $\textbf{v}_{x}$ to denote the row corresponding to domain element $x$ and $\textbf{v}(k)$ to denote its $k$ ’th column. Let $\Delta_{\mathrm{vector}}^{\mathcal{D},d}$ be the set of all $d$ -vectors and note that $d$ -distribution is a $d$ -vector.

Norm of $d$ -vectors: For a $d$ -vector v, its norm denoted by $\|\textbf{v}\|$ is a $d$ -tuple equal to $(\|\textbf{v}{(1)}\|,\dots,\|\textbf{v}{(d)}\|)$ .

$d$ -pseudodistribution: $\textbf{q}=(\textbf{q}{(1)},\dots,\textbf{q}{(d)})$ is a $d$ -pseudodistribution if for each element $k\in[1,d]$ , $\textbf{q}{(k)}$ is a pseudo-distribution supported on the same domain $\mathcal{D}$ or equivalently $\|\textbf{q}\|_{1}\leq 1$ . Let $\Delta_{\mathrm{pseudo}}^{\mathcal{D},d}$ be the set of all $d$ -pseudodistributions and $\Delta^{\mathcal{D},d}\subset\Delta_{\mathrm{pseudo}}^{\mathcal{D},d}\subset\Delta_{\mathrm{vector}}^{\mathcal{D},d}$ .

$d$ -level set: For a $d$ -distribution p and $d$ -pseudodistribution q, we call $\textbf{p}_{x}$ and $\textbf{q}_{x}$ $d$ -level sets corresponding to $x$ respectively.

$d$ -Type: For any $d$ -sequence $\textbf{y}^{\textbf{n}}$ , $\psi=(\Psi(\textbf{y}^{\textbf{n}{(1)}}),\dots,\Psi(\textbf{y}^{\textbf{n}{(d)}}))$ represents the $d$ -type of $\textbf{y}^{\textbf{n}}$ and we call n its $d$ -length. Recall $\psi(k)\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\Psi(\textbf{y}^{\textbf{n}(k)})$ is the type of sequence $\textbf{y}^{\textbf{n}(k)}$ and we overload notation and let $\psi=\Psi(\textbf{y}^{\textbf{n}})$ denote $(\Psi(\textbf{y}^{\textbf{n}{(1)}}),\dots,\Psi(\textbf{y}^{\textbf{n}{(d)}}))$ . We use $\psi_{x}=\textbf{f}(\textbf{y}^{\textbf{n}},x)$ to denote the row corresponding to domain element $x$ ; the expressions $\psi_{x}(k)$ , $\psi(k)_{x}$ and $\textbf{f}(\textbf{y}^{\textbf{n}(k)},x)$ all denote the same quantity. Let $\Psi^{n}$ be the set of all $d$ -types of $d$ -length equal to n.

For a $d$ -distribution $\textbf{p}\in\Delta^{\mathcal{D},d}$ , the probability of a $d$ -type $\psi\in\Psi^{n}$ is:

[TABLE]

We use the following shorthand notation to denote the counting term in the above expression.

[TABLE]

$d$ -Profile: For any $d$ -sequence $\textbf{y}^{\textbf{n}}\in\mathcal{D}^{\textbf{n}}$ , $\phi=\Phi(\textbf{y}^{\textbf{n}})$ is a $d$ -profile if $\phi=(\phi_{j})_{j=1\dots|\textbf{F}^{\textbf{n}}|}$ and $\phi_{j}=|\{x\in\mathcal{D}~{}|~{}\textbf{f}(\textbf{y}^{\textbf{n}},x)=\textbf{f}_{j}\}|$ is the number of domain elements with $d$ -frequency $\textbf{f}_{j}$ . (Footnote 9: The $d$ -profile does not contain the $(0,\dots,0)$ $d$ -frequency element because we don’t know the number of unseen domain symbols.) We call n the $d$ -length of $\phi$ and use $\Phi^{n}$ to denote the set of all $d$ -profiles of $d$ -length equal to n.
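To make the definition concrete, here is a minimal Python sketch that turns a table of $d$ -frequencies into a $d$ -profile. The function name `d_profile` and the dictionary representation (a map from each $d$ -frequency to its count, rather than a vector indexed by $j$ ) are illustrative choices, not notation from the paper.

```python
from collections import Counter
from typing import Dict, Hashable, Tuple


def d_profile(freqs: Dict[Hashable, Tuple[int, ...]]) -> Dict[Tuple[int, ...], int]:
    """Count the domain elements sharing each non-zero d-frequency.

    The all-zero d-frequency is skipped, since the number of unseen
    domain symbols is not known from the sample.
    """
    return dict(Counter(f for f in freqs.values() if any(f)))
```

Note that the labels of the domain elements are discarded: two samples that differ only by a relabeling of the domain yield the same profile.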

For any $d$ -distribution $\textbf{p}\in\Delta^{\mathcal{D},d}$ , the probability of a $d$ -profile $\phi\in\Phi^{n}$ is defined as:

[TABLE]

We can also define the $d$ -profile of a $d$ -type $\psi$ . We overload notation and use $\phi=\Phi(\psi)$ to denote the $d$ -profile associated with $d$ -type $\psi$ and $\phi_{j}=\phi_{j}(\psi)\stackrel{{\scriptstyle\mathrm{def}}}{{=}}|\{x\in\mathcal{D}~{}|~{}\psi_{x}=\textbf{f}_{j}\}|$ . Consider all types $\psi$ such that $\Phi(\psi)=\phi$ and observe that they all have the same $\binom{\textbf{n}}{\psi}$ value. We use notation $C_{\phi}$ to represent this quantity:

[TABLE]

Profile maximum likelihood: For any $d$ -profile $\phi\in\Phi^{n}$ , a Profile Maximum Likelihood (PML) $d$ -distribution $\textbf{p}_{pml,\phi}\in\Delta^{\mathcal{D},d}$ is:

[TABLE]

and $\mathbb{P}(\textbf{p}_{pml,\phi},\phi)$ is the maximum PML objective value.

Approximate profile maximum likelihood: For any $d$ -profile $\phi\in\Phi^{n}$ , a $d$ -distribution $\textbf{p}^{\beta}_{pml,\phi}\in\Delta^{\mathcal{D},d}$ is a $\beta$ -approximate PML $d$ -distribution if

[TABLE]

Note: As in the case of one dimension, we extend the definition of $\mathbb{P}(\textbf{v},\textbf{y}^{\textbf{n}})$ to any $d$ -vector v. Further, for any probability terms defined in the future involving p, we assume those expressions are also defined for any $d$ -vector v just by replacing $\textbf{p}_{x}(k)$ by $\textbf{v}_{x}(k)$ everywhere; $\textbf{v}(k)_{x}$ and $\textbf{v}_{x}(k)$ mean the same thing.

Probability discretization: Let $\textbf{P}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{\zeta_{i}:i=1,\dots\textbf{b}\}$ be the set representing discretization of $d$ -probability space where for each $i\in[1,\textbf{b}]$ , $\zeta_{i}$ is a $d$ -level set. Further all elements in P are of the form $((1+\epsilon(1))^{1-i_{1}},\dots,(1+\epsilon(d))^{1-i_{d}})$ for some fixed $\epsilon\in\mathbb{R}_{>0}^{1\times d}$ and for all possible index $i_{k}\in[1,\textbf{b}_{k}]$ , where for each $k\in[1,d]$ , $\textbf{b}_{k}$ is such that $(1+\epsilon(k))^{1-\textbf{b}_{k}}\leq\frac{1}{2\textbf{n}(k)^{2}}$ and $\textbf{b}=\prod_{k=1}^{d}\textbf{b}_{k}$ .

Discrete $d$ -pseudodistribution: For any $d$ -distribution $\textbf{p}\in\Delta^{\mathcal{D},d}$ , its discrete $d$ -pseudodistribution $\textbf{q}=disc(\textbf{p})\in\Delta_{\mathrm{pseudo}}^{\mathcal{D},d}$ is defined as:

[TABLE]

We use $\Delta_{\mathrm{discrete}}^{\mathcal{D},d}$ to denote the set of all discrete $d$ -pseudodistributions. Note that $\lfloor\textbf{p}_{x}\rfloor_{\textbf{P}}\geq\frac{\textbf{p}_{x}}{1+\epsilon}$ and $\frac{1}{1+\epsilon}\leq||\textbf{q}||_{1}\leq 1$ .
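The probability grid and the rounding-down step can be sketched in Python for a single coordinate $k$ . The names `level_grid` and `floor_to_grid` are hypothetical, and returning $0$ for a probability below the smallest grid value is an illustrative convention rather than something stated in the text. The full set P is assumed to be the Cartesian product of the per-coordinate grids.

```python
from itertools import product
from typing import List, Sequence, Tuple


def level_grid(eps: float, n: int) -> List[float]:
    """Values (1 + eps)^(1 - i) for i = 1..b, where b is the first index
    at which the value drops to at most 1 / (2 n^2)."""
    grid, i = [], 1
    while True:
        value = (1.0 + eps) ** (1 - i)
        grid.append(value)
        if value <= 1.0 / (2 * n * n):
            return grid
        i += 1


def floor_to_grid(p: float, grid: Sequence[float]) -> float:
    """Largest grid value that is at most p (grid is in decreasing order)."""
    for g in grid:
        if g <= p:
            return g
    return 0.0  # illustrative convention for p below the smallest grid value


def d_level_sets(grids: Sequence[Sequence[float]]) -> List[Tuple[float, ...]]:
    """All d-level sets: one grid value per coordinate."""
    return list(product(*grids))
```

Because consecutive grid values differ by a factor of $1+\epsilon(k)$ , rounding down loses at most that factor in each coordinate, which is the inequality $\lfloor\textbf{p}_{x}\rfloor_{\textbf{P}}\geq\frac{\textbf{p}_{x}}{1+\epsilon}$ noted above.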

Multiplicity discretization: Let $\textbf{M}=\{\mathrm{m}_{j}:j=1\dots\textbf{e}\}$ be the set representing discretization of multiplicity space where each element $\mathrm{m}_{j}$ represents a $d$ -frequency. Further each element $\mathrm{m}_{j}$ is of the following form: for each $k\in[1,d]$ , $\mathrm{m}_{j}(k)\in\{1,\lceil(1+\gamma(k)/2)^{1}\rceil,\lceil(1+\gamma(k)/2)^{2}\rceil,\dots,\lceil(1+\gamma(k)/2)^{\textbf{e}_{k}-1}\rceil,n\}\cup\{1,2,3,\dots,\lceil\frac{1}{\gamma(k)}\rceil\}$ for some fixed $\gamma\in\mathbb{R}^{1\times d}_{>0}$ and $\textbf{e}_{k}\in O(\frac{\log\textbf{n}(k)}{\gamma(k)})$ is such that $\lceil(1+\gamma(k)/2)^{\textbf{e}_{k}}\rceil\geq\textbf{n}(k)$ , $\lceil(1+\gamma(k)/2)^{\textbf{e}_{k}-1}\rceil<\textbf{n}(k)$ and as before $0<\gamma(k)<1$ . Note that $\textbf{e}=|\textbf{M}|=\prod_{k=1}^{d}\textbf{e}_{k}\in O(\prod_{k=1}^{d}\frac{\log\textbf{n}(k)}{\gamma(k)})$ .
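For one coordinate $k$ , the multiplicity grid and the rounding-up operation $\lceil\cdot\rceil_{\textbf{M}}$ can be sketched in Python as below. The function names are hypothetical, and rounding up is interpreted as taking the smallest grid element that is at least the given frequency, which is an assumption about the displayed definition of the ceil operation.

```python
import math
from bisect import bisect_left
from typing import List


def multiplicity_grid(gamma: float, n: int) -> List[int]:
    """Sorted grid: {1, ..., ceil(1/gamma)}, the values ceil((1 + gamma/2)^t)
    for t < e_k, and n, where e_k is the first t with ceil((1 + gamma/2)^t) >= n."""
    values = set(range(1, math.ceil(1 / gamma) + 1))
    values.add(n)
    t = 0
    while math.ceil((1 + gamma / 2) ** t) < n:
        values.add(math.ceil((1 + gamma / 2) ** t))
        t += 1
    return sorted(values)


def ceil_to_grid(f: int, grid: List[int]) -> int:
    """Smallest grid element that is at least f."""
    idx = bisect_left(grid, f)
    if idx == len(grid):
        raise ValueError("frequency exceeds the largest grid element")
    return grid[idx]
```

With $\gamma=0.5$ and $n=100$ , frequencies $1$ and $2$ are left unchanged, while larger frequencies grow by a factor of at most $1+\gamma$ .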

Discrete $d$ -type: For a sequence $\textbf{y}^{\textbf{n}}\in\mathcal{D}^{n}$ , $\psi^{\prime}=\Psi^{\prime}(\textbf{y}^{\textbf{n}})\in\mathbb{R}^{\mathcal{D}\times d}$ is its discrete $d$ -type if $\psi^{\prime}_{x}=\lceil\textbf{f}(\textbf{y}^{\textbf{n}},x)\rceil_{\textbf{M}}$ .

Discrete $d$ -profile: For a $d$ -sequence $\textbf{y}^{\textbf{n}}\in\mathcal{D}^{\textbf{n}}$ , $\phi^{\prime}=\Phi^{\prime}(\textbf{y}^{\textbf{n}})\in\mathbb{Z}_{+}^{\textbf{M}}$ is a discrete $d$ -profile if $\phi^{\prime}=(\phi^{\prime}_{j})_{j=1\dots\textbf{e}}$ , where $\phi^{\prime}_{j}=|\{x\in\mathcal{D}~{}|~{}\lceil\textbf{f}(\textbf{y}^{\textbf{n}},x)\rceil_{\textbf{M}}=\mathrm{m}_{j}\}|$ and $\textbf{n}^{\prime}=\sum_{x\in\mathcal{D}}\lceil\textbf{f}(\textbf{y}^{\textbf{n}},x)\rceil_{\textbf{M}}\leq(1+\gamma)\times\textbf{n}$ is its $d$ -length.

E.2 Existence of Structured Approximate Solution

Here we show the existence of an approximate PML $d$ -distribution with a nice structure over the next several lemmas. First, we show that one can assume the minimum non-zero probability of the PML $d$ -distribution is $\Omega(\frac{1}{\textbf{n}(k)^{2}})$ for each $k\in[1,d]$ by only losing $\exp\left(-O(d)\right)$ in the PML objective value.

Lemma E.1 (Minimum probability lemma).

For any $d$ -profile $\phi\in\Phi^{n}$ , there exists a $d$ -distribution $\textbf{p}^{\prime\prime}\in\Delta^{\mathcal{D},d}$ such that $\textbf{p}^{\prime\prime}$ is a $\exp\left(-O(d)\right)$ -approximate PML $d$ -distribution with $\min_{x\in\mathcal{D}:\textbf{p}^{\prime\prime}_{x}(k)\neq 0}\textbf{p}^{\prime\prime}_{x}(k)\geq\frac{1}{2\textbf{n}(k)^{2}}$ for all $k\in[1,d]$ .

Proof.

See Section F.1. ∎

Next we show that working with discrete $d$ -level sets and $d$ -frequencies doesn’t significantly decrease the PML objective value. Our next lemma formally proves this statement.

Lemma E.2 (Probability discretization lemma).

For any $d$ -profile $\phi\in\Phi^{n}$ and $d$ -distribution $\textbf{p}\in\Delta^{\mathcal{D},d}$ , its discrete $d$ -pseudodistribution $\textbf{q}=\mathrm{disc}(\textbf{p})$ satisfies:

[TABLE]

Proof.

The first inequality is immediate because $\textbf{q}_{x}=\lfloor\textbf{p}_{x}\rfloor_{\textbf{P}}\leq\textbf{p}_{x}$ for all $x\in\mathcal{D}$ . To show second inequality consider any $d$ -sequence $\textbf{y}^{\textbf{n}}\in\mathcal{D}^{\textbf{n}}$ ,

[TABLE]

In the inequality above we use $\sum_{x\in\mathcal{D}}\textbf{f}(\textbf{y}^{\textbf{n}(k)},x)=\textbf{n}(k)$ for all $k\in[1,d]$ . Now,

[TABLE]

∎

Our previous lemma showed that we can work in the discretized probability space and in our next lemma we show that discretization of multiplicities also doesn’t change our objective value by much. For a $d$ -sequence $\textbf{y}^{\textbf{n}}\in\mathcal{D}^{n}$ , we first provide an equivalent formulation for the probability of its $d$ -profile $\phi=\Phi(\textbf{y}^{\textbf{n}})$ (from Equation 20 in [OSZ03], Equation 15 in [PJW17]) in terms of its $d$ -type $\psi=\Psi(\textbf{y}^{\textbf{n}})$ . The formulations provided in [OSZ03] and [PJW17] are for two dimensions, and it is not hard to see that they generalize to higher dimensions in the following way:

[TABLE]

where $S_{\mathcal{D}}$ is the set of all permutations of domain set $\mathcal{D}$ and $\phi_{0}$ is the number of domain elements with frequency $(0,\dots 0)$ (unseen domain elements). The difference between Equation 51 and Equation 50 is the index set over which they are summed.

Lemma E.3 (Profile discretization lemma).

For any $d$ -distribution $\textbf{p}\in\Delta^{\mathcal{D},d}$ , and a $d$ -sequence $\textbf{y}^{\textbf{n}}\in\mathcal{D}^{\textbf{n}}$ :

[TABLE]

where $\phi=\Phi(\textbf{y}^{\textbf{n}})$ and $\phi^{\prime}=\Phi^{\prime}(\textbf{y}^{\textbf{n}})$ are the $d$ -profile and discrete $d$ -profile of $\textbf{y}^{\textbf{n}}$ respectively.

Proof.

Let $\psi=\Psi(\textbf{y}^{\textbf{n}})$ and $\psi^{\prime}=\Psi^{\prime}(\textbf{y}^{\textbf{n}})$ be $d$ -type and discrete $d$ -type of $d$ -sequence $\textbf{y}^{\textbf{n}}$ respectively. By Equation 51:

[TABLE]

Similarly:

[TABLE]

where $\phi^{\prime}_{0}$ is the number of unseen domain elements in profile $\phi^{\prime}$ . Note $\phi^{\prime}_{0}=\phi_{0}$ because our discretization procedure does not change the number of unseen domain elements. We now analyze both objectives term by term. For any permutation $\sigma\in S_{\mathcal{D}}$

[TABLE]

The first inequality above follows because $\psi^{\prime}_{\sigma(x)}\leq\psi_{\sigma(x)}\times(1+\gamma)$ and using $\psi_{\sigma(x)}\leq\psi^{\prime}_{\sigma(x)}$ we get the right hand side of the following inequality.

[TABLE]

Let us consider the terms $\binom{\textbf{n}}{\psi}$ and $\binom{\textbf{n}^{\prime}}{\psi^{\prime}}$ ; we upper bound their ratio next:

[TABLE]

Next we will lower bound the ratio considered above.

[TABLE]

Combining both we get:

[TABLE]

For the final term, consider all $d$ -frequencies generated by domain elements $x$ in $d$ -sequence $\textbf{y}^{\textbf{n}}$ . Observe that during our discretization procedure all $d$ -frequencies less than $\lceil\frac{1}{\gamma}\rceil$ are never affected and we upper bound the number of $d$ -frequencies that change.

Analogous to the proof in one dimension, for each $k\in[1,d]$ , the number of domain elements $x\in\mathcal{D}$ with $\textbf{f}(\textbf{y}^{\textbf{n}(k)},x)>\lceil\frac{1}{\gamma(k)}\rceil$ is less than $\gamma(k)\textbf{n}(k)$ . Further, the number of domain elements $x\in\mathcal{D}$ with $\textbf{f}(\textbf{y}^{\textbf{n}(k)},x)>\lceil\frac{1}{\gamma(k)}\rceil$ for some $k\in[1,d]$ is less than $\sum_{k=1}^{d}\gamma(k)\textbf{n}(k)$ . The previous statement gives the upper bound $\sum_{\{j\in[1,\textbf{e}]~{}|~{}\exists k\in[1,d]\text{ with }\textbf{f}_{j}(k)>\lceil\frac{1}{\gamma(k)}\rceil\}}\phi_{j}\leq\sum_{k=1}^{d}\gamma(k)\textbf{n}(k)$ . This further implies $\sum_{\{j\in[1,\textbf{e}]~{}|~{}\exists k\in[1,d]\text{ with }\mathrm{m}_{j}(k)>\lceil\frac{1}{\gamma(k)}\rceil\}}\phi^{\prime}_{j}\leq\sum_{k=1}^{d}\gamma(k)\textbf{n}(k)$ . Combining the previous reasoning with the fact that all $d$ -frequencies less than $\lceil\frac{1}{\gamma}\rceil$ are never changed we get the following inequality.

[TABLE]

Combining previous inequality with eq. 52, eq. 53 we have our result. ∎

Our next corollary captures the impact of discretizing both probabilities and multiplicities.

Corollary E.4 (Discretization lemma).

For any $d$ -distribution $\textbf{p}\in\Delta^{\mathcal{D},d}$ and $d$ -sequence $\textbf{y}^{\textbf{n}}\in\mathcal{D}^{\textbf{n}}$ , if $\textbf{q}=\mathrm{disc}(\textbf{p})$ is the discrete $d$ -pseudodistribution of p then,

[TABLE]

where $\phi=\Phi(\textbf{y}^{\textbf{n}})$ and $\phi^{\prime}=\Phi^{\prime}(\textbf{y}^{\textbf{n}})$ are the $d$ -profile and discrete $d$ -profile of $\textbf{y}^{\textbf{n}}$ respectively.

Proof.

Corollary follows immediately by combining Lemma E.2 and Lemma E.3. ∎

The discretization lemma above motivates the definition of a new objective function which we introduce and study next.

E.3 Discrete PML Optimization

Here we define a new optimization problem that can be solved efficiently and returns a $d$ -distribution which has a good approximation to the PML objective value. First we define the discrete profile maximum likelihood which is just the PML objective maximized over discrete $d$ -pseudodistributions.

Definition E.5 (Discrete profile maximum likelihood).

Let $\textbf{y}^{\textbf{n}}\in\mathcal{D}^{\textbf{n}}$ be any $d$ -sequence, $\phi=\Phi(\textbf{y}^{\textbf{n}})$ and $\phi^{\prime}=\Phi^{\prime}(\textbf{y}^{\textbf{n}})$ be its $d$ -profile and discrete $d$ -profile respectively, a Discrete Profile Maximum Likelihood (DPML) $d$ -pseudodistribution $\textbf{q}_{dpml,\phi^{\prime}}$ is:

[TABLE]

$\mathbb{P}(\textbf{q}_{dpml,\phi^{\prime}},\phi^{\prime})$ is the maximum objective value.

Corollary E.6 (DPML is an approximate PML).

For any $d$ -sequence $\textbf{y}^{\textbf{n}}\in\mathcal{D}^{\textbf{n}}$ , $~{}~{}~{}~{}~{}\mathbb{P}(\textbf{q}_{dpml,\phi^{\prime}},\phi^{\prime})\geq\exp\left(-\widetilde{O}\left(\sum_{k=1}^{d}\epsilon(k)\textbf{n}(k)+\sum_{k=1}^{d}\gamma(k)\textbf{n}(k)\right)\right)\mathbb{P}(\textbf{p}_{pml,\phi},\phi)$

Proof.

Note that $\textbf{q}_{pml,\phi}=\mathrm{disc}(\textbf{p}_{pml,\phi})$ is a discrete $d$ -pseudodistribution. The result follows from Corollary E.4 applied to $\textbf{p}_{pml,\phi}$ . ∎

In the next two lemmas we rephrase the DPML optimization problem in forms that are amenable to convex relaxation. To do this, we introduce some new notation.

$\bullet$ Let $\zeta\in\mathbb{R}^{\textbf{b}\times d}$ be the matrix with rows indexed from $1$ to b whose $i$ th row is equal to $d$ -level set $\zeta_{i}\in\textbf{P}$ . Also let $\mathrm{m}\in\mathbb{R}^{(\textbf{e}+1)\times d}$ be the matrix with rows indexed from $0$ to e. Its zeroth row (denoted by $\mathrm{m}_{0}$ ) is equal to $d$ -frequency $(0,\dots 0)$ and its $j$ th row is equal to $d$ -frequency $\mathrm{m}_{j}\in\textbf{M}$ . We use $\mathrm{m}(k)$ and $\zeta(k)$ to denote the $k$ th column of matrix $\mathrm{m}$ and $\zeta$ respectively.

$\bullet$ Let $X\in\mathbb{Z}_{+}^{\textbf{b}\times(\textbf{e}+1)}$ be a variable matrix and we use $X_{ij}$ for $i\in[1,\textbf{b}],j\in[0,\textbf{e}]$ to denote elements of this matrix. As in the case of the matrix $\mathrm{m}$ , the second index $j$ of variable matrix $X$ starts at $0$ and not at $1$ . Here the variable $X_{ij}$ counts the number of domain elements $x\in\mathcal{D}$ that have $d$ -level set $\zeta_{i}$ and $d$ -frequency equal to $\mathrm{m}_{j}$ . $X_{i,0}$ counts the number of domain elements $x\in\mathcal{D}$ with $d$ -level set $\zeta_{i}$ and $d$ -frequency equal to $(0,\dots 0)$ . We use the functions $\log\zeta(k)$ and $\log\zeta$ to perform entrywise operations returning entities of the same dimension as $\zeta(k)$ and $\zeta$ respectively with $\log$ applied on every entry.

$\bullet$ For any matrix v and set $S$ , we use $\textbf{v}_{S}$ to denote the matrix with $|S|$ rows corresponding to index set $S$ .

$\bullet$ For a discrete $d$ -profile $\phi^{\prime}=(\phi^{\prime}_{j})_{j=1\dots\textbf{e}}$ (corresponding to $d$ -sequence $\textbf{y}^{\textbf{n}}$ ), define:

$~{}~{}~{}~{}\textbf{K}_{\phi^{\prime}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{X\in\mathbb{Z}_{+}^{\textbf{b}\times(\textbf{e}+1)}~{}\Big{|}~{}~{}(X^{T}\mathrm{1})_{[1,\textbf{e}]}=\phi^{\prime},\text{ and }(X\mathrm{1})^{T}\zeta\leq 1\}$

Note that in the expression above $(X\mathrm{1})^{T}\zeta$ is a $d$ -tuple and $(X\mathrm{1})^{T}\zeta\leq 1$ means each entry of this $d$ -tuple is at most 1 (as described in the preliminaries section).

$\bullet$ For a discrete $d$ -profile $\phi^{\prime}=(\phi^{\prime}_{j})_{j=1\dots\textbf{e}}$ (of $\textbf{y}^{\textbf{n}}$ ) and a discrete $d$ -pseudodistribution q, also define:

$~{}~{}~{}~{}\textbf{K}_{\textbf{q},\phi^{\prime}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{X\in\mathbb{Z}_{+}^{\textbf{b}\times(\textbf{e}+1)}~{}\Big{|}~{}~{}(X^{T}\mathrm{1})_{[1,\textbf{e}]}=\phi^{\prime},\text{ and }X\mathrm{1}=\ell^{\textbf{q}}\}$ where $\ell^{\textbf{q}}\in\mathbb{R}^{\textbf{b}}$ and $\ell^{\textbf{q}}_{i}$ denote the number of domain elements with $d$ -level set $\zeta_{i}\in\textbf{P}$ in $d$ -pseudodistribution q.

One of the most important advantages of $d$ -level set and $d$ -frequency discretization we described earlier is that many $d$ -types in the set $\{\psi~{}|~{}\Phi(\psi)=\phi^{\prime}\}$ share the same probability value of being observed and our goal is to group them using the $X_{ij}$ variables. Exploiting this idea, we next give a different formulation for the DPML objective.
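The grouping by $X_{ij}$ variables can be illustrated with a small Python sketch that builds the assignment matrix from a labeled discrete $d$ -pseudodistribution and a discrete $d$ -type. All names are hypothetical; rows are indexed from $0$ in the code rather than from $1$ , and column $0$ corresponds to the all-zero $d$ -frequency $\mathrm{m}_{0}$ .

```python
from typing import Dict, Hashable, List, Tuple

Level = Tuple[float, ...]
Freq = Tuple[int, ...]


def assignment_matrix(levels: Dict[Hashable, Level],
                      freqs: Dict[Hashable, Freq],
                      zeta: List[Level],
                      m: List[Freq]) -> List[List[int]]:
    """X[i][j] = number of elements with d-level set zeta[i] and d-frequency m[j].

    levels[x] is the d-level set of x in q; freqs[x] is the discretized
    d-frequency of x (elements missing from freqs are unseen, i.e. m[0]).
    """
    row = {z: i for i, z in enumerate(zeta)}
    col = {f: j for j, f in enumerate(m)}
    X = [[0] * len(m) for _ in zeta]
    for x, z in levels.items():
        X[row[z]][col[freqs.get(x, m[0])]] += 1
    return X
```

In this representation the column sums over $j\geq 1$ recover the discrete profile $\phi^{\prime}$ and the row sums recover $\ell^{\textbf{q}}$ , matching the two constraints that define $\textbf{K}_{\textbf{q},\phi^{\prime}}$ .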

Lemma E.7 (DPML objective reformulation).

For any discrete $d$ -pseudodistribution $\textbf{q}\in\Delta^{\mathcal{D},d}$ and discrete $d$ -profile $\phi^{\prime}\in\Phi^{\textbf{n}^{\prime}}$ :

[TABLE]

Proof.

Recall from Equation 50

[TABLE]

For convenience, we call a $d$ -type $\psi$ valid if it belongs to set $\{\psi~{}|~{}\Phi(\psi)=\phi^{\prime}\}$ . Recall variable $X_{ij}$ represents the number of domain elements with $d$ -level set $\zeta_{i}$ and have $d$ -frequency equal to $\mathrm{m}_{j}$ . In this representation and for the discrete $d$ -pseudodistribution q, each valid $d$ -type $\psi$ corresponds to the following unique variable assignment $X\in\textbf{K}_{\textbf{q},\phi^{\prime}}$ :

[TABLE]

and from the expression above it is not hard to write the exact expression for the probability term associated with the valid $d$ -type $\psi$ :

[TABLE]

For any variable assignment $X$ , it is clear from the middle term in Equation 56 that all valid $d$ -types $\psi$ associated with $X$ share the same probability value of being observed. With this observation, it is now enough to argue about the number of valid $d$ -types associated with a variable assignment $X$ to prove our lemma. We make this argument next by constructing all valid $d$ -types associated with $X$ .

First consider all domain elements with a fixed $d$ -level set $\zeta_{i}$ and number of such elements is equal to $\sum_{j=0}^{\textbf{e}}X_{ij}$ . We can now generate part of a valid $d$ -type corresponding to the domain elements with $d$ -level set equal to $\zeta_{i}$ by picking any partition of these $\sum_{j=0}^{\textbf{e}}X_{ij}$ domain elements into groups of sizes $\{X_{ij}\}_{j\in[0,\textbf{e}]}$ . This corresponds to multinomial coefficient and therefore the number of types associated with $X$ is just:

[TABLE]

Here we only generated partial valid $d$ -types corresponding to domain elements with $d$ -level set equal to $\zeta_{i}$ . To generate a full valid $d$ -type we just need to combine these partial valid $d$ -types generated for each $d$ -level set $\zeta_{i}$ . Let $S_{X}$ denote all such full valid $d$ -types associated with a variable assignment $X$ and generating a full valid $d$ -type corresponds to groups (for each $d$ -level set $\zeta_{i}$ ) of independent possibilities considered conjointly. Further the cardinality of set $S_{X}$ is just the multiplication of cardinalities of each of these groups and is explicitly written below,

[TABLE]
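The counting argument above, one multinomial coefficient per $d$ -level set multiplied across level sets, can be written as a short Python sketch. The function names are hypothetical and $X$ is assumed to have non-negative integer entries.

```python
from math import factorial, prod
from typing import List, Sequence


def multinomial(counts: Sequence[int]) -> int:
    """Number of ways to split sum(counts) labeled elements into
    groups of the given sizes."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total


def num_valid_types(X: List[List[int]]) -> int:
    """|S_X|: product over level sets i of the multinomial coefficient
    with group sizes X[i][0], ..., X[i][e]."""
    return prod(multinomial(row) for row in X)
```

For instance, a single level set holding four elements with group sizes $(2,1,1)$ admits $4!/(2!\,1!\,1!)=12$ partial types.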

We are almost done and all we do next is formally derive the expression in our lemma statement to complete the proof. From Equation 50,

[TABLE]

∎

Lemma E.8 (DPML objective relaxed).

For any $d$ -sequence $\textbf{y}^{\textbf{n}}\in\mathcal{D}^{\textbf{n}}$ , and a discrete $d$ -pseudodistribution $\textbf{q}\in\Delta^{\mathcal{D},d}$ the DPML objective can be upper bounded by:

[TABLE]

where $\phi^{\prime}=\Phi^{\prime}(\textbf{y}^{\textbf{n}})\in\Phi^{\textbf{n}^{\prime}}$ is discrete $d$ -profile of $\textbf{y}^{\textbf{n}}$ .

Proof.

The proof follows because $\textbf{K}_{\textbf{q},\phi^{\prime}}\subseteq\textbf{K}_{\phi^{\prime}}$ and invoking Lemma E.7. ∎

We are halfway through defining our final optimization problem, which admits efficient algorithms. In the final optimization problem we optimize over a single term in the set $\textbf{K}_{\phi^{\prime}}$ instead of working with the summation over all terms. The next two lemmas motivate this choice by showing that the optimizing $d$ -pseudodistribution of the final optimization problem is still an approximate PML $d$ -distribution.

Lemma E.9 (Cardinality of $\textbf{K}_{\phi^{\prime}}$ ).

For any $d$ -sequence $\textbf{y}^{\textbf{n}}\in\mathcal{D}^{\textbf{n}}$ and its associated discrete $d$ -profile $\phi^{\prime}=\Phi^{\prime}(\textbf{y}^{\textbf{n}})$ :

[TABLE]

Proof.

$\textbf{K}_{\phi^{\prime}}$ is a set of vectors in $\mathbb{Z}_{+}^{\textbf{b}\times(\textbf{e}+1)}$ and, because of Lemma E.1 combined with the constraint $(X\mathrm{1})^{T}\zeta\leq 1$ , each $X_{ij}$ takes only non-negative integer values less than $\min_{k\in[1,d]}2\textbf{n}(k)^{2}$ . The lemma statement follows by substituting the values of b and e. ∎

As described earlier, Lemma E.9 motivates us to consider the following objective. Define:

[TABLE]

It is important to note that there is a discrete $d$ -pseudodistribution $\textbf{q}_{X}$ corresponding to each variable assignment $X\in\textbf{K}_{\phi^{\prime}}$ . This $d$ -pseudodistribution is described as follows: for each $i\in[1,\textbf{b}]$ , the number of domain elements that have $d$ -level set $\zeta_{i}$ in $\textbf{q}_{X}$ is equal to $(X\mathrm{1})_{i}$ . This description only specifies the non-zero $d$ -level sets and does not provide any labels; however, it is sufficient for estimating all symmetric properties mentioned in this paper.
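The correspondence between a variable assignment and its $d$ -pseudodistribution can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names and the toy level sets are assumptions, and the output is an unlabeled multiset of level sets, in line with the remark that labels are not needed for symmetric properties.

```python
def pseudodistribution_from_assignment(X, zeta):
    """Return the unlabeled multiset of d-level sets described by X.

    X    : list of b rows, each a list of e+1 non-negative integers.
    zeta : list of b level sets, each a tuple of d probability values.
    Row i contributes (X1)_i = sum(X[i]) domain elements with level set zeta[i].
    """
    q = []
    for row, level in zip(X, zeta):
        q.extend([tuple(level)] * sum(row))
    return q


def coordinate_masses(q, d):
    """Total mass of q in each of the d coordinates (at most 1 for a pseudodistribution)."""
    return [sum(level[k] for level in q) for k in range(d)]


# Toy example (hypothetical values) with b = 2 level sets and d = 2.
zeta = [(0.25, 0.125), (0.125, 0.25)]
X = [[1, 1], [0, 2]]
q_X = pseudodistribution_from_assignment(X, zeta)
masses = coordinate_masses(q_X, d=2)
print(len(q_X), masses)
```

The check that every coordinate mass is at most 1 corresponds to the constraint $(X\mathrm{1})^{T}\zeta\leq 1$ .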

Definition E.10 (Single discrete profile maximum likelihood).

For any $d$ -sequence $\textbf{y}^{\textbf{n}}\in\mathcal{D}^{\textbf{n}}$ and its associated discrete $d$ -profile $\phi^{\prime}=\Phi^{\prime}(\textbf{y}^{\textbf{n}})\in\Phi^{\textbf{n}^{\prime}}$ , a Single Discrete Profile Maximum Likelihood (SDPML) $d$ -pseudodistribution $\textbf{q}_{sdpml,\phi^{\prime}}$ is:

[TABLE]

and $\textbf{q}_{sdpml,\phi^{\prime}}$ is the $d$ -pseudodistribution corresponding to $X_{sdpml,\phi^{\prime}}$ .

Lemma E.11 (SDPML related to PML).

For any $d$ -sequence $\textbf{y}^{\textbf{n}}\in\mathcal{D}^{\textbf{n}}$ ,

[TABLE]

where $\phi=\Phi(\textbf{y}^{\textbf{n}})$ and $\phi^{\prime}=\Phi^{\prime}(\textbf{y}^{\textbf{n}})$ are $d$ -profile and discrete $d$ -profile associated with $\textbf{y}^{\textbf{n}}$ .

Proof.

$C_{\phi^{\prime}}\textbf{w}_{\mathrm{sdpml}}(X_{sdpml,\phi^{\prime}})\geq C_{\phi^{\prime}}\textbf{w}_{\mathrm{sdpml}}(X_{dpml,\phi^{\prime}})\geq\exp\left(-O\left(\prod_{k=1}^{d}\frac{\log^{3}\textbf{n}(k)}{\epsilon(k)\gamma(k)}\right)\right)\mathbb{P}(\textbf{q}_{dpml,\phi^{\prime}},\phi^{\prime})$

[TABLE]

The second inequality follows from Lemmas E.9 and E.8, and the last follows from Lemma E.6. ∎

E.4 Convex relaxation of SDPML

We showed in the previous subsection that the SDPML objective is a good approximation to the PML objective. However the objective function of SDPML is defined only over the integers and in this subsection we present a convex relaxation of SDPML.

First, we consider the feasible set $\textbf{K}_{\phi^{\prime}}$ of SDPML, which is the following integral polytope

[TABLE]

We relax the integer constraint on variables $X_{ij}$ :

[TABLE]

In the later subsections, we show how we deal with these fractional solutions by presenting a rounding algorithm with a good approximation ratio.

Secondly, we relax the objective function of SDPML itself. The objective of SDPML is defined only on the integral set. We next define a continuous relaxation of this objective function which is also log-concave. To do so, we use an approximation of the factorial function (similar to Stirling’s approximation) which handles $0!$ terms as well. We use the following function as the continuous proxy of the SDPML objective (using the convention that $0\log 0=0$ ):

[TABLE]

The lemma below states that continuous version is not far from the actual SDPML objective.

Lemma E.12 ( $\textbf{g}(\cdot)$ approximates SDPML objective).

Let $\textbf{y}^{\textbf{n}}\in\mathcal{D}^{\textbf{n}}$ be any $d$ -sequence and $\phi^{\prime}=\Phi^{\prime}(\textbf{y}^{\textbf{n}})\in\Phi^{\textbf{n}^{\prime}}$ its associated discrete $d$ -profile. If $X\in\textbf{K}_{\phi^{\prime}}$ , then

[TABLE]

Proof.

By Stirling’s approximation, for all integers $n\geq 1$ :

[TABLE]

We use a slightly weaker version of this inequality that holds for all integers $n\geq 0$ ,

[TABLE]

In the final inequality we used the fact that for each $i\in[1,\textbf{b}]$ , $(X\mathrm{1})_{i}\leq\min_{k\in[1,d]}2\textbf{n}(k)^{2}$ (Lemma E.1 combined with the constraint $(X\mathrm{1})^{T}\zeta\leq 1$ ensures this fact) and substituted the value of b. Also,

[TABLE]

∎
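As a sanity check on the kind of factorial bound used in the proof above, the sketch below numerically verifies one standard Stirling-type bound that also covers $n=0$ , namely $0\leq\log n!-(n\log n-n)\leq 1+\frac{1}{2}\log(n+1)$ with the convention $0\log 0=0$ . This particular form is an assumption made for illustration and may differ in constants from the inequality used in the paper; the function names are hypothetical.

```python
from math import lgamma, log


def xlogx(n):
    """n log n with the convention 0 log 0 = 0."""
    return 0.0 if n == 0 else n * log(n)


def stirling_gap(n):
    """log(n!) - (n log n - n), computed via the log-gamma function."""
    return lgamma(n + 1) - (xlogx(n) - n)


def check_stirling(max_n, tol=1e-9):
    """Check 0 <= gap(n) <= 1 + 0.5 * log(n + 1) for all 0 <= n <= max_n."""
    for n in range(max_n + 1):
        gap = stirling_gap(n)
        if not (-tol <= gap <= 1.0 + 0.5 * log(n + 1) + tol):
            return False
    return True


print(check_stirling(2000))
```

The gap grows only logarithmically in $n$ , which is why replacing factorials by the continuous proxy costs only a factor exponential in a sum of logarithms.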

A key fact about function $\textbf{g}(X)$ is that it is log-concave, so we can apply optimization machinery from convex optimization.

Lemma E.13.

Function $\textbf{g}(X)$ is log-concave in $X$ .

Proof.

Taking $\log$ on both sides of Equation 60 we get,

[TABLE]

The first term $\mathrm{tr}(\log\zeta^{T}X\mathrm{m})$ is linear in $X$ ; see Lemma C.1 for the concavity of the second term. Combining both, $\log\textbf{g}(X)$ is the sum of a linear term and a concave term and is therefore concave. Hence the function $\textbf{g}(X)$ is log-concave. ∎
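The concavity of the counting term can be probed numerically. The sketch below assumes that the second term of $\log\textbf{g}(X)$ has the form $\sum_{i}(X\mathrm{1})_{i}\log(X\mathrm{1})_{i}-\sum_{i,j}X_{ij}\log X_{ij}$ , which is what a Stirling-type approximation of the multinomial coefficients would give; this form is an assumption, since the defining equation is not reproduced here. It then tests midpoint concavity on random non-negative matrices. A passing test is evidence, not a proof.

```python
import random
from math import log


def xlogx(x):
    """x log x with the convention 0 log 0 = 0."""
    return 0.0 if x == 0 else x * log(x)


def counting_term(X):
    """Assumed second term: sum_i (X1)_i log (X1)_i - sum_ij X_ij log X_ij."""
    return sum(xlogx(sum(row)) - sum(xlogx(v) for v in row) for row in X)


def midpoint_concave(trials=1000, rows=3, cols=4, seed=0, tol=1e-9):
    """Check f((A+B)/2) >= (f(A)+f(B))/2 on random non-negative matrices."""
    rng = random.Random(seed)
    for _ in range(trials):
        A = [[rng.uniform(0, 5) for _ in range(cols)] for _ in range(rows)]
        B = [[rng.uniform(0, 5) for _ in range(cols)] for _ in range(rows)]
        M = [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
        if counting_term(M) < 0.5 * (counting_term(A) + counting_term(B)) - tol:
            return False
    return True


print(midpoint_concave())
```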

Maximizing the log-concave objective function $\textbf{g}(\cdot)$ over the relaxed convex set $\textbf{K}^{f}_{\phi^{\prime}}$ is a convex optimization problem and can be solved efficiently. Below is the convex relaxation of our SDPML objective; its efficient solvability is summarized by our next theorem.

[TABLE]

Theorem E.14 (Solver for convex relaxation to SDPML).

Optimization problem 61 can be solved in time $O\left(\textbf{e}^{2}\textbf{b}\log^{O(1)}(\textbf{b}\textbf{e})+\textbf{e}^{3}\log^{O(1)}(\textbf{b}\textbf{e})\right)$ .

Proof.

The optimization problem 61 is already in the form of the optimization problem studied for one dimension in Appendix D. To invoke the result in Appendix D, all we need is a lower bound on the minimum eigenvalue of the matrix $\textbf{A}^{T}\textbf{A}$ , where A is the constraint matrix when the optimization problem 61 is written in vector form (described in Appendix D). We state this constraint matrix A for the optimization problem 61 and provide a lower bound on the minimum eigenvalue of the matrix $\textbf{A}^{T}\textbf{A}$ in Section F.3. The number of variables in the optimization problem 61 is $\textbf{b}\times\textbf{e}$ and the number of constraints is $\textbf{e}+d\leq 2\textbf{e}$ . In the notation of Appendix D, the parameter values are $b_{1}=\textbf{b}$ and $b_{2}=\textbf{e}$ , and the running time we get for the optimization problem 61 is the one stated in the theorem. ∎

E.5 Algorithm and Runtime Analysis

In this section we give an algorithm to find a $d$ -distribution that approximates the PML objective. Our analysis in the previous sections shows that it suffices to find a $d$ -distribution that approximates the SDPML objective, which we replaced by a convex proxy. We now present an algorithm that takes an optimal solution to this convex proxy and produces a $d$ -distribution that approximates the PML objective. Recall that $\textbf{K}^{f}_{\phi^{\prime}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{X\in\mathbb{R}^{\textbf{b}\times(\textbf{e}+1)}~{}\big{|}~{}(X^{T}\mathrm{1})_{[1,\textbf{e}]}=\phi^{\prime},\text{ and }(X\mathrm{1})^{T}\zeta\leq 1\}$ .

The solution $X$ returned by the rounding procedure is defined on an extended discretized $d$ -probability space $\textbf{P}^{\prime}$ , where $\textbf{P}^{\prime}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\textbf{P}\cup\{\zeta_{\textbf{b}+j}\}_{j\in[1,\textbf{e}]}$ . To derive the relation between the solution $X$ and the PML objective value we need to extend some definitions studied earlier. First, we define $\zeta_{\mathrm{ext}}$ as the matrix whose rows are exactly the elements of $\textbf{P}^{\prime}$ and we call it the extended $d$ -level set matrix. Note that we still use $\zeta_{i}$ for all $i\in[1,\textbf{b}+\textbf{e}]$ to refer to the rows of $\zeta_{\mathrm{ext}}$ . Further, for any $d$ -pseudodistribution q with $\textbf{q}_{x}\in\textbf{P}^{\prime}$ for all $x\in\mathcal{D}$ (we call it an extended discrete $d$ -pseudodistribution) and discrete $d$ -profile $\phi^{\prime}$ , we define the following extensions of the sets $\textbf{K}_{\textbf{q},\phi^{\prime}}$ and $\textbf{K}_{\phi^{\prime}}$ ,

[TABLE]

where $\ell^{\textbf{q}}\in\mathbb{R}^{\textbf{b}+\textbf{e}}$ and $\ell^{\textbf{q}}_{i}$ denote the number of domain elements with $d$ -level set $\zeta_{i}\in\textbf{P}^{\prime}$ .

Further by Lemma E.7, for any extended discrete $d$ -pseudodistribution q and a discrete $d$ -profile $\phi^{\prime}$ , the following equality holds,

[TABLE]

Similarly for any $X\in\textbf{K}^{ext}_{\textbf{q},\phi^{\prime}}$ , below are the natural extension of definitions of functions $\textbf{w}_{\mathrm{sdpml}}(\cdot)$ and $\textbf{g}(\cdot)$ ,

[TABLE]

We are now ready to analyze our rounding algorithm. First we state some properties satisfied by the solution $X$ returned by our rounding procedure.

Claim E.15.

The solution $X\in\mathbb{Z}_{+}^{(\textbf{b}+\textbf{e})\times(\textbf{e}+1)}$ returned by rounding procedure (2) above satisfies:

1. $(X^{\prime}\mathrm{1})_{i}-(\textbf{e}+1)\leq(X\mathrm{1})_{i}\leq(X^{\prime}\mathrm{1})_{i}\quad\forall i\in[1,\textbf{b}]$ .

2. $X\in\textbf{K}^{ext}_{\phi^{\prime}}$ .

Proof.

Claim (1) follows because $X^{\prime}_{ij}-1\leq X_{ij}\leq X^{\prime}_{ij}$ for all $i\in[1,\textbf{b}],j\in[0,\textbf{e}]$ . Now note that $\sum_{i=1}^{\textbf{b}+\textbf{e}}X_{ij}=\sum_{i=1}^{\textbf{b}}X^{\prime}_{ij}=\phi^{\prime}_{\mathrm{m}_{j}}\quad\forall j\in[1,\textbf{e}]$ because of the adjustments made by the new level sets. Further,

[TABLE]

The final inequality follows because $X^{\prime}\in\textbf{K}^{f}_{\phi^{\prime}}$ and therefore $X\in\textbf{K}^{ext}_{\phi^{\prime}}$ and Claim (2) follows. ∎
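The two facts used in this proof (entrywise rounding down, and new level sets that restore the column sums) suggest the following minimal Python sketch of a rounding step. It is reconstructed from the properties stated in Claim E.15 and in the proof of Lemma E.16, not copied from the paper's algorithm; the function name is hypothetical, and the choice of the new level sets $\zeta_{\textbf{b}+j}$ is omitted entirely.

```python
from fractions import Fraction
from math import floor


def round_down_and_patch(X_frac):
    """Round a fractional b x (e+1) matrix to a (b+e) x (e+1) integer matrix.

    Columns are indexed 0..e. Entries are floored, and for each column
    j in 1..e a new row b+j absorbs the deficit so that the column sum
    is preserved. Column 0 is left unpatched.
    """
    b = len(X_frac)
    e = len(X_frac[0]) - 1
    X = [[floor(v) for v in row] for row in X_frac]
    for j in range(1, e + 1):
        deficit = sum(X_frac[i][j] for i in range(b)) - sum(X[i][j] for i in range(b))
        if deficit.denominator != 1:
            raise ValueError("column sums of the fractional solution must be integers")
        new_row = [0] * (e + 1)
        new_row[j] = int(deficit)  # only entry (b+j, j) may be non-zero
        X.append(new_row)
    return X


# Toy fractional solution (hypothetical) with b = 2 and e = 2.
X_frac = [[Fraction(3, 2), Fraction(1, 2), Fraction(0)],
          [Fraction(1, 2), Fraction(1, 2), Fraction(1)]]
X_int = round_down_and_patch(X_frac)
print(X_int)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 0]]
```

Exact rational arithmetic is used so that the column deficits are computed without floating-point error. Each of the first $\textbf{b}$ row sums drops by less than $\textbf{e}+1$ , matching Claim (1).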

The solution $X$ returned by (4) always belongs to $\textbf{K}^{ext}_{\phi^{\prime}}$ . Moreover, the values $\textbf{w}_{\mathrm{sdpml}}(X)$ and $\textbf{g}(X)$ are close to each other, as summarized in our next lemma.

Lemma E.16.

Any $X\in\textbf{K}^{ext}_{\phi^{\prime}}$ returned by the rounding procedure above satisfies:

[TABLE]

Proof.

For all integers $n\geq 0$ , recall the weaker version of Stirling’s approximation we used earlier,

[TABLE]

Now,

[TABLE]

and

[TABLE]

Now $\textbf{P}^{\prime}=\textbf{P}\cup\{\zeta_{\textbf{b}+j}\}_{j\in[1,\textbf{e}]}$ and for any $j\in[1,\textbf{e}]$ , $\zeta_{\textbf{b}+j}$ is a convex combination of elements in P and therefore $\zeta_{\textbf{b}+j}(k)\geq\frac{1}{2\textbf{n}(k)^{2}}$ for all $k\in[1,d]$ . In the above expression we used the fact that for each $i\in[1,\textbf{b}]$ , $(X\mathrm{1})_{i}\leq 2\textbf{n}(k)^{2}$ for all $k\in[1,d]$ . This fact holds because $\zeta_{i}(k)\geq 1/2\textbf{n}(k)^{2}$ for any $i\in[1,\textbf{b}+\textbf{e}]$ , combined with the constraint $\zeta_{\mathrm{ext}}^{T}X\mathrm{1}\leq 1$ (which holds because $X\in\textbf{K}^{ext}_{\phi^{\prime}}$ ). Also,

[TABLE]

In the second inequality we used the fact that solution $X$ returned by our rounding procedure always satisfies $X_{\textbf{b}+j,k}=0$ for all $j\in[1,\textbf{e}]$ , $k\in[0,\textbf{e}]$ and $k\neq j$ . ∎

Using Equation 62, for any $X\in\textbf{K}^{ext}_{\phi^{\prime}}$ , if $\textbf{q}_{X}$ is its corresponding extended discrete $d$ -pseudodistribution, then

[TABLE]

Lemma E.17.

The solution $X\in\textbf{K}^{ext}_{\phi^{\prime}}$ returned by Algorithm 4 satisfies:

[TABLE]

Proof.

For any $X^{\prime}\in\textbf{K}^{f}_{\phi^{\prime}}$ and $X\in\textbf{K}^{ext}_{\phi^{\prime}}$ returned by our rounding procedure below are the explicit expressions for $\textbf{g}(X)$ and $\textbf{g}(X^{\prime})$ :

[TABLE]

We first bound the probability term:

[TABLE]

The final expression above is the probability term associated with $X$ . The derivation shows that our rounding procedure only increases the probability term, so it remains to bound the counting term, which we do next.

[TABLE]

In the derivation above we used (1) in Claim E.15 and $(X^{\prime}\mathrm{1})_{i}\leq\min_{k\in[1,d]}2\textbf{n}(k)^{2}$ . It remains now to lower bound the quantity $\textbf{w}_{\mathrm{sdpml}}(X)$ :

[TABLE]

The first and second inequality follow from Lemma E.16 and Equation 66 respectively. In the third inequality we used $\textbf{g}(X^{\prime})\geq\textbf{g}(X_{sdpml})$ because $X^{\prime}$ is the optimal solution over the relaxed constraint set $\textbf{K}^{f}_{\phi^{\prime}}$ and finally invoked Lemma E.12 to relate $\textbf{w}_{\mathrm{sdpml}}$ and g. ∎

Now construct the $d$ -pseudodistribution $\textbf{q}_{X}$ corresponding to the solution $X$ returned by Algorithm 4 by assigning $(X\mathrm{1})_{i}$ elements to $d$ -level set $\zeta_{i}$ $(\forall i\in[\textbf{b}+\textbf{e}])$ . Our next theorem proves that the $d$ -distribution $\frac{\textbf{q}_{X}}{\|\textbf{q}_{X}\|_{1}}$ is an approximate PML $d$ -distribution.

Theorem E.18 (Efficient and approximate PML for higher dimension).

Let $d$ be a constant and $\textbf{y}^{\textbf{n}}$ be a $d$ -sequence of $d$ -length $\textbf{n}=(\textbf{n}(1),\dots,\textbf{n}(d))$ . Let $\epsilon,\gamma\in\mathbb{R}^{1\times d}$ be $d$ -tuples such that for each $k\in[1,d]$ , $\frac{1}{poly(\textbf{n}(k))}<\epsilon(k)<1$ and $\frac{1}{poly(\textbf{n}(k))}<\gamma(k)<1$ . Then we can compute an $\exp(-\widetilde{O}\left(\sum_{k=1}^{d}\epsilon(k)\textbf{n}(k)+\sum_{k=1}^{d}\gamma(k)\textbf{n}(k)+\prod_{k=1}^{d}\frac{1}{\epsilon(k)\gamma(k)}\right))$ -approximate PML $d$ -distribution $\textbf{p}_{approx}$ in time $\widetilde{O}\left(\sum_{k=1}^{d}\textbf{n}(k)+\prod_{k=1}^{d}\frac{1}{\epsilon(k)(\gamma(k))^{2}}+\prod_{k=1}^{d}\frac{1}{(\gamma(k))^{3}}\right)$ .

Proof.

Let $\textbf{q}_{X}$ be the $d$ -pseudodistribution corresponding to solution $X$ returned by Algorithm 4. Set $\textbf{p}_{approx}=\frac{\textbf{q}_{X}}{\|\textbf{q}_{X}\|_{1}}$ , then:

[TABLE]

The first inequality follows because $\|\textbf{q}_{X}\|_{1}\leq 1$ , second inequality from Lemma E.3, third inequality follows because $X\in\textbf{K}^{ext}_{\textbf{q}_{X},\phi^{\prime}}$ (because we constructed $\textbf{q}_{X}$ from $X$ ) and $\textbf{w}_{\mathrm{sdpml}}(X)$ computes just one term in the summation over $\textbf{K}^{ext}_{\textbf{q}_{X},\phi^{\prime}}$ (look at the representation of $\mathbb{P}(\textbf{q}_{X},\phi^{\prime})$ as summation over $\textbf{K}^{ext}_{\textbf{q}_{X},\phi^{\prime}}$ from Equation 64), fourth inequality comes from Lemma E.17 and last inequality follows from Lemma E.11.

The total running time of our algorithm is as follows. Given a $d$ -sequence $\textbf{y}^{\textbf{n}}$ , it takes $\widetilde{O}(\sum_{k=1}^{d}\textbf{n}(k)+\prod_{k=1}^{d}\frac{1}{\gamma(k)})$ time to write down the discrete $d$ -profile $\phi^{\prime}$ . We then solve the convex optimization problem 61, which takes $\widetilde{O}\left(\prod_{k=1}^{d}\frac{1}{\epsilon(k)(\gamma(k))^{2}}+\prod_{k=1}^{d}\frac{1}{(\gamma(k))^{3}}\right)$ time, and our final rounding algorithm can be implemented in time $\widetilde{O}(d\prod_{k=1}^{d}\frac{1}{\epsilon(k)\gamma(k)})$ ( $=O(d\textbf{b}\textbf{e})$ ). The total running time combining all three steps is summarized in the theorem statement. ∎

To simplify the expression, for each $k\in[1,d]$ substitute $\epsilon(k)=\gamma(k)=\textbf{n}(k)^{-1/(2d+1)}$ in the theorem above and in this parameter setting we achieve our best possible approximation ratio. See 3.5
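To see how this parameter choice balances the terms, the sketch below evaluates the three quantities inside the exponent of Theorem E.18 for $\epsilon(k)=\gamma(k)=\textbf{n}(k)^{-1/(2d+1)}$ , ignoring the polylogarithmic factors hidden by $\widetilde{O}$ . The function name and the sample sizes are hypothetical and used only for illustration.

```python
from math import isclose, prod


def exponent_terms(n):
    """Return the three exponent terms of Theorem E.18 (up to polylog factors).

    n : list of sample sizes (n(1), ..., n(d)).
    Uses eps(k) = gamma(k) = n(k) ** (-1 / (2d + 1)).
    """
    d = len(n)
    eps = [nk ** (-1.0 / (2 * d + 1)) for nk in n]
    gam = list(eps)
    term_eps = sum(e * nk for e, nk in zip(eps, n))      # sum_k eps(k) n(k)
    term_gam = sum(g * nk for g, nk in zip(gam, n))      # sum_k gamma(k) n(k)
    term_prod = prod(1.0 / (e * g) for e, g in zip(eps, gam))  # prod_k 1/(eps gamma)
    return term_eps, term_gam, term_prod


# One dimension, n = 10^6: every term is about n^(2/3) = 10^4.
print(exponent_terms([10**6]))
# Two dimensions with equal sample sizes: the sums are d * n^(4/5), the product is n^(4/5).
print(exponent_terms([10**5, 10**5]))
```

When all sample sizes are equal to $n$ , each term is of order $n^{2d/(2d+1)}$ , so no single term dominates the exponent.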

E.6 Optimal sample complexity for KL divergence

In this section we study the connection between optimal estimation of KL divergence and approximate PML $d$ -distributions. We restate the theorem of [ADOS16], used earlier for one-dimensional PML, in terms of the higher-dimensional case.

Theorem E.19 (Theorem 4 of [ADOS16]).

For a symmetric property f, suppose there is an estimator $\hat{\textbf{f}}:\Phi^{\textbf{n}}\rightarrow\mathbb{R}$ , such that for any $d$ -distribution p and observed $d$ -profile $\phi$ ,

[TABLE]

any $\beta$ -approximate PML distribution satisfies:

[TABLE]

Let p be a $2$ -distribution, meaning it is $2$ -dimensional with two distributions $\textbf{p}(1)$ and $\textbf{p}(2)$ . Let $B$ be such that, $\forall x\in\mathcal{D}$ , $\frac{\textbf{p}(1)_{x}}{\textbf{p}(2)_{x}}\leq B$ . We next define two conditions under which we get the optimal sample complexity for estimating the KL divergence of distributions $\textbf{p}(1)$ and $\textbf{p}(2)$ .

$\bullet$ C1: the estimation error $\epsilon$ satisfies $\epsilon>\frac{\log^{3}N}{N}$ .

$\bullet$ C2: $B\leq\epsilon^{2.24}N^{0.24}$ .

Lemma E.20 (Theorem 5 of [Ach18]).

Suppose C1 and C2 hold. Let $\alpha>0$ be a fixed (small) constant. There are constants $c_{1}$ and $c_{2}$ such that if $\textbf{n}=(\textbf{n}(1),\textbf{n}(2))$

[TABLE]

Given $\textbf{n}(1)$ independent samples $\textbf{y}^{\textbf{n}{(1)}}$ from distribution $\textbf{p}(1)$ and $\textbf{n}(2)$ independent samples $\textbf{y}^{\textbf{n}(2)}$ from distribution $\textbf{p}(2)$ , there exists an estimator $\hat{f}$ for estimating KL divergence $KL(\textbf{p}(1),\textbf{p}(2))$ that satisfies,

[TABLE]

Theorem E.21 ([Das],[BPA97]).

Let $d>1$ , and $\textbf{n}=(\textbf{n}{(1)},\dots,\textbf{n}{(d)})$ . The number of $d$ -profiles of $d$ -length equal to n is upper bounded by

[TABLE]

See 3.7

Proof.

Invoking Lemma E.20 with $\alpha=0.01$ and Theorem E.19 with $\delta=\exp\left(-2\epsilon^{2}\min\{\textbf{n}(1),\textbf{n}(2)\}^{0.98}\right)$ , we get:

[TABLE]

In the first inequality we use Theorem E.21. ∎

Appendix F Remaining proofs for multidimensional PML

F.1 Minimum Probability

In this section we provide the proof of our first technical lemma, which states that one can assume the minimum non-zero probability of the PML distribution is $\Omega(\frac{1}{\textbf{n}(k^{\prime})^{2}})$ by only losing a constant factor in the PML objective value. To show such a result we use an independent rounding algorithm described in the lemma below.

Claim F.1.

For any non-negative and non-zero $d$ -vector v and a $d$ -profile $\phi\in\Phi^{n}$ ,

[TABLE]

Proof.

[TABLE]

∎

For notational convenience we need the following definition of K-profile maximum likelihood $d$ -distribution.

Definition F.2.

For any set $\textbf{K}\subset[1,d]$ , $d$ -distribution r and profile $\phi\in\Phi^{\textbf{n}}$ , the $(\textbf{K},\textbf{r})$ -profile maximum likelihood $d$ -distribution, denoted by $\textbf{p}^{*}_{\textbf{K},\textbf{r},\phi}$ , is

[TABLE]

Lemma F.3.

For any set $\textbf{K}\subset[1,d]$ , $d$ -distribution r, index $k^{\prime}\notin\textbf{K}$ and profile $\phi\in\Phi^{\textbf{n}}$ , there exists a $d$ -distribution $\textbf{p}^{\prime\prime}\in\Delta^{\mathcal{D},d}$ such that,

[TABLE]

Proof.

We use independent rounding to show the existence of such a solution. For notational convenience let $\textbf{p}^{*}=\textbf{p}^{*}_{\textbf{K},\textbf{r},\phi}$ and for $k^{\prime}\in[1,d]$ define $\textbf{S}_{k^{\prime}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{x\in\mathcal{D}~{}|~{}\textbf{p}^{*}(k^{\prime})_{x}<\frac{1}{\textbf{n}(k^{\prime})^{2}}\}$ ; we fix all the probability values in this set next.

For all $x\in\textbf{S}_{k^{\prime}}$ define a random variable $Y_{x}$ as follows:

[TABLE]

Clearly $\forall x\in\textbf{S}_{k^{\prime}}$ ,

[TABLE]

and in general for any integer power $i$ of random variable $Y_{x}$ we have:

[TABLE]

For the remaining $x\in{\bar{\textbf{S}}}_{k^{\prime}}$ ( ${\bar{\textbf{S}}}_{k^{\prime}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\mathcal{D}\backslash\textbf{S}_{k^{\prime}}$ ) with $\textbf{p}^{*}(k^{\prime})_{x}\geq\frac{1}{\textbf{n}(k^{\prime})^{2}}$ we define:

[TABLE]

Define $\textbf{Y}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}(Y_{x})_{x\in\textbf{S}_{k^{\prime}}}$ and $\textbf{Z}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}(Z_{x})_{x\in{\bar{\textbf{S}}}_{k^{\prime}}}$ .

[TABLE]

Define p as follows:

[TABLE]

where $(\textbf{Y},\textbf{Z})$ is the concatenation of random vectors Y and Z. All random variables $Y_{x},Z_{x}$ are mutually independent and we have:

[TABLE]

(From Equations 67 and 68 and the fact that $Z_{x}$ is a constant.)

We have a lower bound on the expected value of $\mathbb{P}(\textbf{p},\phi)$ but this is misleading since p may not be a $d$ -distribution as $\|\textbf{p}(k^{\prime})\|_{1}$ could be greater than 1. Scaling norm of $\textbf{p}(k^{\prime})$ to 1 could significantly reduce the value of $\mathbb{P}(\textbf{p},\phi)$ if $\|\textbf{p}(k^{\prime})\|_{1}$ is large. However, we show that a constant fraction of the expectation of $\mathbb{P}(\textbf{p},\phi)$ comes from the sample space with bounded $\|\textbf{p}(k^{\prime})\|_{1}\leq 1+\frac{c}{\textbf{n}(k^{\prime})}$ . Here $c$ is a constant and assume $c\geq 3$ . Note that:

[TABLE]

The last inequality follows because Z is a constant random vector.

[TABLE]

To argue that a constant fraction of the expectation comes from the sample space with small $\|\textbf{p}\|_{1}$ we need a tight upper bound for:

[TABLE]

For $t\geq c$ , we first upper bound the probability term:

[TABLE]

We will use Chernoff bounds here and to apply them, we convert the $Y_{x}$ random variables into $\{0,1\}$ Bernoulli random variables. Define $\forall x\in\textbf{S}_{k^{\prime}}$ ,

[TABLE]

Equivalently:

[TABLE]

Define $\textbf{Y}^{\prime}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}(Y^{\prime}_{x})_{x\in\textbf{S}_{k^{\prime}}}$ and $\mu^{\prime}_{\textbf{S}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\mathbb{E}\left[\|\textbf{Y}^{\prime}\|_{1}\right]=\textbf{n}(k^{\prime})^{2}\mu_{\textbf{S}}\leq\textbf{n}(k^{\prime})^{2}$ . For any $t>0$ ,

[TABLE]

Since $\|\textbf{Y}^{\prime}\|_{1}$ is a sum of Bernoulli random variables, by Chernoff bounds:

[TABLE]

Note that $\|\textbf{p}(k)\|_{1}=1$ for all $k\neq k^{\prime}$ ; further applying Claim F.1 we get:

[TABLE]

Substituting back in Equation 69 we have (for $c\geq 3$ ),

[TABLE]

The above inequality implies existence of a $\textbf{p}^{\prime}$ with $\mathbb{P}(\textbf{p}^{\prime},\phi)\geq\frac{1}{4}\mathbb{P}(\textbf{p}^{*},\phi)$ and $\|\textbf{p}^{\prime}\|_{1}\leq 1+\frac{c}{\textbf{n}(k^{\prime})}$ . Define $\textbf{p}^{\prime\prime}$ ,

[TABLE]

The above inequality further implies,

[TABLE]

In the final inequality substitute $c=3$ and observe $\frac{\exp\left(-c\right)}{4}\geq\exp\left(-6\right)$ . Also, our rounding procedure always ensures that the minimum non-zero entry of $\textbf{p}^{\prime}$ is $\geq\frac{1}{\textbf{n}(k^{\prime})^{2}}$ , which further implies that the minimum non-zero probability value of $\textbf{p}^{\prime\prime}$ is at least $\frac{1}{\textbf{n}(k^{\prime})^{2}}\frac{1}{\|\textbf{p}^{\prime}\|_{1}}\geq\frac{1}{\textbf{n}(k^{\prime})^{2}}\frac{1}{1+3/\textbf{n}(k^{\prime})}\geq\frac{1}{2\textbf{n}(k^{\prime})^{2}}$ . Hence $\textbf{p}^{\prime\prime}$ is our final distribution satisfying the conditions of the lemma. ∎
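The independent rounding used in this proof can be sketched as follows. The sketch assumes, based on the Bernoulli rescaling $Y^{\prime}_{x}$ in the proof, that each small probability $\textbf{p}^{*}(k^{\prime})_{x}<\frac{1}{\textbf{n}(k^{\prime})^{2}}$ is rounded up to $\frac{1}{\textbf{n}(k^{\prime})^{2}}$ with probability $\textbf{n}(k^{\prime})^{2}\,\textbf{p}^{*}(k^{\prime})_{x}$ and to $0$ otherwise, and that the remaining entries are kept unchanged; the defining equations are not reproduced here, so both points are assumptions. The function name is hypothetical.

```python
import random


def round_small_probabilities(p, n, rng):
    """Independently round entries below 1/n^2 to either 0 or 1/n^2.

    p   : list of probabilities for one coordinate k'.
    n   : number of samples n(k') for that coordinate.
    rng : a random.Random instance.
    The expectation of every entry is preserved.
    """
    threshold = 1.0 / n ** 2
    rounded = []
    for px in p:
        if px < threshold:
            # Round up with probability n^2 * px, otherwise round to zero.
            rounded.append(threshold if rng.random() < px * n ** 2 else 0.0)
        else:
            rounded.append(px)  # entries at or above the threshold stay fixed
    return rounded


rng = random.Random(0)
p = [0.5, 0.4999, 0.00005, 0.00005]
out = round_small_probabilities(p, n=10, rng=rng)
print(out, sum(out))
```

The output need not sum to 1, which is why the proof controls $\|\textbf{p}(k^{\prime})\|_{1}$ with a Chernoff bound before renormalizing.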

See E.1

Proof.

The lemma follows by induction, invoking Lemma F.3 at each step.

Induction statement: For $i\in[1,d]$ , let $\textbf{p}^{(i)}$ be a $d$ -distribution that satisfies $\min_{x\in\mathcal{D}:\textbf{p}^{(i)}(k)_{x}\neq 0}\textbf{p}^{(i)}(k)_{x}\geq\frac{1}{2\textbf{n}(k)^{2}}$ for all $k\leq i$ and is an $\exp\left(-6i\right)$ -approximate PML $d$ -distribution.

Base Case: Apply Lemma F.3 by setting $\textbf{K}=\{\}$ , the empty set, $\textbf{r}=\textbf{p}_{pml,\phi}$ and $k^{\prime}=1$ . Note that $\textbf{p}^{*}_{\textbf{K},\textbf{r},\phi}=\textbf{p}_{pml,\phi}$ and the $d$ -distribution returned by Lemma F.3 is an $\exp\left(-6i\right)$ -approximate PML $d$ -distribution for $i=1$ .

Induction step for $i+1$ : Apply Lemma F.3 by setting $\textbf{K}=[1,i]$ , $\textbf{r}=\textbf{p}^{(i)}$ and $k^{\prime}=i+1$ . Note that $\mathbb{P}(\textbf{p}^{*}_{\textbf{K},\textbf{r},\phi},\phi)\geq\mathbb{P}(\textbf{p}^{(i)},\phi)\geq\exp\left(-6i\right)\mathbb{P}(\textbf{p}_{pml,\phi},\phi)$ (by the induction hypothesis), and the $d$ -distribution $\textbf{p}^{(i+1)}$ returned by Lemma F.3 further satisfies $\mathbb{P}(\textbf{p}^{(i+1)},\phi)\geq\exp\left(-6\right)\mathbb{P}(\textbf{p}^{*}_{\textbf{K},\textbf{r},\phi},\phi)\geq\exp\left(-6(i+1)\right)\mathbb{P}(\textbf{p}_{pml,\phi},\phi)$ and is therefore an $\exp\left(-6(i+1)\right)$ -approximate PML $d$ -distribution. Also by Lemma F.3, $\textbf{p}^{(i+1)}(k)=\textbf{p}^{(i)}(k)$ for all $k\leq i$ and $\min_{x\in\mathcal{D}:\textbf{p}^{(i+1)}(i+1)_{x}\neq 0}\textbf{p}^{(i+1)}(i+1)_{x}\geq\frac{1}{2\textbf{n}(i+1)^{2}}$ . Combining everything, the induction statement holds for $i+1$ .

Set $\textbf{p}^{\prime\prime}=\textbf{p}^{(d)}$ . By induction the statement holds for $i=d$ and the lemma follows. ∎

F.2 Eigenvalue bounds for Gram matrix

Here we provide a lower bound for the minimum eigenvalue of an invertible Gram matrix. First, in Lemma F.4 we provide an explicit expression for the trace of the inverse of a Gram matrix. Then, leveraging that $\lambda_{\min}(\textbf{G})\geq 1/\mathrm{tr}(\textbf{G}^{-1})$ , we obtain Corollary F.5, our desired lower bound.

Lemma F.4.

For an invertible Gram matrix $\textbf{G}\in\mathbb{R}^{d\times d}$ of a set of vectors $\textbf{v}_{1},\dots,\textbf{v}_{d}\in\mathbb{R}^{\textbf{b}}$ ,

[TABLE]

where $\tilde{\textbf{v}}_{k}$ is the orthogonal projection of $\textbf{v}_{k}$ onto $span(\textbf{v}_{1},\dots,\textbf{v}_{k-1},\textbf{v}_{k+1},\dots,\textbf{v}_{d})^{\perp}$ .

Proof.

Recall,

[TABLE]

Let $\textbf{V}\in\mathbb{R}^{\textbf{b}\times d}$ be the matrix with columns $\textbf{v}_{1},\dots,\textbf{v}_{d}$ . For each $k\in[1,d]$ we next give explicit formula for scalar $(\textbf{G}^{-1})_{kk}$ . Let $\textbf{V}_{k}\in\mathbb{R}^{\textbf{b}\times(d-1)}$ be the matrix with $k$ th column removed from matrix V. From the definition of $\textbf{G}^{-1}$ and for all $k\in[1,d]$ , the $k$ ’th diagonal entry of $\textbf{G}^{-1}$ is given by:

[TABLE]

Using Theorem (3) combined with Equation (3.2) in [Rot] we get,

[TABLE]

The lemma statement follows by substituting value of $(\textbf{G}^{-1})_{kk}$ in Equation 72. ∎

Corollary F.5.

For an invertible Gram matrix $\textbf{G}\in\mathbb{R}^{d\times d}$ of a set of vectors $\textbf{v}_{1},\dots,\textbf{v}_{d}\in\mathbb{R}^{\textbf{b}}$ ,

[TABLE]

where $\tilde{\textbf{v}}_{k}$ is the orthogonal projection of $\textbf{v}_{k}$ onto $span(\textbf{v}_{1},\dots,\textbf{v}_{k-1},\textbf{v}_{k+1},\dots,\textbf{v}_{d})^{\perp}$ .
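For two vectors the quantities in this subsection can be computed by hand, which gives a quick numerical check. The sketch below assumes that Lemma F.4 takes the standard form $\mathrm{tr}(\textbf{G}^{-1})=\sum_{k}1/\|\tilde{\textbf{v}}_{k}\|_{2}^{2}$ (the displayed equation is not reproduced here) and verifies it, together with $\lambda_{\min}(\textbf{G})\geq 1/\mathrm{tr}(\textbf{G}^{-1})$ , in the $2\times 2$ case. The function names are hypothetical.

```python
from math import isclose, sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_2x2_quantities(v1, v2):
    """Return (tr(G^-1), sum_k 1/||v~_k||^2, lambda_min(G)) for G = Gram(v1, v2)."""
    a, c, d = dot(v1, v1), dot(v1, v2), dot(v2, v2)
    det = a * d - c * c
    trace_inv = (a + d) / det  # diagonal of G^-1 is (d/det, a/det)
    # Squared norm of v1 after projecting out v2, and of v2 after projecting out v1.
    r1 = a - c * c / d
    r2 = d - c * c / a
    lam_min = ((a + d) - sqrt((a - d) ** 2 + 4 * c * c)) / 2
    return trace_inv, 1 / r1 + 1 / r2, lam_min


trace_inv, proj_sum, lam_min = gram_2x2_quantities((1.0, 0.0, 0.0), (1.0, 1.0, 0.0))
print(trace_inv, proj_sum, lam_min)
```

In this example both expressions for the trace equal 3, and the minimum eigenvalue (about 0.38) is indeed at least $1/3$ .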

F.3 Singular value lower bound for constraint matrix

Here we show a lower bound for the minimum singular value of our constraint matrix A for multidimensional PML. First, in Lemma F.6, we give a lower bound on the norm of the orthogonal projection of each column onto the orthogonal complement of the span of the remaining columns for the $d$ -level set matrix $\zeta$ (defined in Section E.2). This result combined with Corollary F.5 gives a lower bound on the minimum singular value of $\zeta$ . Then, in Lemma F.8, we lower bound the minimum singular value of A in terms of the minimum singular value of $\zeta$ to achieve our desired lower bound.

Now, recall that P is the set of all vectors $x\in\mathbb{R}^{d}$ where $x(k)=(1+\epsilon(k))^{1-j}$ for some $j\in[1,\textbf{b}_{k}]$ , where $\textbf{b}_{k}$ for each $k\in[1,d]$ is such that $(1+\epsilon(k))^{1-\textbf{b}_{k}}\leq\frac{1}{2\textbf{n}(k)^{2}}$ and $\textbf{b}=\prod_{k=1}^{d}\textbf{b}_{k}$ . Further, the $d$ -level set matrix $\zeta\in\mathbb{R}^{\textbf{b}\times d}$ is defined as the matrix whose rows are exactly the elements of P.
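The grid P and the $d$ -level set matrix $\zeta$ can be built directly from this description. The sketch below assumes that $\textbf{b}_{k}$ is the smallest integer satisfying the stated inequality (the text only requires that the inequality holds), and the function names are hypothetical.

```python
from itertools import product


def grid_size(eps, n):
    """Smallest b_k with (1 + eps)^(1 - b_k) <= 1 / (2 n^2) (assumed choice of b_k)."""
    b = 1
    while (1 + eps) ** (1 - b) > 1.0 / (2 * n * n):
        b += 1
    return b


def level_set_matrix(eps, n):
    """Rows of zeta: all d-tuples whose k-th entry is (1 + eps[k])^(1 - j), j in [1, b_k]."""
    axes = [
        [(1 + e) ** (1 - j) for j in range(1, grid_size(e, nk) + 1)]
        for e, nk in zip(eps, n)
    ]
    return list(product(*axes))


# Toy parameters (hypothetical): d = 2, eps = (1, 1), n = (2, 2), so b_k = 4 and b = 16.
zeta = level_set_matrix([1.0, 1.0], [2, 2])
print(len(zeta), zeta[0], zeta[-1])
```

Because the rows form a Cartesian product, fixing all coordinates except $k$ yields blocks of $\textbf{b}_{k}$ rows on which the other columns are constant, which is the block structure exploited in Lemma F.6.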

Lemma F.6.

For $\zeta\in\mathbb{R}^{\textbf{b}\times d}$ and $k\in[1,d]$ , if $\zeta(k)$ is its $k$ ’th column, then the following inequality holds,

[TABLE]

where $\tilde{\zeta}(k)$ is the orthogonal projection of $\zeta(k)$ onto $span(\zeta(1),\dots,\zeta(k-1),\zeta(k+1),\dots,\zeta(d))^{\perp}$ .

Proof.

For each index $k\in[1,d]$ , there are multiple blocks each of size $\textbf{b}_{k}$ and for each $k_{i}$ th block $I_{k_{i}}\subset[1,\textbf{b}]$ and $k^{\prime}\in[1,d]$ and $k^{\prime}\neq k$ ,

[TABLE]

for each scalar $c_{k^{\prime}}\in\{\frac{1}{2\textbf{n}(k^{\prime})^{2}},\dots\frac{1}{2\textbf{n}(k^{\prime})^{2}}(1+\epsilon(k^{\prime}))^{i},\dots 1\}$ and the number of blocks satisfying above equalities is equal to $\prod_{k^{\prime}\in[1,d]|k^{\prime}\neq k}\textbf{b}_{k^{\prime}}$ .

Note that $span(\zeta_{I_{k_{i}}}(1),\dots,\zeta_{I_{k_{i}}}(k-1),\zeta_{I_{k_{i}}}(k+1),\dots,\zeta_{I_{k_{i}}}(d))^{\perp}$ is the same as $span(\mathrm{1}_{\textbf{b}_{k}})^{\perp}$ , and if $\tilde{\zeta}_{I_{k_{i}}}(k)$ is the orthogonal projection of $\zeta_{I_{k_{i}}}(k)$ onto $span(\zeta_{I_{k_{i}}}(1),\dots,\zeta_{I_{k_{i}}}(k-1),\zeta_{I_{k_{i}}}(k+1),\dots,\zeta_{I_{k_{i}}}(d))^{\perp}=span(\mathrm{1}_{\textbf{b}_{k}})^{\perp}$ , then:

[TABLE]

The above result combined with number of such blocks gives:

[TABLE]

∎

Corollary F.7.

The minimum eigenvalue of matrix $\zeta^{T}\zeta$ is at least $\Omega(\textbf{b}\frac{1}{\sum_{k\in[1,d]}\log^{2}\textbf{n}(k)})$ .

Now let us consider our constraint matrix $\textbf{A}\in\mathbb{R}^{(\textbf{e}+1+d)\times\textbf{b}\cdot(\textbf{e}+1)}$ for multidimensional PML (the matrix A is sparse, and a matrix-vector product with it can be computed in time $O(\textbf{b}\cdot\textbf{e})$ ):

[TABLE]

Lemma F.8.

The eigenvalues of matrix $\textbf{A}\textbf{A}^{\top}$ are at least $\Omega(\frac{\textbf{b}}{\textbf{e}})$ .

Proof.

Direct calculation shows that if $\vec{1}_{\textbf{e}}\in\mathbb{R}^{\textbf{e}}$ and $\vec{1}_{\textbf{b}}\in\mathbb{R}^{\textbf{b}}$ are the e-dimensional and b-dimensional all-ones vectors respectively, and $\textbf{I}_{\textbf{e}}\in\mathbb{R}^{\textbf{e}\times\textbf{e}}$ is the e-dimensional identity matrix, then for all $x\in\mathbb{R}^{\textbf{e}}$ and $\alpha\in\mathbb{R}^{d}$ we have

[TABLE]

Consequently $v=(x,\alpha)^{T}$ is an eigenvector of $\textbf{A}\textbf{A}^{\top}$ with eigenvalue $\lambda$ if and only if

[TABLE]

Now if $x\perp\vec{1}_{\textbf{e}}$ then we see that $v$ is an eigenvector if and only if $\alpha\perp\zeta^{T}\vec{1}_{\textbf{b}}$ , in which case the eigenvalue is b. On the other hand, if $x=\vec{1}_{\textbf{e}}$ then we see that $v$ is an eigenvector with eigenvalue $\lambda$ if and only if

[TABLE]

When this happens we either have $\lambda\geq(\textbf{e}+1)\lambda_{\min}(\zeta^{T}\zeta)$ or in the case of $\lambda<(\textbf{e}+1)\lambda_{\min}(\zeta^{T}\zeta)$ the following holds,

[TABLE]

To simplify the expression above, let the following be the SVD for $\zeta$ ,

[TABLE]

where $\sigma_{1}\leq\sigma_{2}\dots\leq\sigma_{d}$ are singular values and $\sigma_{1}^{2}=\lambda_{\min}(\zeta^{T}\zeta)$ . In this notation the eigenvalue decomposition of matrix $\zeta(\lambda\textbf{I}_{d}-(\textbf{e}+1)\zeta^{T}\zeta)^{-1}\zeta^{T}$ is equal to:

[TABLE]

Further, we can write a closed-form expression for $\lambda$ in terms of the singular values and left singular vectors of the matrix $\zeta$ .

[TABLE]

We use $\textbf{h}(\lambda)$ to denote the expression on the left hand side,

[TABLE]

We know that $\lambda\geq 0$ because $\textbf{A}\textbf{A}^{T}$ is PSD. For $\lambda\in[0,(\textbf{e}+1)\lambda_{\min}(\zeta^{T}\zeta))$ , $\textbf{h}(\lambda)>0$ and is strictly increasing in $\lambda$ . Further Equation 73 has a unique solution $\lambda^{*}$ (if a solution exists) in the interval $[0,(\textbf{e}+1)\lambda_{\min}(\zeta^{T}\zeta))$ .

To lower bound $\lambda^{*}$ , it suffices to find a $\lambda$ such that $\textbf{h}(\lambda)<\textbf{b}$ ; because $\textbf{h}$ is strictly increasing on this interval, any such $\lambda$ satisfies $\lambda^{*}\geq\lambda$ . For $\lambda=\min(\frac{1}{2}\sigma_{1}^{2},\frac{\textbf{b}}{2(2\textbf{e}+1)})$ , we have $\textbf{e}\sum_{k=1}^{d}\frac{\sigma_{k}^{2}(\textbf{u}_{k}^{T}\vec{1}_{\textbf{b}})^{2}}{(\textbf{e}+1)\sigma_{k}^{2}-\lambda}\leq\textbf{b}\frac{\textbf{e}}{\textbf{e}+1/2}$ , which combined with $\lambda\leq\frac{\textbf{b}}{2(2\textbf{e}+1)}$ gives $\textbf{h}(\lambda)<\textbf{b}$ and therefore $\lambda^{*}\geq\min(\frac{1}{2}\sigma_{1}^{2},\frac{\textbf{b}}{2(2\textbf{e}+1)})$ . Combining all cases, we have that $\lambda_{min}(\textbf{A}\textbf{A}^{T})\geq\min(\textbf{b},(\textbf{e}+1)\lambda_{min}(\zeta^{T}\zeta),\frac{1}{2}\sigma_{1}^{2},\frac{\textbf{b}}{2(2\textbf{e}+1)})$ . Combined with Corollary F.7 we have our result.

∎
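The last step of the proof can be unpacked as a short calculation. This is a sketch under stated assumptions rather than the paper's own derivation: the display defining $\textbf{h}$ is not reproduced here, so we assume $\textbf{h}(\lambda)=\lambda+\textbf{e}\sum_{k=1}^{d}\frac{\sigma_{k}^{2}(\textbf{u}_{k}^{T}\vec{1}_{\textbf{b}})^{2}}{(\textbf{e}+1)\sigma_{k}^{2}-\lambda}$, which is the form consistent with the bounds quoted in the proof. We also assume $\textbf{e}>0$, that $\vec{1}_{\textbf{b}}$ is the all-ones vector in $\mathbb{R}^{\textbf{b}}$, and that the $\textbf{u}_{k}$ are orthonormal.

```latex
% Assumed: h(\lambda) = \lambda + e \sum_k \sigma_k^2 (u_k^T 1_b)^2 / ((e+1)\sigma_k^2 - \lambda),
% with e > 0, \|1_b\|^2 = b, and orthonormal u_k.
\begin{align*}
0 \le \lambda \le \tfrac{1}{2}\sigma_1^2 \le \tfrac{1}{2}\sigma_k^2
  &\;\Longrightarrow\;
  (\mathbf{e}+1)\sigma_k^2 - \lambda \;\ge\; \bigl(\mathbf{e}+\tfrac{1}{2}\bigr)\sigma_k^2, \\
\mathbf{e}\sum_{k=1}^{d}
  \frac{\sigma_k^2\,(\mathbf{u}_k^{T}\vec{1}_{\mathbf{b}})^2}{(\mathbf{e}+1)\sigma_k^2-\lambda}
  &\;\le\; \frac{\mathbf{e}}{\mathbf{e}+1/2}\sum_{k=1}^{d}(\mathbf{u}_k^{T}\vec{1}_{\mathbf{b}})^2
  \;\le\; \frac{\mathbf{e}}{\mathbf{e}+1/2}\,\|\vec{1}_{\mathbf{b}}\|_2^2
  \;=\; \frac{2\mathbf{e}}{2\mathbf{e}+1}\,\mathbf{b}, \\
\mathbf{h}(\lambda)
  &\;\le\; \frac{\mathbf{b}}{2(2\mathbf{e}+1)} + \frac{2\mathbf{e}}{2\mathbf{e}+1}\,\mathbf{b}
  \;=\; \frac{4\mathbf{e}+1}{4\mathbf{e}+2}\,\mathbf{b}
  \;<\; \mathbf{b}.
\end{align*}
```

The second inequality in the middle line uses that the $\textbf{u}_{k}$ are orthonormal, so the squared projections of $\vec{1}_{\textbf{b}}$ onto them sum to at most $\|\vec{1}_{\textbf{b}}\|_{2}^{2}=\textbf{b}$.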

Bibliography

[Ach18] Jayadev Acharya. Profile maximum likelihood is optimal for estimating KL divergence. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1400–1404, 2018.
[ADM+10] J. Acharya, H. Das, H. Mohimani, A. Orlitsky, and S. Pan. Exact calculation of pattern probabilities. In 2010 IEEE International Symposium on Information Theory, pages 1498–1502, June 2010.
[ADOS16] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for optimal distribution property estimation. CoRR, abs/1611.02960, 2016.
[AOST14] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. The complexity of estimating Rényi entropy. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, 2014.
[BPA97] D. P. Bhatia, M. A. Prasad, and D. Arora. Asymptotic results for the number of multidimensional partitions of an integer and directed compact lattice animals. Journal of Physics A Mathematical General, 30:2281–2285, April 1997.
[BZLV16] Y. Bu, S. Zou, Y. Liang, and V. V. Veeravalli. Estimation of KL divergence between large-alphabet distributions. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 1118–1122, July 2016.
[Das] Hirakendu Das. Competitive tests and estimators for properties of distributions. Ph.D. dissertation, UCSD, 2012. https://pqdtopen.proquest.com/doc/1009080587.html?FMT=ABS
[ET76] Bradley Efron and Ronald Thisted. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3):435–447, 1976.
