Towards Testing Monotonicity of Distributions Over General Posets

Maryam Aliakbarpour; Themis Gouleakis; John Peebles; Ronitt Rubinfeld; Anak Yodpinyanee

arXiv:1907.03182·cs.DS·July 9, 2019


TL;DR

This paper investigates the sample complexity of testing distribution monotonicity over general posets, introducing a new property called bigness, and establishing lower bounds and tools for upper bounds in various poset structures.

Contribution

It introduces the concept of bigness for distributions, derives lower bounds for testing monotonicity, and provides tools for analyzing upper bounds in general posets.

Findings

1. Lower bound of Ω(n/log n) for testing bigness.

2. Lower bounds for testing monotonicity over specific posets.

3. Sublinear sample complexity bounds for certain cases.

Abstract

In this work, we consider the sample complexity required for testing the monotonicity of distributions over partial orders. A distribution $p$ over a poset is monotone if, for any pair of domain elements $x$ and $y$ such that $x\preceq y$ , $p(x)\leq p(y)$ . To understand the sample complexity of this problem, we introduce a new property called bigness over a finite domain, where the distribution is $T$ -big if the minimum probability for any domain element is at least $T$ . We establish a lower bound of $\Omega(n/\log n)$ for testing bigness of distributions on domains of size $n$ . We then build on these lower bounds to give $\Omega(n/\log n)$ lower bounds for testing monotonicity over a matching poset of size $n$ and significantly improved lower bounds over the hypercube poset. We give sublinear sample complexity bounds for testing bigness and for testing monotonicity over the matching…

Equations

p = \frac{1}{n}(V_{1}, V_{2}, \dots, V_{n}).

\begin{array}{lll}\text{Definition of }\mathbf{OP1}: & \sup & \frac{1}{\beta}\Pr[V^{\prime}=0] \\ & \text{s.t.} & \mathbf{E}[V]=\mathbf{E}[V^{\prime}]=1 \\ && \mathbf{E}[V^{j}]=\mathbf{E}[{V^{\prime}}^{j}]\quad\text{for }j=1,2,\ldots,L \\ && V\in\left[\frac{1+\nu}{\beta},\frac{\lambda}{\beta}\right],\ V^{\prime}\in\{0\}\cup\left[\frac{1+\nu}{\beta},\frac{\lambda}{\beta}\right]\text{ and }\beta>0.\end{array}

0 < \nu \leq \frac{1}{2},\quad \lambda > 1+\nu,\quad 1 \leq \beta \leq \min\left\{\frac{1}{\epsilon}, \lambda\right\},\quad\text{and}\quad L = O(\log n).

p = \frac{1}{n}(V_{1}, V_{2}, \dots, V_{n}),\qquad p^{\prime} = \frac{1}{n}(V_{1}^{\prime}, V_{2}^{\prime}, \dots, V_{n}^{\prime})

\mathrm{OPT}(\mathbf{OP1}) = \left(\frac{1}{\sqrt{1+\nu}} - \frac{1}{\sqrt{\lambda}}\right)^{2}\left(\frac{\sqrt{\frac{\lambda}{1+\nu}} - 1}{\sqrt{\frac{\lambda}{1+\nu}} + 1}\right)^{L-2}.

E = \left\{\left|\sum_{i=1}^{n}\frac{V_{i}}{n} - 1\right| \leq \nu,\ \text{and}\ \sum_{i=1}^{n} N_{i} > s(1-\nu)/2\right\}.

E^{\prime} = \left\{\left|\sum_{i=1}^{n}\frac{V_{i}^{\prime}}{n} - 1\right| \leq \nu,\ r \geq \frac{\beta n d}{2},\ \text{and}\ \sum_{i=1}^{n} N_{i}^{\prime} > s(1-\nu)/2\right\}

d_{TV}\left(H_{E}, H^{\prime}_{E^{\prime}}\right) \leq \frac{2\lambda}{\beta n \nu^{2}} + \exp\left(-\frac{\beta n d}{8}\right) + 2\exp\left(-\frac{s(1-\nu)}{6}\right) + n\left(\frac{e s \lambda}{2 n L}\right)^{L}.

\nu := 1/2,\quad \lambda := (1+\nu)\cdot\left(\frac{4(L-2)}{\ln(1/(27\epsilon))} - 1\right)^{2},\quad\text{and}\quad s := \left\lfloor\frac{L n}{2 e \lambda}\right\rfloor

d = \frac{2}{3}\left(1 - \frac{1}{\rho}\right)^{2}\left(1 - \frac{2}{\rho+1}\right)^{L-2} > \frac{2}{27}\left(\frac{1}{e^{2}}\right)^{\frac{2(L-2)}{\rho+1}} \geq \frac{2}{27}\exp\left(-\frac{4(L-2)}{\rho+1}\right) \geq 2\epsilon.

\begin{array}{lll}\text{Definition of }\mathbf{LP2}: & \sup & \mathbf{E}\left[\frac{1}{X}\right]-\mathbf{E}\left[\frac{1}{X^{\prime}}\right] \\ & \text{s.t.} & \mathbf{E}[X^{j}]=\mathbf{E}[{X^{\prime}}^{j}]\quad\text{for }j=1,2,\ldots,L-1 \\ && X,X^{\prime}\in\left[1+\nu,\lambda\right]\end{array}

P_{V^{*}}(v)

P_{V^{\prime *}}(v)

\int_{-\infty}^{\infty} P_{V^{*}}(v)\,dv = \int_{1+\nu}^{\lambda}\frac{{\beta^{*}}^{2}}{x} P_{X^{*}}(x)\cdot\frac{1}{\beta^{*}}\,dx + \left(1 - \beta^{*}\mathbf{E}\left[\frac{1}{X^{*}}\right]\right) = \beta^{*}\mathbf{E}\left[\frac{1}{X^{*}}\right] + \left(1 - \beta^{*}\mathbf{E}\left[\frac{1}{X^{*}}\right]\right) = 1,

\mathbf{E}[V^{*}] = \int_{1+\nu}^{\lambda}\beta^{*} P_{X^{*}}(x)\cdot\frac{1}{\beta^{*}}\,dx = 1

\mathbf{E}[{V^{*}}^{j}] = \int_{1+\nu}^{\lambda}\frac{x^{j-1}}{{\beta^{*}}^{j-2}} P_{X^{*}}(x)\cdot\frac{1}{\beta^{*}}\,dx = \frac{1}{{\beta^{*}}^{j-1}}\mathbf{E}[{X^{*}}^{j-1}].

\mathbf{E}[{V^{*}}^{j}] = \frac{1}{{\beta^{*}}^{j-1}}\mathbf{E}[{X^{*}}^{j-1}] = \frac{1}{{\beta^{*}}^{j-1}}\mathbf{E}[{X^{\prime *}}^{j-1}] = \mathbf{E}[{V^{\prime *}}^{j}].

\mathrm{OPT}(\mathbf{OP1}) \geq \frac{1}{\beta^{*}}\Pr[V^{\prime *} = 0]

\frac{1}{\beta^{*}}\Pr[V^{\prime *} = 0] = \frac{1}{\beta^{*}}\left(1 - \beta^{*}\mathbf{E}\left[\frac{1}{X^{\prime *}}\right]\right) = \mathbf{E}\left[\frac{1}{X^{*}}\right] - \mathbf{E}\left[\frac{1}{X^{\prime *}}\right] = \mathrm{OPT}(\mathbf{LP2})

\mathrm{OPT}(\mathbf{OP1}) \geq \frac{1}{\beta^{*}}\Pr[V^{\prime *} = 0] = \mathrm{OPT}(\mathbf{LP2}).

P_{X}(x) := \frac{x}{\beta^{2}} P_{V}\left(\frac{x}{\beta}\right),\quad\text{and}\quad P_{X^{\prime}}(x) := \frac{x}{\beta^{2}} P_{V^{\prime}}\left(\frac{x}{\beta}\right).

\int_{-\infty}^{+\infty} P_{X}(x)\,dx = \int_{1+\nu}^{\lambda}\frac{x}{\beta^{2}}\cdot P_{V}\left(\frac{x}{\beta}\right)dx = \int_{(1+\nu)/\beta}^{\lambda/\beta}\frac{v}{\beta}\cdot P_{V}(v)\cdot\beta\,dv = \mathbf{E}[V] = 1

\mathbf{E}[X^{j}] = \int_{-\infty}^{+\infty} x^{j} P_{X}(x)\,dx = \int_{1+\nu}^{\lambda}\frac{x^{j+1}}{\beta^{2}}\cdot P_{V}\left(\frac{x}{\beta}\right)dx = \int_{(1+\nu)/\beta}^{\lambda/\beta}\frac{\beta^{j} v^{j+1}}{\beta}\cdot P_{V}(v)\cdot\beta\,dv = \beta^{j}\mathbf{E}[V^{j+1}].

\begin{aligned}\mathbf{E}\left[\frac{1}{X}\right] - \mathbf{E}\left[\frac{1}{X^{\prime}}\right] &= \int_{1+\nu}^{\lambda}\frac{1}{\beta^{2}}\cdot P_{V}\left(\frac{x}{\beta}\right)dx - \int_{1+\nu}^{\lambda}\frac{1}{\beta^{2}}\cdot P_{V^{\prime}}\left(\frac{x^{\prime}}{\beta}\right)dx^{\prime} \\ &= \int_{(1+\nu)/\beta}^{\lambda/\beta}\frac{1}{\beta^{2}}\cdot P_{V}(v)\cdot\beta\,dv - \int_{(1+\nu)/\beta}^{\lambda/\beta}\frac{1}{\beta^{2}}\cdot P_{V^{\prime}}(v^{\prime})\cdot\beta\,dv^{\prime} \\ &= \frac{1}{\beta}\cdot\left(\int_{(1+\nu)/\beta}^{\lambda/\beta} P_{V}(v)\,dv - \int_{(1+\nu)/\beta}^{\lambda/\beta} P_{V^{\prime}}(v^{\prime})\,dv^{\prime}\right) \\ &= \frac{1}{\beta}\cdot\left(1 - \left(1 - \Pr[V^{\prime} = 0]\right)\right) = \frac{1}{\beta}\Pr[V^{\prime} = 0].\end{aligned}


Taxonomy

Topics: Machine Learning and Algorithms · Complexity and Algorithms in Graphs · Optimization and Search Problems

Full text

Towards Testing Monotonicity of Distributions Over General Posets

Maryam Aliakbarpour

CSAIL, MIT

MA is supported by funds from the MIT-IBM Watson AI Lab (Agreement No. W1771646) and the NSF grants IIS-1741137 and CCF-1733808.

Themis Gouleakis

Max Planck Institute

TG is supported by the NSF grants CCF-1740751, CCF-1650733, CCF-1733808, and IIS-1741137. Part of this work was done while TG was a postdoctoral researcher at USC supported by Ilias Diakonikolas’ USC startup grant.

John Peebles

CSAIL, MIT

JP is supported by the NSF grants CCF-1565235, CCF-1650733, CCF-1733808, and IIS-1741137.

Ronitt Rubinfeld

CSAIL, MIT, TAU

RR is supported by funds from the MIT-IBM Watson AI Lab (Agreement No. W1771646) and the NSF grants CCF-1650733, CCF-1733808, IIS-1741137, and CCF-1740751.

Anak Yodpinyanee

CSAIL, MIT

AY is supported by the NSF grants CCF-1650733, CCF-1733808, IIS-1741137 and the DPST scholarship, Royal Thai Government. This work was completed while AY was at CSAIL, MIT.

Abstract

In this work, we consider the sample complexity required for testing the monotonicity of distributions over partial orders. A distribution $p$ over a poset is monotone if, for any pair of domain elements $x$ and $y$ such that $x\preceq y$ , $p(x)\leq p(y)$ .

To understand the sample complexity of this problem, we introduce a new property called bigness over a finite domain, where the distribution is $T$ -big if the minimum probability for any domain element is at least $T$ . We establish a lower bound of $\Omega(n/\log n)$ for testing bigness of distributions on domains of size $n$ . We then build on these lower bounds to give $\Omega(n/\log{n})$ lower bounds for testing monotonicity over a matching poset of size $n$ and significantly improved lower bounds over the hypercube poset.

We give sublinear sample complexity bounds for testing bigness and for testing monotonicity over the matching poset. We then give a number of tools for analyzing upper bounds on the sample complexity of the monotonicity testing problem.

Keywords: Property Testing; Monotone Distributions; Partially Ordered Sets

1 Introduction

We consider the problem of testing whether a distribution is monotone: an essential property that captures many observed phenomena of real-world probability distributions. For instance, monotone distributions over totally ordered sets might be used to describe distributions on diseases for which the probability of being affected by the disease increases with age. More generally, an important class of distributions is characterized by being monotone over a partially ordered set (poset). For these distributions, if a domain element $u$ lower bounds $v$ in the partial ordering (denoted $u\preceq v$ ), then $p(u)\leq p(v)$ (whereas if $u$ and $v$ are unrelated in the poset, then $p$ need not satisfy any particular requirement on the relative probabilities of $u$ and $v$ ). Such distributions might include distributions on diseases for which the probability of being affected increases with some combination of several risk factors. Many commonly studied distributions, e.g. exponential distributions or multivariate exponential distributions, are or can be approximated by piecewise monotone functions. As monotone distributions are a fundamental class of distributions, the problem of testing whether a distribution is monotone is a key building block for distribution testing algorithms.

Given an unknown distribution over a poset domain, the goal is to distinguish whether the distribution is monotone or far from any monotone distribution, using as few samples as possible. This problem was first considered in the work of [BKR04], which studied testing the monotonicity of distributions over totally ordered domains and over partially ordered domains corresponding to two-dimensional grids. The work of [BFRV10] introduced the study of testing the monotonicity of distributions over general partially ordered domains, and in particular, considered the Boolean hypercube ( $\{0,1\}^{d}$ ). Several other works considered these questions [DDS+13, ADK15, CDGR18] over various domains and achieved improved sample complexity bounds.

The sample complexity of the testing problem varies greatly with the structure of the poset: On the one hand, for domains of size $n$ that are total orders, $\Theta(\sqrt{n})$ samples suffice for distinguishing monotone distributions from those that are $\epsilon$ -far in total variation distance from any monotone distribution [BKR04, ADK15, CDGR18]. On the other hand, testing distributions defined over the matching poset requires a number of samples that is nearly linear in $n$ , specifically $\Omega(n^{1-o(1)})$ [BFRV10]. Furthermore, for a large class of familiar posets, such as the Boolean hypercubes, little is understood about the sample complexity of the testing problem.

Our results and approaches:

We first define a new property called the bigness property, which we use as our main building block for establishing sample complexity lower bounds for monotonicity testing. A distribution is $T$ -big if every domain element is assigned probability mass at least $T$ .

Though the bigness property is a symmetric property (i.e., permuting the labels of the elements does not change whether the distribution has the property or not), we use lower bounds for testing the bigness property in order to prove lower bounds on testing monotonicity, which is not a symmetric property. In addition, the bigness property is a natural property, and thus of interest in its own right.

We show that the sample complexity of the bigness testing problem is $\Theta(n/\log n)$ when $T=\Theta(1/n)$ . The upper bound follows from applying the algorithm of [VV17] that learns the underlying distribution up to a permutation of the domain elements. Our lower bound approach is inspired by the framework of [WY16a], used to lower bound the number of samples needed to estimate support sizes. Our lower bound is established by showing that the two distributions of samples, one generated from $T$ -big distributions ( $p$ ’s) and the other generated from distributions that are $\epsilon$ -far from $T$ -big ( $p^{\prime}$ ’s), are statistically close. In contrast with the standard lower bound framework, $p$ and $p^{\prime}$ are not picked from two sets of distributions. Instead, the distribution $p$ (resp. $p^{\prime}$ ) is constructed by having each domain element $i$ choose its probability $p(i)$ , in an i.i.d. fashion, from the distribution $P_{V}$ (resp. $P_{V^{\prime}}$ ) over possible probabilities in $[0,1]$ . To design $P_{V}$ and $P_{V^{\prime}}$ , we introduce a new optimization problem that maximizes $\epsilon$ while keeping the distributions of samples statistically close. This constraint is established via the moment matching technique, which allows us to show that the distributions are indistinguishable with $o(n/\log{n})$ samples, and which also plays a crucial role in many other settings [RRSS09, Val08, BFRV10, VV16, VV17, WY16a, WY16b].

By reducing from the bigness testing problem, we next give a lower bound of $\Omega(n/\log{n})$ on the sample complexity of the monotonicity testing problem over the matching poset, improving on the $\Omega\left(n/2^{\Theta(\sqrt{\log n})}\right)$ lower bound in [BFRV10]. In addition to improving the sample complexity lower bound, one particularly useful byproduct of our approach is that the maximum probability of an element in the constructed lower bound distribution families can be made small, which assists us in proving lower bounds for other posets in the following.

Finally, we leverage the lower bound for the monotonicity testing problem over the matching poset to prove a lower bound of $N^{1-\delta}$ for $\delta=\Theta(\sqrt{\epsilon})+o(1)$ for monotonicity testing over the Boolean hypercube of size $N=2^{d}$ , greatly improving upon the standard “Birthday Paradox” lower bound of $\Omega(\sqrt{N})$ . Our reduction follows from finding a large embedding of the matching poset in the hypercube, and its efficiency follows from the previously mentioned upper bound on the maximum element probability from the bigness lower bound construction above.

We then give a number of new tools for analyzing upper bounds on the sample complexity of the monotonicity testing problem:

1. We prove that the distance of a distribution to monotonicity can be characterized approximately as the weight of a maximum weighted matching in the transitive closure of the poset, where the weight of the edge $(u,v)$ is the amount of violation from being monotone: $\max(0,p(u)-p(v))$ . This characterization gives a structural result about distributions that are $\epsilon$ -far from monotone. Moreover, this result extends the work of [FLN+02] to non-boolean valued functions. The work of [FLN+02] shows that the distance of a boolean function $f$ to monotonicity is related to the number of “violating edges” in the transitive closure of the underlying poset.

2. Via the characterization above, we show that the monotonicity testing problem over bipartite posets (where all edges are directed in the same direction) captures the monotonicity testing problem in its full generality. That is, we give a reduction from monotonicity testing over any poset to monotonicity testing over a bipartite poset. Our reduction preserves the number of vertices and the distance parameter up to a constant multiplicative factor. As with the previous item, this result extends the work of [FLN+02] to non-boolean valued functions.

3. Leveraging the learning algorithms for symmetric distributions in [VV17], we propose algorithms with sample complexity $O(n/(\epsilon^{2}\log n))$ for testing bigness of a distribution, and for testing monotonicity on matching posets. The proof of our latter result requires certain subtle details: (1) an additional reduction that allows us to scale our distribution for “each side” of the matching, in order to generate sufficient samples from each side, as required by the algorithm of [VV17], and (2) technical lemmas establishing bounds between the total variation distance and the distance notion in [VV17], under the scaling mentioned earlier.

4. We give a reduction from monotonicity testing on a bipartite poset to monotonicity testing on the matching (for which the testing algorithm is constructed above). This reduction gives an algorithm for monotonicity testing on any bipartite poset (which is the most general problem, as argued earlier), in which the overhead in the sample complexity depends only on the maximum degree of the bipartite graph.

5. We give another upper bound for testing monotonicity on bipartite posets: $O((\log M)/\epsilon^{2})$ where $M$ is the number of “endpoint sets” of all possible matchings contained in the given bipartite graph (or equivalently, the number of induced subgraphs that admit a perfect matching over their respective vertex sets). Note that for the matching poset, $M=2^{n}$ yields an $O(n/\epsilon^{2})$ upper bound, and therefore for matching posets our previous algorithm is preferable. However, this bound yields an upper bound of $O(n/\epsilon^{2})$ for all posets, and could potentially be even smaller for certain classes of graphs, such as collections of large stars.

6. Finally, we give an upper bound of $O(\frac{n^{2/3}}{\epsilon}+\frac{1}{\epsilon^{2}})$ samples for monotonicity testing on bipartite posets, under the guarantee that the distribution being tested is a uniform distribution on some subset of known size of the domain. This special case is of interest in that it relates to the well studied problem of testing monotonicity of Boolean functions, in a somewhat different setting where instead of getting query access to the function, we are given uniform “positive” samples of domain elements $x$ for which $f(x)=1$ .
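The matching characterization from the first tool above can be made concrete on a very small poset. The following is a minimal sketch, not the authors' procedure: the function names are hypothetical, the maximum weight matching is found by exhaustive search (exponential time, so only suitable for toy inputs), and the returned weight is only the quantity that the text says approximates the distance to monotonicity, not the distance itself.

```python
from itertools import product


def transitive_closure(vertices, edges):
    """Return all pairs (u, v) with u strictly below v, given edges (u, v) meaning u ⪯ v."""
    reach = {u: set() for u in vertices}
    for u, v in edges:
        reach[u].add(v)
    # Warshall-style closure: k is the intermediate vertex (outer loop), i the source.
    for k, i in product(vertices, vertices):
        if k in reach[i]:
            reach[i] |= reach[k]
    return [(u, v) for u in vertices for v in reach[u] if u != v]


def max_weight_matching(weighted_edges):
    """Exhaustive search over matchings; only intended for tiny examples."""
    def best(idx, used):
        if idx == len(weighted_edges):
            return 0.0
        u, v, w = weighted_edges[idx]
        skip = best(idx + 1, used)
        if u in used or v in used:
            return skip
        return max(skip, w + best(idx + 1, used | {u, v}))
    return best(0, frozenset())


def violation_matching_weight(p, vertices, edges):
    """Weight of a maximum matching where edge (u, v) weighs max(0, p(u) - p(v))."""
    closure = transitive_closure(vertices, edges)
    weighted = [(u, v, p[u] - p[v]) for u, v in closure if p[u] > p[v]]
    return max_weight_matching(weighted)


# Toy example (an assumption for illustration): the line 1 ⪯ 2 ⪯ 3 with a decreasing p.
p = {1: 0.5, 2: 0.3, 3: 0.2}
weight = violation_matching_weight(p, [1, 2, 3], [(1, 2), (2, 3)])
```

In this toy example the closure contains the extra pair $(1,3)$, which carries the largest violation, so the best matching uses that single edge. This is why the characterization is stated over the transitive closure rather than over the edges of the poset alone.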

Other related work

Batu, Kumar, and Rubinfeld [BKR04] initiated the study of testing monotonicity of distributions. For the case where the domain is totally ordered, the sample complexity is known to be ${\Theta}(\sqrt{n})$ [BKR04, ADK15, CDGR18]. Several works have considered distributions over higher dimensional domains. In [BKR04, BFRV10], it is shown that testing monotonicity of a distribution on the two dimensional grid $[m]\times[m]$ (here $N=m^{2}$ ) can be performed using $\widetilde{O}(N^{3/4})$ samples. For higher dimensional grids $[m]^{d}$ (where $N=m^{d}$ ), Bhattacharyya et al. provided an algorithm that uses $\widetilde{O}(m^{d-1/2})=\widetilde{O}(N/\sqrt[2d]{N})$ samples [BFRV10]. Acharya et al. gave an upper bound of $O(\frac{\sqrt{N}}{\epsilon^{2}}+(\frac{d\log m}{\epsilon^{2}})^{d}\cdot\frac{1}{\epsilon^{2}})$ and a lower bound of $\Omega(\sqrt{N}/\epsilon^{2})$ [ADK15]. While their result gives a tight bound of $\Theta(\sqrt{N}/\epsilon^{2})$ when $d$ is relatively small compared to $m$ , it does not yield a tester for Boolean hypercubes using a sublinear number of samples.

Bhattacharyya et al. considered the problem of monotonicity testing over general posets [BFRV10]. In particular, they proposed an algorithm for testing the monotonicity of distributions over hypercubes (where $N=2^{d}$ ) using $\tilde{O}(N/(\log N/\log\log N)^{1/4})$ samples. They provide a lower bound of $\Omega(n^{1-o(1)})$ for testing monotonicity of distributions over a matching of size $n$ , and a lower bound of $\Omega(\sqrt{n})$ when the poset contains a linear-sized matching in the transitive closure of its Hasse digraph.

In addition to the above, testing monotonicity of distributions has been considered in various settings [ACS10, DDS12, Can15]. There are several works on testing various properties, e.g. uniformity, closeness, and independence when the underlying distribution is monotone [BDKR05, BKR04, RS05, DDS+13, AJOS13].

Testing monotonicity of boolean functions is also well studied (e.g., [GGLR98, DGL+99, LR01, FLN+02, CS13, CS14, BB16, BCS18]). In the general regime, the algorithm can query the value of the function at any element in the poset. This ability is in sharp contrast with our model, in which the algorithm only receives samples according to the distribution, which do not directly reveal the probability of the elements. It is known that one can test monotonicity of functions over hypergrids and hypercubes using a number of queries that is polylogarithmic in the size of the domain. This query complexity is exponentially smaller than the sample complexity of testing monotonicity of distributions, demonstrating that there are inherent differences between the two problems.

2 Preliminaries

We use $[n]$ to indicate the set $\{1,2,\ldots,n\}$ . Throughout this paper we use the total variation distance denoted by $d_{TV}$ unless otherwise stated. We also denote the $\ell_{1}$ -distance by $d_{\ell_{1}}$ . For a distribution $p$ , we denote the probability of the domain element $x$ by $p(x)$ . Given a multiset of samples from a distribution on $[n]$ , the histogram of the samples is an $n$ -dimensional vector, $h=(h_{1},h_{2},\ldots,h_{n})$ , where $h_{i}$ is the frequency of the $i$ -th element in the sample set.
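As a small illustration of the histogram defined above, the following sketch computes $h$ from a multiset of samples over $[n]$. The function name `histogram` is ours, chosen for illustration.

```python
from collections import Counter


def histogram(samples, n):
    """Return h = (h_1, ..., h_n), where h_i is the frequency of element i.

    Samples are assumed to be integers in {1, ..., n}.
    """
    counts = Counter(samples)
    return [counts.get(i, 0) for i in range(1, n + 1)]


h = histogram([3, 1, 3, 2, 3], n=4)
print(h)  # [1, 1, 3, 0]
```

The histogram discards the order of the samples but keeps the element labels.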

A poset $G=([n],E)$ is called a line if and only if $E$ contains all the edges $(i,i+1)$ for $1\leq i<n$ . We say a poset is a matching if all of the edges in the poset are vertex-disjoint. We say a poset is bipartite if the set of vertices can be decomposed into two sets, the top set and the bottom set, where no two vertices in the same set are connected. Moreover, the direction of all the edges is from the top set to the bottom set. We use similar terminology for the matching poset as well. In addition, we say a poset $G=(V,E)$ is an $n$ -dimensional hypercube when $V$ is $\{0,1\}^{n}$ and $E$ contains all edges $(u,v)$ where there exists a coordinate $i$ such that $u_{i}=0$ and $v_{i}=1$ and $u_{j}=v_{j}$ for all $j\neq i$ .
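The three posets just described can be written down as explicit edge lists. The sketch below is illustrative only: the function names are ours, and the choice to pair vertices $(1,2),(3,4),\ldots$ in the matching is an assumption, since any set of vertex-disjoint edges qualifies.

```python
from itertools import product


def line_poset(n):
    """Edges (i, i+1) of the line on vertices 1..n."""
    return [(i, i + 1) for i in range(1, n)]


def matching_poset(n):
    """Vertex-disjoint edges (1,2), (3,4), ... on vertices 1..n; n must be even."""
    if n % 2 != 0:
        raise ValueError("this sketch assumes an even number of vertices")
    return [(i, i + 1) for i in range(1, n, 2)]


def hypercube_poset(d):
    """Edges (u, v) of {0,1}^d where v equals u with exactly one 0 flipped to 1."""
    edges = []
    for u in product((0, 1), repeat=d):
        for i in range(d):
            if u[i] == 0:
                v = u[:i] + (1,) + u[i + 1:]
                edges.append((u, v))
    return edges
```

Each of the $2^{d}$ hypercube vertices has $d$ incident edges and every edge is counted once from its lower endpoint, so `hypercube_poset(d)` returns $d\cdot 2^{d-1}$ edges.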

Monotonicity.

A partially-ordered set (poset) is described as a directed graph $G=(V,E)$ , where each edge $(u,v)$ indicates the relationship $u\preceq v$ on the poset. A matching poset is a poset where the underlying graph $G$ is a matching. A distribution $p$ over a poset domain $V=\{v_{1},\ldots,v_{n}\}$ is a distribution over the vertex set $V$ . A distribution $p$ is monotone (with respect to a poset $G$ ) if for every edge $(u,v)\in E$ (i.e., every ordered pair $u\preceq v$ ), $p(u)\leq p(v)$ . Let $\mathsf{Mon}(G)$ be the set of all monotone distributions over the poset $G$ . We say that $p$ is $\epsilon$ -far from monotone if its distance to monotonicity, $d_{TV}(p,\mathsf{Mon}(G))\coloneqq\min_{q\in\mathsf{Mon}(G)}d_{TV}(p,q)$ , is at least $\epsilon$ .
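The definition of monotonicity translates directly into a check over the edge set when the probabilities are known explicitly. This is a minimal sketch with hypothetical names. It is not a tester in the sense of this paper, since a tester only sees samples and never the values $p(v)$.

```python
def is_monotone(p, edges):
    """p maps each vertex to its probability; each edge (u, v) encodes u ⪯ v."""
    return all(p[u] <= p[v] for u, v in edges)


def violated_edges(p, edges):
    """Edges (u, v) with p(u) > p(v), together with the size of the violation."""
    return [(u, v, p[u] - p[v]) for u, v in edges if p[u] > p[v]]


# Toy matching poset on four vertices with edges 1 ⪯ 2 and 3 ⪯ 4.
edges = [(1, 2), (3, 4)]
p = {1: 0.1, 2: 0.4, 3: 0.3, 4: 0.2}
print(is_monotone(p, edges))               # False: p(3) > p(4)
print([e[:2] for e in violated_edges(p, edges)])  # [(3, 4)]
```

Checking the edges is enough: if $p(u)\leq p(v)$ holds along every edge, it holds for every pair related in the transitive closure, because $\leq$ on the reals is transitive.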

Definition 2.1.

Let $p$ be a distribution on poset $G$ and $\epsilon$ be the proximity parameter. Suppose an algorithm, $\mathcal{A}$ , has sample access to $p$ and the full description of poset $G$ . $\mathcal{A}$ is called a monotonicity tester for distributions if the following holds with probability at least $2/3$ .

• If $p$ is monotone, then $\mathcal{A}$ outputs accept.

• If $p$ is $\epsilon$ -far from monotone, then $\mathcal{A}$ outputs reject.

Bigness.

A probability distribution $p$ over a domain $[n]=\{1,\ldots,n\}$ is $T$ -big if, for every domain element $i\in[n]$ , $p(i)\geq T$ . Related notions for distance to $T$ -bigness are defined analogously. The parameter $T$ is called the bigness threshold, and may be omitted if it is clear from the context. Let $\mathsf{Big}(n,T)$ indicate the set of all distributions over $[n]$ that are $T$ -big. We define the distance to $T$ -bigness as $d_{TV}\left(p,\mathsf{Big}(n,T)\right)=\min_{q\in\mathsf{Big}(n,T)}d_{TV}(p,q)$ . If this distance is at least $\epsilon$ , we say the distribution is $\epsilon$ -far from being $T$ -big.
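For an explicitly given probability vector, both $T$ -bigness and the distance to $T$ -bigness are easy to compute. The sketch below rests on an observation that is ours rather than the text's: when $nT\leq 1$ , the minimum in the definition equals the total deficit $\sum_{i}\max(0,T-p(i))$ . Any $T$ -big $q$ must add at least that much mass to the deficient elements, and there is always enough surplus above $T$ elsewhere to supply it. The function names are hypothetical.

```python
def is_big(p, T):
    """True if every entry of the probability vector p is at least T."""
    return all(pi >= T for pi in p)


def distance_to_bigness(p, T):
    """Total variation distance from p to Big(n, T), assuming n * T <= 1.

    Uses the closed form sum_i max(0, T - p_i); see the accompanying text
    for why this equals the minimum over T-big distributions.
    """
    n = len(p)
    if n * T > 1 + 1e-12:
        raise ValueError("Big(n, T) is empty when n * T > 1")
    return sum(max(0.0, T - pi) for pi in p)


p = [0.5, 0.3, 0.15, 0.05]
print(is_big(p, 0.1))               # False: the last element has mass 0.05
print(distance_to_bigness(p, 0.1))  # 0.05
```

Both functions depend only on the multiset of probabilities, which matches the remark in the introduction that bigness is a symmetric property.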

Definition 2.2.

Let $p$ be a distribution on $[n]$ . Suppose Algorithm $\mathcal{A}$ receives threshold $T$ and bigness parameter $\epsilon$ , and has sample access to $p$ . $\mathcal{A}$ is a $T$ -bigness tester if the following is true with probability at least $2/3$ .

• If $p$ is $T$ -big, then $\mathcal{A}$ outputs accept.

• If $p$ is $\epsilon$ -far from $T$ -big, then $\mathcal{A}$ outputs reject.

Also, the $T$ -bigness testing problem refers to the task of distinguishing the above cases with high probability.

Remark 2.3.

Note that the probability $2/3$ is arbitrary in the above definitions. One can amplify the probability of outputting the correct answer to $1-\delta$ by increasing the number of samples by an $O(\log(1/\delta))$ factor.
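The standard way to obtain this amplification is to run the tester independently several times on fresh samples and take a majority vote. The sketch below assumes a hypothetical zero-argument callable `tester` that draws its own fresh samples and returns `True` for accept. The constant 18 comes from Hoeffding's inequality with gap $2/3-1/2=1/6$ , which bounds the failure probability of the majority of $k$ runs by $\exp(-k/18)$ .

```python
import math


def repetitions(delta):
    """Odd number of independent runs k with exp(-k / 18) <= delta, for 0 < delta < 1."""
    k = max(1, math.ceil(18 * math.log(1 / delta)))
    return k if k % 2 == 1 else k + 1


def amplify(tester, delta):
    """Majority vote over independent runs of a tester that is correct w.p. >= 2/3.

    `tester` is a hypothetical callable that uses fresh samples on each call
    and returns True (accept) or False (reject).
    """
    k = repetitions(delta)
    accepts = sum(1 for _ in range(k) if tester())
    return 2 * accepts > k


print(repetitions(0.01))  # 83
```

The total number of samples grows by the factor `repetitions(delta)`, which is $O(\log(1/\delta))$ .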

3 Overview of Our Techniques

In this section, we give an overview of our results and the high-level idea of our techniques.

3.1 A lower bound for the bigness testing problem

In Section 4, we provide two random processes for generating histograms of samples from two families of distributions, such that one family consists of “big” distributions, and the other family largely of “ $\epsilon$ -far from big” distributions. Then, we show that unless a large number of samples have been drawn, the distributions over the histograms generated via these two random processes are statistically very close to each other, and hence appear indistinguishable to any algorithm, as specified precisely in Theorem 3.1. The construction yields a lower bound for the general problem of testing the bigness property in Corollary 3.1. Furthermore, the construction provides a useful building block for establishing further lower bounds for monotonicity testing in various scenarios in Section 5.

To generate histograms from the two families of distributions, imagine the following process: We have two prior distributions $P_{V}$ and $P_{V^{\prime}}$ , and we generate probability vectors (measures), $p$ and $p^{\prime}$ , according to the priors: Each domain element $i$ randomly picks its probability in an i.i.d. fashion from the prior distribution. More precisely, let $V_{1},V_{2},\ldots,V_{n}$ be $n$ i.i.d. random variables from prior $P_{V}$ , then $p$ is defined to be the following:

$p=\frac{1}{n}(V_{1},V_{2},\ldots,V_{n}).$

We generate $p^{\prime}$ similarly according to prior $P_{V^{\prime}}$ . While the total probability is unlikely to sum to $1$ , we will design the priors, $P_{V}$ and $P_{V^{\prime}}$ , so that we can later modify $p$ or $p^{\prime}$ into a probability distribution with only small changes. We then generate histograms of samples from (the normalization of) $p$ by drawing $n$ independent random variables $h_{i}\sim\mathrm{\mathbf{Poi}}(s\cdot p(i))$ (namely $h_{i}\sim\mathrm{\mathbf{Poi}}(sV_{i}/n)$ ) for $i=1,\ldots,n$ , and output $h=(h_{1},\ldots,h_{n})$ as the histogram of the samples. Note that by the Poissonization method, one may view the histogram as being generated from a set of $\mathrm{\mathbf{Poi}}(s\cdot\sum_{i}{V_{i}}/n)$ samples from the normalization of $p$ . Hence, if $\sum_{i}{V_{i}}/n$ is close to one, the histogram serves as a set of roughly $s$ samples. We set $s$ more specifically in terms of the rest of the parameters later.
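The generation process described above can be sketched in a few lines. Everything specific here is an assumption for illustration: the two priors are toy mean-one priors and not the moment-matched priors the paper constructs, the function names are ours, and the Poisson sampler uses Knuth's multiplication method, which is adequate because the rates $sV_{i}/n$ are small.

```python
import math
import random


def poisson(lam, rng):
    """Sample Poi(lam) by Knuth's multiplication method (suitable for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def poissonized_histogram(prior_sampler, n, s, rng):
    """Draw V_1..V_n i.i.d. from the prior, then h_i ~ Poi(s * V_i / n) independently."""
    V = [prior_sampler(rng) for _ in range(n)]
    h = [poisson(s * v / n, rng) for v in V]
    return V, h


def toy_prior_big(rng):
    """Toy prior (assumption): every element gets V = 1, so p is uniform."""
    return 1.0


def toy_prior_far(rng):
    """Toy prior (assumption): V is 0 or 2 with equal probability, so E[V] = 1."""
    return 0.0 if rng.random() < 0.5 else 2.0


rng = random.Random(0)
V, h = poissonized_histogram(toy_prior_big, n=1000, s=200, rng=rng)
V_far, h_far = poissonized_histogram(toy_prior_far, n=1000, s=200, rng=rng)
```

Under Poissonization, the entries of `h` are mutually independent given `V`, which is what makes the per-element analysis possible. The total sample count `sum(h)` is itself random, with mean $s\cdot\sum_{i}V_{i}/n$ .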

The goal in Section 4 is to find two prior distributions $P_{V}$ and $P_{V^{\prime}}$ , then generate two probability vectors $p$ and $p^{\prime}$ , and two histograms $h$ and $h^{\prime}$ according to them respectively, such that the following events hold with high probability.

1. The probability vectors $p$ and $p^{\prime}$ are approximate probability distributions; that is, their total probability masses are each close to $1$ .

2. After scaling the probability vectors $p$ and $p^{\prime}$ above into respective probability distributions, the normalization of $p$ is $T$ -big, and the normalization of $p^{\prime}$ is $\epsilon$ -far from any $T$ -big distribution.

3. The total numbers of (Poissonized) samples in $h$ and $h^{\prime}$ drawn from the normalization of $p$ and $p^{\prime}$ are each $\Omega(s)$ , where $s$ is the sample complexity lower bound we are aiming to prove.

4. Given $h$ or $h^{\prime}$ , distinguishing whether it is generated from $P_{V}$ or $P_{V^{\prime}}$ with success probability $2/3$ requires $h$ or $h^{\prime}$ to contain at least $s$ samples.

5. Additionally, we will bound the largest probability mass $p_{\textrm{max}}$ that the normalized distributions place on any domain element. This part is not necessary for this section, but will be useful for the reduction between monotonicity testing and bigness testing later on.

Now, if we choose $P_{V}$ and $P_{V^{\prime}}$ carefully so that the histograms $h$ and $h^{\prime}$ generated by the above process from $P_{V}$ and $P_{V^{\prime}}$ are hard to distinguish, then we can establish a lower bound for the bigness testing problem. We state this result more formally as the following theorem, which is proved in Section 4.

Theorem 3.1.

For integer $L=O(\log n)$ and sufficiently small $\epsilon=\Omega(1/n)$, there exist a parameter $\beta=\beta(L,\epsilon)$ and two distributions $\mathcal{H}^{+}$ and $\mathcal{H}^{-}$ over the set of possible histograms of size at least $s=\Omega\left(n^{1-1/L}\log^{2}(1/\epsilon)/L\right)$ with the following properties:

• The histogram generated from $\mathcal{H}^{+}$ is drawn from a $1/(\beta n)$-big distribution.

• The histogram generated from $\mathcal{H}^{-}$ is drawn from a distribution which is $\epsilon$-far from any $1/(\beta n)$-big distribution.

• $d_{TV}\left(\mathcal{H}^{+},\mathcal{H}^{-}\right)\leq 0.01$.

• The largest probability mass of any element in any of the probability distributions above (from which the histograms are drawn) is $p_{\textrm{max}}=O(L^{2}/(n\log^{2}(1/\epsilon)))$.

An important case of this theorem is when $L=\Theta(\log n)$ , where we establish a nearly linear sample complexity lower bound of $\Omega(n/\log n)$ for the general problem of bigness testing as follows.

Corollary.

For sufficiently small parameter $\epsilon=\Omega(1/n)$, there exists a parameter $\beta=\beta(\epsilon)$ such that any algorithm that can distinguish whether a distribution over $[n]$ is $1/(\beta n)$-big or $\epsilon$-far from any $1/(\beta n)$-big distribution with probability $2/3$ requires $\Omega(n\log^{2}(1/\epsilon)/\log n)$ samples. In particular, when $\epsilon$ is a constant, $\beta$ is also a constant, and any such algorithm requires $\Omega(n/\log n)$ samples.

We propose the following optimization problem, $\mathrm{\mathbf{OP1}}$ , such that its optimal solution specifies $P_{V}$ and $P_{V^{\prime}}$ , satisfying the requirements of the theorem. Intuitively speaking, as $P_{V}$ aims to generate $T$ -big distributions, we must ensure that $V_{i}$ ’s are bounded away from $1/\beta$ , so that $p(i)=V_{i}/n$ has expected value higher than $T=1/(\beta n)$ . At the same time, we hope to maximize the probability that $V^{\prime}_{i}=0$ so that $p^{\prime}$ has lots of domain elements with probability zero to make its normalization far from any $T$ -big distribution. In addition, we find $P_{V}$ and $P_{V^{\prime}}$ under the constraint that the first $L$ moments of them are exactly matched, as to ensure that the resulting distributions over the histograms, $\mathcal{H}$ and $\mathcal{H}^{\prime}$ , are statistically close. The objective value of this optimization problem corresponds to the expected distance of $p^{\prime}$ to the closest $T$ -big distribution in the $\ell_{1}$ -distance.

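$$\begin{array}{lll}\text{Definition of }\mathrm{\mathbf{OP1}}:&\sup&\frac{1}{\beta}\Pr[V^{\prime}=0]\\ &\text{s.t.}&\mathrm{\mathbf{E}}[V]=\mathrm{\mathbf{E}}[V^{\prime}]=1\\ &&\mathrm{\mathbf{E}}[V^{j}]=\mathrm{\mathbf{E}}[{V^{\prime}}^{j}]\quad\text{for }j=1,2,\ldots,L\\ &&V\in\left[\frac{1+\nu}{\beta},\frac{\lambda}{\beta}\right],\ V^{\prime}\in\{0\}\cup\left[\frac{1+\nu}{\beta},\frac{\lambda}{\beta}\right]\text{ and }\beta>0.\end{array}$$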

In the above optimization problem, the unknowns are $P_{V}$, $P_{V^{\prime}}$, and $\beta$, while $\nu$ and $\lambda$ are two parameters specified later in the proof. That is, we are looking for two distributions $P_{V}$ and $P_{V^{\prime}}$ such that two random variables $V$ and $V^{\prime}$ drawn from them respectively have expected value one, and their first $L$ moments are matched. Also, $\beta$ controls the range of the probabilities $p(i)$ and $p^{\prime}(i)$, as well as the distance to the bigness property.
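To make the constraints of $\mathrm{\mathbf{OP1}}$ concrete, the following Python sketch checks whether a pair of discrete priors and a value of $\beta$ form a feasible solution, and returns the objective $\frac{1}{\beta}\Pr[V^{\prime}=0]$. The function names and the toy instance with $L=2$ are hypothetical and chosen only for illustration; the instance is feasible but is not claimed to be optimal, and it is not the construction used in the proof.

```python
from fractions import Fraction as F

def moment(dist, j):
    """j-th moment of a discrete distribution given as {value: probability}."""
    return sum(prob * value ** j for value, prob in dist.items())

def op1_objective(P_V, P_Vp, beta, nu, lam, L):
    """Return (1/beta) * Pr[V' = 0] if (P_V, P_Vp, beta) satisfies the OP1
    constraints, and raise ValueError otherwise."""
    lo, hi = (1 + nu) / beta, lam / beta
    if beta <= 0 or sum(P_V.values()) != 1 or sum(P_Vp.values()) != 1:
        raise ValueError("need beta > 0 and two probability distributions")
    if any(not lo <= v <= hi for v in P_V):
        raise ValueError("support of V leaves [(1+nu)/beta, lam/beta]")
    if any(v != 0 and not lo <= v <= hi for v in P_Vp):
        raise ValueError("support of V' leaves {0} U [(1+nu)/beta, lam/beta]")
    if moment(P_V, 1) != 1:
        raise ValueError("E[V] != 1")
    if any(moment(P_V, j) != moment(P_Vp, j) for j in range(1, L + 1)):
        raise ValueError("first L moments are not matched")
    return P_Vp.get(0, 0) / beta

# Hypothetical toy instance with L = 2 (an assumption for illustration only).
nu, lam, beta = F(1, 2), F(6), F(3)          # support interval is [1/2, 2]
P_V = {F(1, 2): F(2, 3), F(2): F(1, 3)}      # E[V] = 1, E[V^2] = 3/2
P_Vp = {F(0): F(1, 3), F(3, 2): F(2, 3)}     # E[V'] = 1, E[V'^2] = 3/2
objective = op1_objective(P_V, P_Vp, beta, nu, lam, L=2)
```

Exact rational arithmetic is used so that the moment-matching constraints can be checked with equality rather than with a floating-point tolerance. The same toy pair fails the check for $L=3$, since the third moments differ.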

We relate the optimal solution for $\mathrm{\mathbf{OP1}}$ to an LP defined by [WY16a], who in turn relate their LP to the error from the best polynomial approximation of the function $1/x$ over the interval $[1+\nu,\lambda]$ . By doing this, we show the existence of a solution $(P_{V},P_{V^{\prime}})$ where the value $\Pr[V^{\prime}=0]$ , which is proportional to the distance to $1/(\beta n)$ -bigness in the second family, is sufficiently large.

Our proof relies on and extends the lower bound techniques for estimating support size provided in [WY16a], incorporating specific conditions for the bigness problem. Firstly, unlike the support size estimation problem, we need our distributions to be fully-supported on the domain $[n]$ for the big distributions, whereas in their case, both families of distributions are allowed to be partially supported. Secondly, our optimization problem treats the threshold $1/(\beta n)$ as a variable, whereas the support size problem simply imposes the strict threshold of $1/n$ . Thirdly, based on this construction, we must also give a direct upper bound for the maximum probability, which facilitates our later proofs for providing lower bounds for the matching and hypercube posets.

3.2 From bigness lower bounds to monotonicity lower bounds

In Section 5, we show how to turn our lower bound results for the bigness testing problem in Section 4 into lower bounds for monotonicity testing over some fundamental posets, namely the matching poset and the Boolean hypercube poset.

Matching poset.

To establish our lower bound for testing monotonicity over the matching poset, we construct our distribution $p$ by assigning probability masses to the endpoints of the edges $(u_{i},v_{i})$ of our matching as follows: the vertices $u_{i}$ are assigned probability masses according to the $T$-bigness construction, whereas the vertices $v_{i}$ are each assigned the threshold $T$ as their probability mass; the assigned probabilities are then normalized into a proper probability distribution. We show that before normalization, $p(u_{i})\leq T=p(v_{i})$ if the original distribution is big; and otherwise, the distance to monotonicity of the constructed distribution measures exactly the distance to the $T$-bigness property. We then show that the normalization step scales the entire distribution $p$ down by only a constant factor; hence the lower bound for monotonicity testing over the matching poset with $2n$ vertices asymptotically preserves the parameters $\epsilon$, $s$ and $p_{\textrm{max}}$ of the bigness lower bound construction for $n$ domain elements.

Hypercube poset.

To achieve our results for the Boolean hypercube, we embed our distributions over the matching poset into two consecutive levels $\ell$ and $\ell-1$ of the hypercube (where $\ell$ denotes the number of ones in the vertices’ binary representation). We pair up elements in these levels in such a way that distinct edges of the matching have incomparable endpoints: the algorithm must obtain samples of these matched vertices in order to decide whether the given distribution is monotone or not. We also place probability mass $p_{\textrm{max}}$ on all other vertices on level $\ell$ and above, and probability mass $0$ on all remaining vertices, in order to ensure that the distribution is monotone everywhere else. Lastly, we rescale the entire construction down into a proper probability distribution. Unlike the matching poset, sometimes this scaling factor is super-constant, shrinking the overall distance to monotonicity, $\epsilon$, to sub-constant. Here, we make use of our upper bound on $p_{\textrm{max}}$ of the bigness lower bound construction to determine the scaling factor.

3.3 Reduction from general posets to bipartite graphs

In Section 6, we show that the problem of monotonicity testing of distributions over bipartite posets is essentially the “hardest” case of monotonicity testing in general poset domains. That is, we show that for any distribution $p$ over some poset domain of size $n$, represented as a directed graph $G$, there exists a distribution $p^{\prime}$ over a bipartite poset $G^{\prime}$ of size $2n$ such that (1) $p^{\prime}$ preserves the total variation distance of $p$ to monotonicity up to a small multiplicative constant factor, and (2) each sample for $p^{\prime}$ can be generated using one sample drawn from $p$. These properties together imply the following main theorem of the section.

Theorem 3.3.

Suppose that there exists an algorithm that tests monotonicity of a distribution over a bipartite poset domain of $n$ elements using $s(n,\epsilon)$ samples for any total variation distance parameter $\epsilon>0$ . Then, there exists an algorithm that tests monotonicity of a distribution over any poset domain of $n$ elements using $O(s(2n,\epsilon/4))$ samples.

Our approach may be summarized as follows. We first show, in the theorem below, that we may characterize (up to a constant factor) the distance of $p$ to monotonicity as the weight of the maximum weighted matching on the transitive closure of $G$, denoted by $TC(G)$, where the weight $w(u,v)\coloneqq\max\{p(u)-p(v),0\}$ represents the amount by which $(u,v)$ violates the monotonicity condition. In particular, we have the following theorem:

Theorem.

Consider a poset $G=(V,E)$ and a distribution $p$ over its vertices. Suppose every edge $(u,v)$ in $TC(G)$ has weight $\max(0,p(u)-p(v))$. Then, the total variation distance of $p$ to the closest monotone distribution is within a factor of two of the weight of the maximum weighted matching in $TC(G)$.

This crucial theorem provides a combinatorial way to approximate the distance to monotonicity for general posets, leading to our upcoming construction of $p^{\prime}$ for Theorem 3.3 as well as some algorithms in Section 7. The matching characterization is shown via LP duality: the dual of the LP for optimally “fixing” $p$ to make it monotone turns out to align with the maximum (fractional) matching problem on $G$’s transitive closure. In particular, the dual constraints are of the form $\{Ay\leq b,y\geq 0\}$ where $A$ is a totally unimodular matrix, implying that an integral optimal solution exists, namely the maximum matching.
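The combinatorial quantity in this characterization can be computed directly on small examples. The sketch below builds the transitive closure of a poset given by directed edges, assigns each closure edge $(u,v)$ the weight $\max(0,p(u)-p(v))$, and finds the maximum-weight matching by brute force. The function names and the three-element chain are hypothetical; the brute-force search takes exponential time and is meant only as an illustration, not as the algorithm used in the paper.

```python
from fractions import Fraction as F
from itertools import product

def transitive_closure(vertices, edges):
    """All pairs (u, v), u != v, joined by a directed path from u to v."""
    reach = set(edges)
    # Floyd-Warshall order: the intermediate vertex k is the outermost index.
    for k, u, v in product(vertices, repeat=3):
        if (u, k) in reach and (k, v) in reach:
            reach.add((u, v))
    return {(u, v) for (u, v) in reach if u != v}

def max_weight_matching(weighted_edges):
    """Brute-force maximum-weight matching; small inputs only."""
    if not weighted_edges:
        return F(0)
    (u, v, w), rest = weighted_edges[0], weighted_edges[1:]
    skip = max_weight_matching(rest)
    disjoint = [e for e in rest if not {e[0], e[1]} & {u, v}]
    take = w + max_weight_matching(disjoint)
    return max(skip, take)

def violation_matching_weight(vertices, edges, p):
    """Weight of the maximum-weight matching in TC(G) under
    w(u, v) = max(0, p(u) - p(v))."""
    weighted = [(u, v, max(F(0), p[u] - p[v]))
                for (u, v) in sorted(transitive_closure(vertices, edges))]
    return max_weight_matching(weighted)

# Hypothetical toy poset: the chain a -> b -> c with a decreasing distribution.
chain_vertices, chain_edges = ["a", "b", "c"], [("a", "b"), ("b", "c")]
p = {"a": F(1, 2), "b": F(3, 10), "c": F(1, 5)}
weight = violation_matching_weight(chain_vertices, chain_edges, p)
```

On this chain every two closure edges share a vertex, so the best matching is the single edge $(a,c)$ of weight $3/10$. By the theorem, the distance of this $p$ to monotonicity is within a factor of two of that value.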

To prove Theorem 3.3, given the original poset $G=(V,E)$, we create a bipartite poset with two copies $u^{-}$ and $u^{+}$ of each original vertex $u\in V$: the vertices $u^{-}$ and $u^{+}$ form the bipartition of the new bipartite poset $G^{\prime}$ of size $2n$. We add $(u^{-},v^{+})$ to the bipartite poset if $(u,v)$ is in the transitive closure of $G$; that is, there exists a directed path from $u$ to $v$ in $G$. The new probability distribution $p^{\prime}$ on $G^{\prime}$ is created from $p$ on $G$ by dividing the probability mass $p(u)$ equally between $p^{\prime}(u^{-})$ and $p^{\prime}(u^{+})$. Note that a sample from $p^{\prime}$ is obtained by drawing from $p$ and adding the sign $-/+$ equiprobably. It follows via transitivity that $p^{\prime}$ is monotone over $G^{\prime}$ when $p$ is monotone over $G$, and via the matching characterization above that if $p$ is $\epsilon$-far from monotone on $G$, then $p^{\prime}$ is at least $\epsilon/4$-far from monotone over $G^{\prime}$. These conditions allow us to test monotonicity of $p$ on any general poset $G$ by instead testing monotonicity of $p^{\prime}$ on a bipartite poset $G^{\prime}$ with parameter $\epsilon^{\prime}=\epsilon/4$, as desired.
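The construction of $G^{\prime}$ and $p^{\prime}$ described above is short enough to sketch in code. In the following Python sketch, a copy $u^{-}$ or $u^{+}$ is represented as the pair `(u, "-")` or `(u, "+")`; this encoding, the function names, and the toy chain are assumptions made for illustration.

```python
import random
from itertools import product

def bipartite_reduction(vertices, edges, p):
    """Build G' on the copies (u, '-') and (u, '+'), with an edge (u-, v+)
    whenever v is reachable from u in G, and split each mass p(u) evenly
    between the two copies of u."""
    reach = set(edges)
    for k, u, v in product(vertices, repeat=3):   # k is the outermost index
        if (u, k) in reach and (k, v) in reach:
            reach.add((u, v))
    new_edges = {((u, "-"), (v, "+")) for (u, v) in reach if u != v}
    new_p = {(u, sign): p[u] / 2 for u in vertices for sign in "-+"}
    return new_edges, new_p

def sample_from_reduction(sample_from_p, rng):
    """One sample from p' costs one sample from p plus one fair coin flip."""
    return (sample_from_p(), rng.choice("-+"))

# Hypothetical toy poset: the chain a -> b -> c with a monotone distribution.
vertices, edges = ["a", "b", "c"], [("a", "b"), ("b", "c")]
p = {"a": 0.2, "b": 0.3, "c": 0.5}
new_edges, new_p = bipartite_reduction(vertices, edges, p)

rng = random.Random(1)
weights = [p[u] for u in vertices]
draw = sample_from_reduction(lambda: rng.choices(vertices, weights=weights)[0], rng)
```

Since $p$ is monotone on the chain, every edge $(u^{-},v^{+})$ of $G^{\prime}$ satisfies $p^{\prime}(u^{-})\leq p^{\prime}(v^{+})$ in this example, matching the transitivity argument in the text.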

3.4 Upper bound results

In Section 7, we provide sublinear algorithms for testing bigness, and testing monotonicity of distributions over different poset domains.

Bigness testing.

In Section 7.1, we provide an algorithm for bigness testing. Observe that the $T$ -bigness property is a symmetric property: closed under permutation of the labels of the domain elements $[n]$ . Hence, we leverage the result of [VV17] that learns the counts of elements for each probability mass: $h_{p}(x)=|\{a:p(a)=x\}|$ . Observe that the distance to $T$ -bigness is proportional to the total “deficits” of elements with probability mass below $T$ . Hence, this learned information suffices for constructing an algorithm for testing bigness, using a sub-linear, $O(\frac{n}{\epsilon^{2}\log n})$ , number of samples.
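To illustrate why the counts $h_{p}$ suffice, the sketch below computes $h_{p}$ for a fully known toy distribution and then evaluates the total deficit $\sum_{a}\max(0,T-p(a))$ from $h_{p}$ alone. This shows only the quantity being estimated; it does not implement the learner of [VV17], and the function names and the toy distribution are hypothetical.

```python
from collections import Counter
from fractions import Fraction as F

def mass_histogram(p):
    """h_p(x) = |{a : p(a) = x}|: the number of elements at each mass x."""
    return Counter(p)

def total_deficit(h_p, T):
    """Sum over elements of max(0, T - p(a)), computed from h_p alone."""
    return sum(count * (T - x) for x, count in h_p.items() if x < T)

# Hypothetical toy distribution over n = 4 elements with threshold T = 1/8.
p = [F(1, 2), F(3, 8), F(1, 16), F(1, 16)]
h_p = mass_histogram(p)
deficit = total_deficit(h_p, F(1, 8))
```

Because `total_deficit` reads only the counts, its value is unchanged under any relabeling of the domain, which is the symmetry that the text exploits.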

Monotonicity testing for matchings.

Next, in Section 7.2, we provide an algorithm for testing monotonicity over matching posets. We again resort to the work of [VV17] for learning the counts of elements for each pair of probability masses, with respect to a pair of distributions $p_{1},p_{2}$ over the domain $[n]$, namely $h_{p_{1},p_{2}}(x,y)=|\{a:p_{1}(a)=x,p_{2}(a)=y\}|$, given $O(\frac{n}{\epsilon^{2}\log n})$ samples each from $p_{1}$ and $p_{2}$. We would like to view our distribution $p$ over a matching $G=(S\cup T,E)$ with $E=\{(u_{i},v_{i})\}_{i\in[n]}\subset S\times T$ as a pair of distributions, namely $p_{S}$ and $p_{T}$, representing the probability masses that $p$ places on $u_{i}\in S$ and $v_{i}\in T$, respectively. Learning $h_{p_{S},p_{T}}$ would intuitively allow us to approximate $p$’s distance to monotonicity by summing up the “violation” for pairs $x<y$. However, there are subtle challenges to this approach that are not present in the earlier case of bigness testing.

First, we must somehow rescale $p_{S}$ and $p_{T}$ up into distributions according to the total masses $w_{S}$ and $w_{T}$ that $p$ places on $S$ and $T$. However, it is possible that, say, $w_{S}=o(1)$, making samples from $S$ costly to generate by drawing i.i.d. samples from $p$. We resolve this issue via a reduction to a different distribution $p^{\prime}$ that approximately preserves the distance to monotonicity, while placing comparable total probability masses on $S$ and $T$. Second, the algorithm of [VV17] learns $h_{p_{1},p_{2}}(x,y)$ according to a certain distance function that we must lower-bound by the total variation distance. In particular, this bound must be established in the presence of errors in the scaling factor, as $w_{S}$ and $w_{T}$ are not known to the algorithm. We overcome these technical issues, which yields an algorithm for testing monotonicity over matchings with the same asymptotic sample complexity as that of [VV17].

Monotonicity testing for bounded-degree bipartite graphs.

Moving on, in Section 7.3, we tackle the problem of monotonicity testing in bipartite posets; as shown in Section 6, this bipartite problem captures the monotonicity testing problem of any poset. We make progress towards resolving this problem by offering our solution for the bounded-degree case. We turn the distribution $p$ on a bipartite poset $G$ of maximum degree $\Delta$ , into a distribution $p^{\prime}$ on a matching poset $G^{\prime}$ that approximately preserves the distance to monotonicity: applying the algorithm of Section 7.2 above constitutes a monotonicity test for $p$ with sample complexity $O(\frac{\Delta^{3}n}{\epsilon^{2}\log n})$ .

Our reduction simply places $\Delta$ copies $v_{1},\ldots,v_{\Delta}$ of each vertex $v\in V(G)$ into $V(G^{\prime})$, then for each edge $(u,v)\in E(G)$, connects a pair of unused endpoints $(u_{i},v_{j})$, so as to create a matching subgraph of size $|E(G)|$ on $G^{\prime}$. The probability distribution $p^{\prime}$ on $V(G^{\prime})$ simply distributes the probability mass $p(v)$ equally among all $\Delta$ copies $v_{i}$. (Each remaining, isolated vertex is matched with a dummy $0$-mass vertex, turning $G^{\prime}$ into a matching poset.) This new graph $G^{\prime}$ contains $O(\Delta n)$ vertices, and we show that $d_{TV}(p^{\prime},\mathsf{Mon}(G^{\prime}))\geq d_{TV}(p,\mathsf{Mon}(G))/(2\Delta)$ by explicitly creating a “low-cost” scheme for “fixing” $p$ into a monotone distribution on $G$, based on the optimal scheme that turns $p^{\prime}$ monotone on $G^{\prime}$, charging at most an extra $2\Delta$-multiplicative factor.
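The copy-and-match step can be sketched as follows. The text does not specify how the dummy vertices are oriented, so this sketch assumes that each dummy vertex sits below the isolated copy it is matched with, which guarantees that a dummy edge is never violated. That orientation, the function name, and the toy graph are assumptions made for illustration.

```python
from collections import defaultdict
from fractions import Fraction as F

def matching_reduction(vertices, edges, p, delta):
    """Split every vertex into delta copies, route each edge of G through a
    fresh pair of copies, and spread p(v) evenly over the copies of v."""
    used = defaultdict(int)          # number of copies of each vertex consumed
    new_edges = []
    for (u, v) in edges:
        new_edges.append(((u, used[u]), (v, used[v])))
        used[u] += 1
        used[v] += 1
    if any(count > delta for count in used.values()):
        raise ValueError("some vertex has degree larger than delta")
    new_p = {(v, i): p[v] / delta for v in vertices for i in range(delta)}
    # Assumption: each unused copy is matched with a dummy vertex of mass 0
    # placed below it, so the dummy edge (dummy, copy) is never violated.
    for v in vertices:
        for i in range(used[v], delta):
            dummy = ("dummy", v, i)
            new_p[dummy] = F(0)
            new_edges.append((dummy, (v, i)))
    return new_edges, new_p

# Hypothetical toy bipartite poset with bottom part {a, b} and top part {x, y}.
vertices = ["a", "b", "x", "y"]
edges = [("a", "x"), ("a", "y"), ("b", "y")]
p = {v: F(1, 4) for v in vertices}
new_edges, new_p = matching_reduction(vertices, edges, p, delta=2)
endpoints = [w for e in new_edges for w in e]
```

In this example $G^{\prime}$ has $8$ copies and $2$ dummy vertices, and every vertex is an endpoint of exactly one of the $5$ edges, so $G^{\prime}$ is a matching poset.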

Testing monotonicity of distributions that are uniform on a subset of the domain.

In Section 7.4, we show that for a specific broad family of distributions on directed bipartite graphs of arbitrary degree, we can test monotonicity of such distributions using $O(\frac{n^{2/3}}{\varepsilon}+\frac{1}{\varepsilon^{2}})$ samples. Namely, our result applies to distributions that are uniform on an arbitrary subset of the domain, given that every poset edge is directed from some vertex in the “bottom” part to some vertex in the “top” part of the graph. Our tester performs roughly the following: First, we sample a number of vertices from the graph and throw away the ones that lie in the top part. For the remaining ones in the bottom part, denoted $B$, we identify their neighbors $T$ in the top part, and determine whether or not they all belong to the support of the distribution. Since the distribution is uniform on its support, this condition is sufficient for the distribution to be monotone in the induced subgraph $G[B\cup T]$. The tester accepts when it cannot rule out the possibility that $T$ has the maximum possible probability mass. Recall that if the distribution is $\epsilon$-far from monotone, there must exist a large matching of “violated” edges. To this end, we show that the induced subgraph $G[B\cup T]$ contains many disjoint violated edges, implying that there are many vertices in $T$ outside of the support: the probability mass on $T$ will be noticeably small and the tester will reject.

Upper bound via trying all matchings.

In Section 7.5, we give another upper bound for testing monotonicity of a distribution with respect to a bipartite graph which, in this case, has a small number of induced subgraphs that contain a perfect matching of their vertices. In particular, we show that $O(\frac{\log M}{\epsilon^{2}})$ samples are sufficient for this task, where $M$ is the number of such induced subgraphs. We note that this bound matches the general learning upper bound of $O(n/\epsilon^{2})$ when $M$ attains its maximum value of $2^{\Theta(n)}$, but can potentially be better when $M$ is asymptotically smaller. The main idea of our tester is as follows: if the distribution is $\epsilon$-far from monotone, there exists a matching of violated edges that is $\Theta(\epsilon)$-far from monotone. Hence, for each subgraph of $G$ that admits a perfect matching, we may approximate the weight (violation amount) of this matching by simply comparing the total probability masses between the top part and the bottom part of the subgraph. We approximate these masses with error probability $O(1/M)$ for each subgraph, which allows us to apply a union bound over all subgraphs at the end. Our tester rejects if the weight of one such subgraph exceeds $\epsilon$, and accepts otherwise.

4 A Lower Bound for the Bigness Testing Problem

In this section, we give a lower bound for the bigness testing problem. As described in the overview in Section 3.1, we provide two random processes for generating samples from two families of distributions, such that one family consists of “big” distributions, and the other family largely of “$\epsilon$-far from big” distributions, and then show that they are hard to distinguish.

First, we define a random process that, given a prior distribution, $P_{V}$ , over non-negative numbers, generates a random probability distribution over the domain elements $[n]$ , and then draws samples from it. More specifically, let $V$ be a random variable drawn from $P_{V}$ , and we also use $P_{V}$ to denote the probability density function (PDF) over $V$ ; for now we require $\mathrm{\mathbf{E}}[V]=1$ , and will specify further desired properties momentarily. We generate an approximate probability distribution $p$ according to $P_{V}$ . The distribution $p$ is constructed by having each domain element $i$ choose its probability $p(i)$ , in an i.i.d. fashion, from the prior distribution, $P_{V}$ , over possible probabilities. Then, we construct a histogram of roughly $s$ samples from $p$ according to the following steps:

• Step 1: Generate $n$ i.i.d. random variables $V_{1},V_{2},\ldots,V_{n}$ according to $P_{V}$, then form the following probability vector over $[n]$:

$$p=\frac{1}{n}(V_{1},V_{2},\ldots,V_{n}).$$

Note that, while $p$ is not necessarily a probability distribution under this notion, the condition $\mathrm{\mathbf{E}}[V]=1$ suggests that the total probability mass of $p$ is likely to be concentrated around $1$. So, $p$ is likely to be approximately a probability distribution, and can be normalized into one while modifying the individual entries $p(i)$ by only a small multiplicative factor.

• Step 2: Draw $n$ independent random variables $h_{i}\sim\mathrm{\mathbf{Poi}}(s\cdot p(i))$ (namely $h_{i}\sim\mathrm{\mathbf{Poi}}(sV_{i}/n)$) for $i=1,2,\ldots,n$, and output $h=(h_{1},h_{2},\ldots,h_{n})$ as the histogram of the samples. While we do not explicitly normalize $p$, since $p$ is an approximate probability distribution, this histogram still captures (with high probability) $\Omega(s)$ Poissonized samples drawn from the normalization of $p$.
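The two steps above can be simulated directly. The following Python sketch uses only the standard library; the discrete toy prior with $\mathrm{\mathbf{E}}[V]=1$ is a hypothetical stand-in (it is not the prior obtained from $\mathrm{\mathbf{OP1}}$), and the Poisson sampler is Knuth's multiplication method, chosen here because the means $sV_{i}/n$ are small.

```python
import math
import random

def sample_poisson(mean, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold = math.exp(-mean)
    count, running_product = 0, rng.random()
    while running_product > threshold:
        count += 1
        running_product *= rng.random()
    return count

def generate_histogram(prior_values, prior_probs, n, s, rng):
    """Step 1: draw V_1..V_n i.i.d. from a discrete prior, set p(i) = V_i / n.
    Step 2: draw h_i ~ Poi(s * p(i)) independently for each i."""
    v = rng.choices(prior_values, weights=prior_probs, k=n)
    p = [v_i / n for v_i in v]
    h = [sample_poisson(s * p_i, rng) for p_i in p]
    return p, h

rng = random.Random(0)
# Hypothetical toy prior with E[V] = 1: V = 1/2 w.p. 2/3 and V = 2 w.p. 1/3.
p, h = generate_histogram([0.5, 2.0], [2 / 3, 1 / 3], n=1000, s=200, rng=rng)
total_mass = sum(p)      # close to 1, but typically not exactly 1
num_samples = sum(h)     # distributed as Poi(s * total_mass)
```

The vector `p` is not normalized, mirroring the process in the text: its total mass only concentrates around $1$, and the number of samples in the histogram concentrates around $s$ accordingly.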

The goal in this section is to find two prior distributions $P_{V}$ and $P_{V^{\prime}}$ that generate two probability vectors $p$ and $p^{\prime}$ according to the above process such that, after normalization, $p$ and $p^{\prime}$ have the desired properties: $p$ is big (every $p(i)$ is at least the threshold $T$), and $p^{\prime}$ is $\epsilon$-far from any big distribution ($p^{\prime}$ contains a significant number of entries $i$ with $p^{\prime}(i)=0$). Then, we generate two histograms $h$ and $h^{\prime}$ according to $p$ and $p^{\prime}$ respectively. If the histograms $h$ and $h^{\prime}$ are hard to distinguish, then we can establish a lower bound for the bigness property. This requirement will show up as constraints for designing the two prior distributions, $P_{V}$ and $P_{V^{\prime}}$, to achieve these families of distributions with high probability. Below, we summarize the conditions that we need the prior distributions to satisfy (with high probability):

1. The probability vectors $p$ and $p^{\prime}$ are approximate probability distributions; that is, all of their coordinates are non-negative and their total probability masses are each close to one.

2. After scaling the probability vectors $p$ and $p^{\prime}$ above into respective probability distributions, the normalization of $p$ is $T$-big, and the normalization of $p^{\prime}$ is $\epsilon$-far from any $T$-big distribution.

3. The total numbers of (Poissonized) samples in $h$ and $h^{\prime}$ drawn from the normalization of $p$ and $p^{\prime}$ are each $\Omega(s)$.

4. Given $h$ or $h^{\prime}$, distinguishing whether it is generated from $P_{V}$ or $P_{V^{\prime}}$ with success probability $2/3$ requires $h$ or $h^{\prime}$ to contain a large number of samples.

5. Additionally, we will bound the largest probability mass $p_{\textrm{max}}$ that the normalized distributions place on any domain element. This part is not necessary for this section, but will be useful for the reduction between monotonicity testing and bigness testing later on.

We state this result as the following theorem.

See Theorem 3.1.

Proof.

Let positive values $\nu$ , $\lambda$ , $\beta$ , and a positive integer $L$ be a set of parameters with the following property that we determine more precisely later:

[TABLE]

Throughout this section, we consider the bigness threshold $T=1/(\beta n)$, and note that the value $\beta$ itself may depend on the error parameter $\epsilon$ and the number of matched moments $L$. Note also that $\nu$ is a constant.

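We propose the following optimization problem, $\mathrm{\mathbf{OP1}}$, such that its optimal solution, specifying $P_{V}$ and $P_{V^{\prime}}$, satisfies the requirements of the theorem. Recall that $p$ and $p^{\prime}$ are generated by drawing $n$ i.i.d. samples, the $V_{i}$’s and $V^{\prime}_{i}$’s, from $P_{V}$ and $P_{V^{\prime}}$ respectively:

$$\begin{array}{lll}\text{Definition of }\mathrm{\mathbf{OP1}}:&\sup&\frac{1}{\beta}\Pr[V^{\prime}=0]\\ &\text{s.t.}&\mathrm{\mathbf{E}}[V]=\mathrm{\mathbf{E}}[V^{\prime}]=1\\ &&\mathrm{\mathbf{E}}[V^{j}]=\mathrm{\mathbf{E}}[{V^{\prime}}^{j}]\quad\text{for }j=1,2,\ldots,L\\ &&V\in\left[\frac{1+\nu}{\beta},\frac{\lambda}{\beta}\right],\ V^{\prime}\in\{0\}\cup\left[\frac{1+\nu}{\beta},\frac{\lambda}{\beta}\right]\text{ and }\beta>0.\end{array}$$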

Intuitively speaking, as $P_{V}$ aims to generate $T$-big distributions, we must ensure that the $V_{i}$’s are bounded away from $1/\beta$, so that $p(i)=V_{i}/n$ has expected value higher than $T=1/(\beta n)$. At the same time, we hope to maximize the probability that $V^{\prime}_{i}=0$ so that $p^{\prime}$ is far from any $T$-big distribution, under the constraint that the first $L$ moments of $P_{V}$ and $P_{V^{\prime}}$ are exactly matched, so as to ensure that the resulting distributions of histograms $\mathcal{H}$ and $\mathcal{H}^{\prime}$ are statistically close. The objective value of this optimization problem corresponds to the expected distance of $p^{\prime}$ to the closest $T$-big distribution in total variation distance. To clarify the notation, $\lambda$ and $\nu$ are given to us. The unknown variables in $\mathrm{\mathbf{OP1}}$ are the PDFs $P_{V}$ and $P_{V^{\prime}}$ of two random variables $V$ and $V^{\prime}$, respectively, as well as the scaling variable $\beta>0$. The parameter $\lambda$ roughly specifies the ratio between the largest and the smallest non-zero probabilities that $p$ and $p^{\prime}$ can take.

Footnote 1: Note that $P_{V}$ and $P_{V^{\prime}}$ are on a continuous domain. However, $P_{V^{\prime}}$ will additionally have a non-negligible probability mass placed at value $0$. In fact, it turns out that in the optimal solution, $P_{V}$ and $P_{V^{\prime}}$ are only supported on a few distinct values ($\Theta(L)=O(\log n)$ of them), so the optimal $P_{V}$ and $P_{V^{\prime}}$ assume the role of probability mass functions rather than PDFs.

[TABLE]

In the following lemma, we find the optimal value of $\mathrm{\mathbf{OP1}}$ . We use $\mathrm{\mathbf{OPT}}(A)$ to refer to the optimal value of optimization problem $A$ .

Lemma 4.1.

For any $\nu$ and $\lambda$ such that $0<1+\nu<\lambda$ , there exists a scaling parameter, $\beta$ , in $[1+\nu,\min(\lambda,1/\mathrm{\mathbf{OPT}}(\mathrm{\mathbf{OP1}}))]$ such that

[TABLE]

The proof of Lemma 4.1 is postponed to Section 4.1.

Let the value of $\beta$ be determined by the above lemma, and set $d$ to be $\mathrm{\mathbf{OPT}}(\mathrm{\mathbf{OP1}})$ .

Recall our wish list of five properties for the priors, $P_{V}$ and $P_{V^{\prime}}$, that we proposed in the introduction of Section 4. We define the following “good” events, which hold with high probability, to formalize the properties of the generated vectors $p$ and $p^{\prime}$.

[TABLE]

and

[TABLE]

where $r$ is the number of elements $i$ such that $V^{\prime}_{i}$ is zero. Roughly speaking, these events state that $p=\frac{1}{n}(V_{i})_{i\in[n]}$ and $p^{\prime}=\frac{1}{n}(V^{\prime}_{i})_{i\in[n]}$, generated in Step 1, are approximate probability distributions (having total masses in $[1-\nu,1+\nu]=\Theta(1)$), and Step 2 generates sufficient numbers of samples in the histogram (at least $s(1-\nu)/2=\Omega(s)$ each). Further, $p^{\prime}$ consists of as many as $r\geq\beta nd/2$ elements with probability mass $0$, and thus is at distance at least $rT\geq d/2$ from any $T$-big distribution; we will set $d\geq 2\epsilon$ to reach the desired result.

In the following lemma, we show that conditioning on $E$ and $E^{\prime}$ , after running the process using the priors $P_{V}$ and $P_{V^{\prime}}$ , the generated histogram $h$ is a sufficiently large set of samples from a $1/(\beta n)$ -big distribution, and histogram $h^{\prime}$ is a sufficiently large set of samples from a distribution which is $\epsilon$ -far from any $1/(\beta n)$ -big distribution. In addition, the total variation distance between the distribution over $h$ ’s and $h^{\prime}$ ’s is bounded when $P_{V}$ , $P_{V^{\prime}}$ form a solution of $\mathrm{\mathbf{OP1}}$ . More precisely, let $\mathcal{H}$ denote the distribution over histograms $h$ generated by the process when the prior is $P_{V}$ , and let $\mathcal{H}_{E}$ be the distribution over histograms $h$ conditioning on $E$ . We define $\mathcal{H}^{\prime}$ and $\mathcal{H}_{E^{\prime}}$ similarly. In the following lemma, we bound the total variation distance between $\mathcal{H}_{E}$ and $\mathcal{H}^{\prime}_{E^{\prime}}$ as well.

Lemma 4.2.

Let $P_{V},P_{V^{\prime}}$ , and $\beta\in[1,1/d]$ form a solution of $\mathrm{\mathbf{OP1}}$ with objective value $d\geq 2\epsilon$ . Suppose $P_{V}$ and $P_{V^{\prime}}$ are the prior distributions to generate histograms $h$ and $h^{\prime}$ according to the process. Then, $h$ given event $E$ is a histogram of a set of at least $s(1-\nu)/2$ samples from a $1/(\beta n)$ -big distribution, whereas $h^{\prime}$ given $E^{\prime}$ is a histogram of a set of at least $s(1-\nu)/2$ samples that are drawn from a distribution which is $\epsilon$ -far from any $1/(\beta n)$ -big distribution. Moreover,

[TABLE]

Lastly, the largest probability mass of any element in any of the probability distributions (from which the samples are drawn) is at most $\lambda/(n(1-\nu))$.

The proof of Lemma 4.2 is given in Section 4.2.

Now, we assign the parameters, $\nu,\lambda$ , and $s$ , as follows:

[TABLE]

Recall that we set $d$ to be the optimal value of $\mathrm{\mathbf{OP1}}$ , and Lemma 4.1 tells us its value. We show that in this setting $d$ is at least $2\epsilon$ . Let $\rho$ be $\sqrt{\lambda/(1+\nu)}$ . Then, we have:

[TABLE]

as long as $\rho\geq 1.5$. It is not hard to see that $\rho\geq 1.5$ holds for sufficiently large $n$ and $\epsilon\geq c/n$, where $c$ is a sufficiently large constant; this yields $d\geq 2\epsilon$ for every $\epsilon\leq c_{0}$, where $c_{0}<1/2$ is a constant.

Let $\mathcal{H}^{+}$ and $\mathcal{H}^{-}$ be $\mathcal{H}_{E}$ and $\mathcal{H}^{\prime}_{E^{\prime}}$ respectively. By Lemma 4.2, the total variation distance between $\mathcal{H}^{+}$ and $\mathcal{H}^{-}$ is at most $0.01$, while $s$ and $p_{\textrm{max}}$ behave according to the claimed respective asymptotic bounds. Hence, the proof is complete. ∎

An important case of Theorem 3.1 is when $L=\Theta(\log n)$ , where we establish a near-linear sample complexity lower bound of $\Omega(n/\log n)$ for the general problem of bigness testing as follows.

See the corollary stated in Section 3.1.

Proof.

By Theorem 3.1, there exist $\mathcal{H}^{+}$ and $\mathcal{H}^{-}$ with the aforementioned properties. Any $1/(\beta n)$ -bigness tester has to distinguish between $\mathcal{H}^{+}$ and $\mathcal{H}^{-}$ with probability at least 2/3. On the other hand, the total variation distance between $\mathcal{H}^{+}$ and $\mathcal{H}^{-}$ is at most 0.01. Therefore, no algorithm can distinguish between them while receiving $s/4=\Theta(n\log^{2}(1/\epsilon)/\log n)$ samples with probability more than $(1+0.01)/2$ . Therefore, testing $1/(\beta n)$ -bigness requires $\Omega(n\log^{2}(1/\epsilon)/\log n)$ samples.
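The bound $(1+0.01)/2$ used above is the standard relation between total variation distance and the success probability of a distinguisher. As a sketch of that step, let $A$ be any (possibly randomized) test with outputs in $\{+,-\}$, and suppose its input is drawn from $\mathcal{H}^{+}$ or $\mathcal{H}^{-}$ with probability $1/2$ each; the test $A$ and this uniform choice are notation introduced only for this illustration. Then

$$\Pr[A\text{ is correct}]=\frac{1}{2}+\frac{1}{2}\Big(\Pr_{h\sim\mathcal{H}^{+}}[A(h)=+]-\Pr_{h\sim\mathcal{H}^{-}}[A(h)=+]\Big)\leq\frac{1}{2}+\frac{1}{2}\,d_{TV}\left(\mathcal{H}^{+},\mathcal{H}^{-}\right)\leq\frac{1+0.01}{2}<\frac{2}{3}.$$

A tester that succeeds with probability at least $2/3$ on every input distribution would also succeed with probability at least $2/3$ on average, which this inequality rules out.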

Note that in the proof of Theorem 3.1, $\beta$ is determined by Lemma 4.1, and it is bounded by $1/\epsilon$ . Thus, if $\epsilon$ is a constant then $\beta$ is also a constant. Thus, the required sample complexity becomes $\Omega(n/\log n)$ . ∎

4.1 Proof of Lemma 4.1

See Lemma 4.1.

Proof.

To prove the lemma, we introduce an auxiliary linear program ($\mathrm{\mathbf{LP2}}$) whose optimal value is known to equal the right-hand side of the above equation. We then prove that $\mathrm{\mathbf{LP2}}$ has the same optimal objective value as $\mathrm{\mathbf{OP1}}$, which establishes the lemma. For two given parameters $\nu$ and $\lambda$, we define the following LP over two random variables $X,X^{\prime}$.

[TABLE]

To interpret this LP, regard the PDFs of the random variables $X$ and $X^{\prime}$ as the unknowns. Thus, for any number $x$ in $[1+\nu,\lambda]$, we want to find $P_{X}(x)$ and $P_{X^{\prime}}(x)$. Note that this optimization problem is linear since all the expectations above are linear functions of $P_{X}$ and $P_{X^{\prime}}$. Moreover, there is an implicit constraint here that the integrals of $P_{X}$ and $P_{X^{\prime}}$ should each be one, since they are probability distributions.

Observe that there exists a trivial solution where $X$ and $X^{\prime}$ are two identically-distributed random variables, so $\mathrm{\mathbf{LP2}}$ is feasible and its optimal objective value is at least zero. Let $\mathcal{X}^{*}$ and $\mathcal{X}^{\prime*}$ be a pair of random variables forming an optimal solution for $\mathrm{\mathbf{LP2}}$ , and let $\beta^{*}=1/\mathrm{\mathbf{E}}[1/\mathcal{X}^{*}]$ . Since all $X$ and $X^{\prime}$ are in $[1+\nu,\lambda]$ , then $\beta^{*}$ is also in $[1+\nu,\lambda]$ . On the other hand, since $\mathcal{X}^{\prime*}$ is positive and bounded, then $\mathrm{\mathbf{E}}[1/\mathcal{X}^{\prime*}]>0$ and thus $\mathrm{\mathbf{E}}[1/\mathcal{X}^{*}]>\mathrm{\mathbf{OPT}}(\mathrm{\mathbf{LP2}})$ ; hence $\beta^{*}$ is at most $1/\mathrm{\mathbf{OPT}}(\mathrm{\mathbf{LP2}})$ .

Now, we argue that $\mathrm{\mathbf{LP2}}$ and $\mathrm{\mathbf{OP1}}$ have the same optimal value. We introduce two new random variables $\mathcal{V}^{*}$ and $\mathcal{V}^{\prime*}$ with the following PDFs, and later we show they form an optimal solution for $\mathrm{\mathbf{OP1}}$ .

[TABLE]

In the above equations, with a slight abuse of notation we say that $1/v$ is zero for $v=0$ ; that is, the probability mass for $v=0$ is given by the respective second terms. Since ${\beta^{*}}$ is defined to be $1/\mathrm{\mathbf{E}}[1/\mathcal{X}^{*}]$ , the second term in $P_{\mathcal{V}^{*}}$ is zero for all $v$ in particular for $v=0$ . We define our notation in this fashion in order to make the calculations for $\mathcal{V}^{*}$ and $\mathcal{V}^{\prime*}$ analogous, so we may write our proof compactly.

Now, we show that the proposed variables $\mathcal{V}^{*}$ , $\mathcal{V}^{\prime*}$ and ${\beta^{*}}$ form a feasible solution for $\mathrm{\mathbf{OP1}}$ . First, we show that the domains of $\mathcal{V}^{*}$ and $\mathcal{V}^{\prime*}$ are as stated in the definition of $\mathrm{\mathbf{OP1}}$ in Equation 1. Then, we show that $P_{\mathcal{V}^{*}}$ and $P_{\mathcal{V}^{\prime*}}$ are probability distributions, and we prove that the constraints of $\mathrm{\mathbf{OP1}}$ hold as well.

First, consider the domain of the random variables. Clearly the domain does not include the numbers where the PDF is zero, so we prove that $P_{\mathcal{V}^{*}}$ and $P_{{\mathcal{V}^{\prime}}^{*}}$ are (potentially) non-zero only when $\mathcal{V}^{*}$ and $\mathcal{V}^{\prime*}$ are in the range specified by the domain constraints of $\mathrm{\mathbf{OP1}}$ . Recall that the second term in $P_{\mathcal{V}^{*}}$ is always zero. Thus, $P_{\mathcal{V}^{*}}(v)$ can be non-zero only if $x=\beta^{*}v$ has non-zero probability according to $P_{\mathcal{X}^{*}}$ . Therefore, $\mathcal{V}^{*}$ is always in $[(1+\nu)/{\beta^{*}},\lambda/{\beta^{*}}]$ . For $\mathcal{V}^{\prime*}$ , in addition to the values $v\in[(1+\nu)/{\beta^{*}},\lambda/{\beta^{*}}]$ , $v$ could be zero as well since the second term in the definition of $P_{\mathcal{V}^{\prime*}}$ may be non-zero at $v=0$ . Thus, $\mathcal{V}^{\prime*}$ is always in $\{0\}\cup[(1+\nu)/{\beta^{*}},\lambda/{\beta^{*}}]$ .

In addition, $P_{\mathcal{V}^{*}}$ (and similarly $P_{{\mathcal{V}^{\prime}}^{*}}$ ) is a probability distribution since the integral of the PDF is one:

[TABLE]

where the second equality is derived by substituting $v$ with $x/{\beta^{*}}$ .

Now, we focus on the constraints of $\mathrm{\mathbf{OP1}}$ . The first constraint is $\mathrm{\mathbf{E}}[\mathcal{V}^{*}]=\mathrm{\mathbf{E}}[\mathcal{V}^{\prime*}]=1$ . Below we show that the expected value of $\mathcal{V}^{*}$ is $1$ .

[TABLE]

One can similarly show that $\mathrm{\mathbf{E}}[{\mathcal{V}^{\prime}}^{*}]=1$ , and the constraint holds.

The second constraint is that the first $L$ moments of $\mathcal{V}^{*}$ and $\mathcal{V}^{\prime*}$ are matched: $\mathrm{\mathbf{E}}[{\mathcal{V}^{*}}^{j}]=\mathrm{\mathbf{E}}[{\mathcal{V}^{\prime*}}^{j}]$ for $j$ in $[L]$ . The previous constraint implies that the first moments, $\mathrm{\mathbf{E}}[{\mathcal{V}}^{*}]$ and $\mathrm{\mathbf{E}}[{\mathcal{V}^{\prime}}^{*}]$ , are equal, so here we focus on the second and higher moments. Fix $j$ in $\{2,\ldots,L\}$ . For the $j$ -th moment of $\mathcal{V}^{*}$ , we have:

[TABLE]

We can similarly show the same condition for $\mathrm{\mathbf{E}}[{\mathcal{V}^{\prime*}}^{j}]$ . Since $\mathcal{X}^{*}$ and $\mathcal{X}^{\prime*}$ satisfy the moment matching constraints of $\mathrm{\mathbf{LP2}}$ , we derive the moment matching constraints of $\mathrm{\mathbf{OP1}}$ as follows:

[TABLE]

Therefore, $\mathcal{V}^{*}$ , $\mathcal{V}^{\prime*}$ and ${\beta^{*}}$ form a feasible solution for $\mathrm{\mathbf{OP1}}$ . Thus, the objective function according to $\mathcal{V}^{*}$ , $\mathcal{V}^{\prime*}$ is at most the optimal value of $\mathrm{\mathbf{OP1}}$ :

[TABLE]

On the other hand, the objective values of $\mathrm{\mathbf{OP1}}$ and $\mathrm{\mathbf{LP2}}$ are the same on the two solutions we discussed:

[TABLE]

where the last equality is true, since we chose $X$ and $X^{\prime}$ to be the optimal solution of $\mathrm{\mathbf{LP2}}$ at the beginning.

[TABLE]

We continue the proof by showing that the above inequality is true in the other direction, i.e., $\mathrm{\mathbf{OPT}}(\mathrm{\mathbf{OP1}})$ is at most $\mathrm{\mathbf{OPT}}(\mathrm{\mathbf{LP2}})$ . Let $P_{\mathcal{V}}$ , $P_{\mathcal{V}^{\prime}}$ and $\beta$ form a feasible solution for $\mathrm{\mathbf{OP1}}$ . We define random variables $\mathcal{X}$ and $\mathcal{X}^{\prime}$ with the following PDFs, and show that they form a feasible solution for $\mathrm{\mathbf{LP2}}$ in Equation 2 with the same objective value as $\mathcal{V}$ and $\mathcal{V}^{\prime}$ in $\mathrm{\mathbf{OP1}}$ :

[TABLE]

First, we show that the domain of $\mathcal{X}$ and $\mathcal{X}^{\prime}$ matches with the domain constraint in $\mathrm{\mathbf{LP2}}$ . Similar to the previous part, we prove that the PDF’s are zero outside the interval specified by the domain constraint $[1+\nu,\lambda]$ . Observe that $P_{\mathcal{X}}(x)$ is non-zero if and only if $x$ and $P_{\mathcal{V}}(x/\beta)$ are both non-zero, so $x/\beta$ has to be in $[(1+\nu)/\beta,\lambda/\beta]$ . Thus, the domain of the random variable $\mathcal{X}$ (and similarly $\mathcal{X}^{\prime}$ ) is $[1+\nu,\lambda]$ .

Moreover, note that $P_{\mathcal{X}}$ (and similarly $P_{\mathcal{X}^{\prime}}$ ) is a probability distribution:

[TABLE]

where the equation is derived by replacing $x/\beta$ with a new variable $v$ . Now, we show that the constraints of $\mathrm{\mathbf{LP2}}$ are satisfied for $\mathcal{X}$ and $\mathcal{X}^{\prime}$ . Fix $j\in[L-1]$ . We show the $j$ -th moment of $\mathcal{X}$ and $\mathcal{X}^{\prime}$ are equal:

[TABLE]

Similarly, one can show $\mathrm{\mathbf{E}}[{\mathcal{X}^{\prime}}^{j}]$ is equal to $\beta^{j}\mathrm{\mathbf{E}}[{\mathcal{V}^{\prime}}^{j+1}]$ . Since the pair $\mathcal{V}$ and $\mathcal{V}^{\prime}$ satisfies the moment matching constraints of $\mathrm{\mathbf{OP1}}$ , then $\mathrm{\mathbf{E}}[\mathcal{V}^{j+1}]$ is equal to $\mathrm{\mathbf{E}}[{\mathcal{V}^{\prime}}^{j+1}]$ . Therefore, $\mathrm{\mathbf{E}}[\mathcal{X}^{j}]$ is equal to $\mathrm{\mathbf{E}}[{\mathcal{X}^{\prime}}^{j}]$ .

Now, we focus on the objective functions of the $\mathrm{\mathbf{OP1}}$ and $\mathrm{\mathbf{LP2}}$ . We have:

[TABLE]

Now that for any feasible solution of $\mathrm{\mathbf{OP1}}$ , there exists a feasible solution for $\mathrm{\mathbf{LP2}}$ with the same objective value, one can conclude that the optimal value of $\mathrm{\mathbf{OP1}}$ , $\mathrm{\mathbf{OPT}}(\mathrm{\mathbf{OP1}})$ , is at most $\mathrm{\mathbf{OPT}}(\mathrm{\mathbf{LP2}})$ . Thus by Equation 3, we have:

[TABLE]

which implies that $\mathcal{V}^{*}$ , $\mathcal{V}^{\prime*}$ and $\beta^{*}$ also form an optimal solution for $\mathrm{\mathbf{OP1}}$ , and hence $\mathrm{\mathbf{OPT}}(\mathrm{\mathbf{OP1}})$ and $\mathrm{\mathbf{OPT}}(\mathrm{\mathbf{LP2}})$ are equal. This also implies that $\beta^{*}$ is at most $1/\mathrm{\mathbf{OPT}}(\mathrm{\mathbf{OP1}})$ .
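The correspondence between solutions of $\mathrm{\mathbf{LP2}}$ and $\mathrm{\mathbf{OP1}}$ can be illustrated on a small discrete instance. The sketch below rests on assumptions, since the displayed PDFs are not reproduced in this text: it takes the map to be $\Pr[\mathcal{V}=x/\beta]=(\beta/x)\Pr[\mathcal{X}=x]$ with any leftover mass placed at $0$, which is consistent with the relations $\mathrm{\mathbf{E}}[\mathcal{X}^{j}]=\beta^{j}\mathrm{\mathbf{E}}[\mathcal{V}^{j+1}]$ and $\beta^{*}=1/\mathrm{\mathbf{E}}[1/\mathcal{X}^{*}]$ used in the proof, and it takes the $\mathrm{\mathbf{LP2}}$ objective to be $\mathrm{\mathbf{E}}[1/X]-\mathrm{\mathbf{E}}[1/X^{\prime}]$. The two-point example (with $L=2$) and all function names are hypothetical.

```python
from fractions import Fraction

def moment(dist, j):
    """E[Z^j] for a finite distribution given as {value: probability}."""
    return sum(prob * value**j for value, prob in dist.items())

def inverse_mean(dist):
    """E[1/Z] for a finite distribution supported on positive values."""
    return sum(prob / value for value, prob in dist.items())

def x_to_v(dist_x, beta):
    """Assumed map from X to V: Pr[V = x/beta] = (beta/x) * Pr[X = x],
    with any leftover probability mass placed at 0."""
    dist_v = {x / beta: beta / x * prob for x, prob in dist_x.items()}
    leftover = 1 - sum(dist_v.values())
    if leftover != 0:
        dist_v[Fraction(0)] = leftover
    return dist_v

# Hypothetical feasible pair on [2, 4] with matching first moments (both means are 3).
X = {Fraction(2): Fraction(1, 2), Fraction(4): Fraction(1, 2)}
X_prime = {Fraction(3): Fraction(1)}

beta = 1 / inverse_mean(X)                    # beta* = 1 / E[1/X]
V = x_to_v(X, beta)
V_prime = x_to_v(X_prime, beta)

lp2_objective = inverse_mean(X) - inverse_mean(X_prime)
op1_objective = V_prime.get(Fraction(0), Fraction(0)) / beta   # (1/beta) * Pr[V' = 0]
print(V, V_prime, lp2_objective, op1_objective)
```

In this instance $\mathcal{V}$ has no mass at zero, both variables have mean $1$, their second moments agree, and the two objective values coincide, as the proof predicts.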

In Appendix E of [WY16b], Wu and Yang proved that an optimal solution of $\mathrm{\mathbf{LP2}}$ can be obtained through the best polynomial approximation of the function $1/x$ . More formally, they showed that there exists a solution for $\mathrm{\mathbf{LP2}}$ with the following optimal value:

[TABLE]

where $\mathcal{P}_{L-1}$ is the set of all degree $L-1$ polynomials. The optimal polynomial approximation error has been studied in [KVZ12] and in Sec. 2.11.1 of [Tim63]. They computed the maximum error of the best degree $L-1$ polynomial approximation. More precisely, we have:

[TABLE]

Hence, the proof is complete. ∎

4.2 Proof of Lemma 4.2

Before stating the lemma, we review the definitions we used so far. Recall that $p$ and $p^{\prime}$ are generated by drawing $n$ i.i.d samples, $V_{i}$ ’s and $V^{\prime}_{i}$ ’s, from $P_{V}$ and $P_{V^{\prime}}$ respectively:

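$$p=\frac{1}{n}(V_{1},V_{2},\dots,V_{n}),\qquad p^{\prime}=\frac{1}{n}(V^{\prime}_{1},V^{\prime}_{2},\dots,V^{\prime}_{n}),$$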

and $E$ and $E^{\prime}$ were the desired events:

[TABLE]

and

[TABLE]

where $r$ was the number of elements $i$ for which $V^{\prime}_{i}$ is zero. We generate histograms $h$ and $h^{\prime}$ according to $p$ and $p^{\prime}$ respectively. Let $\mathcal{H}$ denote the distribution over histograms $h$ generated by the process when the prior is $P_{V}$ , and let $\mathcal{H}_{E}$ be the distribution over histograms $h$ conditioned on $E$ . We define $\mathcal{H}^{\prime}$ and $\mathcal{H}_{E^{\prime}}$ similarly. In the following lemma, we prove “good properties” for $p$ and $p^{\prime}$ after normalization and also bound the total variation distance between $\mathcal{H}_{E}$ and $\mathcal{H}^{\prime}_{E^{\prime}}$ .

See 4.2

Proof.

First, we show that given event $E$ , the normalization of $p$ is a $1/(\beta n)$ -big distribution. From $\mathrm{\mathbf{OP1}}$ , we know that the $V_{i}$ ’s are in $[(1+\nu)/\beta,\lambda/\beta]$ , and the $V^{\prime}_{i}$ ’s are in $\{0\}\cup[(1+\nu)/\beta,\lambda/\beta]$ . Observe that $p(i)$ after normalization is at least the following:

[TABLE]

where the last inequality is due to the fact that $\sum_{j}V_{j}/n$ is at most $1+\nu$ . Thus, the normalization of $p$ is $1/(\beta n)$ -big. On the other hand, we can achieve the same lower bound for the normalized value of $p^{\prime}(i)$ when $V^{\prime}_{i}$ is not zero, so the normalization of $p^{\prime}$ places either probability mass zero, or at least $1/(\beta n)$ , on each element. Similarly, the maximum probability mass among the normalization of $p$ ’s and $p^{\prime}$ ’s is at most

[TABLE]

because $\beta\geq 1$ , yielding the desired bound on the maximum probability mass.

Next, we show that given $E^{\prime}$ , the normalization of $p^{\prime}$ is $\epsilon$ -far from any big distribution. Note that if $V^{\prime}_{i}$ is zero, then the probability $p^{\prime}(i)$ remains zero even after normalization. So, there are exactly $r$ elements that have probability mass zero and the rest (based on the above argument) each have probability mass at least $1/(\beta n)$ . Thus, the total variation distance to $1/(\beta n)$ -bigness is at least $r/(\beta n)$ , and given $E^{\prime}$ it is at least $d/2\geq\epsilon$ .

Now, we show the distance between $\mathcal{H}_{E}$ and $\mathcal{H}^{\prime}_{E^{\prime}}$ is bounded. By the triangle inequality we have:

[TABLE]

where the superscript $c$ on the events $E$ and $E^{\prime}$ indicates the complementary event. Now, we start with bounding the probability of the complementary events of $E$ and $E^{\prime}$ from above to show that they happen with small probability. Since the $V_{i}$ ’s (and similarly the $V^{\prime}_{i}$ ’s) are independently drawn from $P_{V}$ with expected value $1$ , and they are in the range $[0,\lambda/\beta]$ , by the Chebyshev inequality we have:

[TABLE]

Recall that $d$ was the optimal value of $\mathrm{\mathbf{OP1}}$ . Thus, $\Pr[V^{\prime}_{i}=0]$ is $\beta d$ . Moreover, $r$ , the number of the $V^{\prime}_{i}$ ’s that are zero, is a Binomial random variable with $\mathrm{\mathbf{E}}[r]=n\cdot\Pr[V^{\prime}_{i}=0]$ , which is $\beta nd$ . Thus, by the Chernoff bound, we have:

[TABLE]

Finally, we show that the total number of samples is large with high probability. Assume we already have that $\sum_{i=1}^{n}V_{i}/n$ is at least $1-\nu$ . Then the total number of samples $\sum_{i=1}^{n}h_{i}$ is a Poisson random variable with mean $t\coloneqq s\sum_{i=1}^{n}V_{i}\geq s(1-\nu)$ . By the tail bound for Poisson distributions proved in [Can17] (if $X$ is a Poisson random variable with mean $\lambda$ , then for any $t>0$ , we have $\Pr\left[X\leq\lambda-t\right]\leq\exp\left({-\frac{t^{2}}{\lambda+t}}\right)$ ), we have

[TABLE]

One can achieve a similar result for $\sum_{i=1}^{n}h^{\prime}_{i}$ .

Now, we continue bounding the distance between $\mathcal{H}_{E}$ and $\mathcal{H}^{\prime}_{E^{\prime}}$ . $\mathcal{H}^{(i)}$ (and similarly ${\mathcal{H}^{\prime}}^{(i)}$ ) indicates the distribution over the $i$ -th coordinate of the histogram, $h_{i}$ . By the previous inequality, we have:

[TABLE]

where the last inequality follows from the fact that the first $L$ moments of $P_{V}$ and $P_{V^{\prime}}$ are matched. Indeed, by Lemma 6 in [WY16a], we have:

[TABLE]

Hence, the proof is complete. ∎

5 From Bigness to Monotonicity

In this section, we show how to turn our lower bound results for the bigness testing problem in the previous section into lower bounds for monotonicity testing in some fundamental posets, namely the matching poset and the Boolean hypercube poset. See Section 3.2 for the proof overviews.

5.1 Monotonicity testing on a matching poset

Theorem 5.1.

Consider the pair of distributions $\mathcal{N}^{+}$ , $\mathcal{N}^{-}$ for the bigness problem as specified in Theorem 3.1 with bigness threshold $T=O(1/n)$ , number of samples $s$ , and maximum probability $p_{\textrm{max}}$ . There exists a distribution on a matching of size $n$ with maximum probability $p_{\textrm{max}}^{\prime}=\Theta(p_{\textrm{max}})$ such that testing, with success probability $2/3$ , whether a matching randomly drawn from such a distribution is monotone or $\epsilon^{\prime}=\Theta(\epsilon)$ -far from any monotone distribution, requires $s^{\prime}=\Omega(s)$ samples.

Proof.

Let $U=\{u_{1},u_{2},\ldots,u_{n}\}$ and $V=\{v_{1},v_{2},\ldots,v_{n}\}$ form the vertex set of a directed matching $M_{n}$ of size $n$ where the edges are $(v_{i},u_{i})$ ’s for $i=1,2,\ldots,n$ . Consider a distribution over the matching poset $G=(U\cup V,\{(v_{i},u_{i})|i\in[n]\})$ ; more specifically, the distribution is monotone if and only if the probabilities satisfy $p(u_{i})\geq p(v_{i})$ for all $i$ . We apply the Poissonization technique, then prove our lower bound by contradiction: assume there exists an algorithm $\mathcal{A}$ which tests monotonicity of distributions over the matching of size $n$ using $\mathrm{\mathbf{Poi}}(s^{\prime})$ samples where $s^{\prime}=o(s)$ and successfully distinguishes whether the distribution is monotone or $\epsilon^{\prime}\coloneqq\epsilon/(2(1+nT))$ -far from monotone with probability at least 2/3. To reach the desired contradiction, we turn these samples into $s^{\prime}(1+nT)$ samples for the $T$ -bigness testing problem, and show that one can test $T$ -bigness using $\mathcal{A}$ as a black-box tester. Note that $T=O(1/n)$ , so the factor $1+nT$ is $\Theta(1)$ in this proof.

Assume we have a distribution, $p$ , over $[n]$ elements for which we wish to test the bigness property. We construct a distribution $q_{p}$ over a matching over $U\cup V$ based on $p$ as follows:

$$q_{p}(u_{i})=\frac{p(i)}{1+nT},\qquad q_{p}(v_{i})=\frac{T}{1+nT}\qquad\mbox{for }i\in[n].$$

Clearly the maximum probability of $q_{p}$ is at most $p_{\textrm{max}}^{\prime}\coloneqq p_{\textrm{max}}/(1+nT)$ . Next we compare the distance to monotonicity in the case that $p$ is $T$ -big with the case that $p$ is $\epsilon$ -far from $T$ -big. If $p$ is a $T$ -big distribution, then $q_{p}(u_{i})\geq T/(1+nT)\geq q_{p}(v_{i})$ and thus $q_{p}$ is monotone.

Next, if $p$ is $\epsilon$ -far from any $T$ -big distribution, then we show that $q_{p}$ is $\epsilon/(2(1+nT))$ -far from any monotone distribution. Let $S$ be the set of elements for which $p(i)<T$ . Clearly, to make $p$ a $T$ -big distribution, one has to increase all the $p(i)$ to $T$ for $i\in S$ and there is no need to increase the probability of any other elements. Therefore, the total variation distance of $p$ to $\mathsf{Big}(n)$ is exactly $\sum_{i\in S}(T-p(i))$ assuming $T\leq 1/n$ . Let $q^{\prime}$ be the closest monotone distribution to $q_{p}$ , and observe that $q^{\prime}(u_{i})\geq q^{\prime}(v_{i})$ . We compute:

[TABLE]
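The relation between the distance of $p$ to $T$ -bigness and the violations of $q_{p}$ on the matching can be checked on a toy instance. This is a minimal sketch under stated assumptions: it takes $q_{p}(u_{i})=p(i)/(1+nT)$ and $q_{p}(v_{i})=T/(1+nT)$, which is what the sampling procedure and the bound $q_{p}(u_{i})\geq T/(1+nT)\geq q_{p}(v_{i})$ in this proof suggest, and the function names and the four-element distribution are hypothetical. Each violated edge $(v_{i},u_{i})$ forces a change of at least $q_{p}(v_{i})-q_{p}(u_{i})$ in $\ell_{1}$ -distance, so half the total violation is a lower bound on the total variation distance to monotonicity.

```python
from fractions import Fraction

def distance_to_bigness(p, T):
    """Total variation distance from p to the T-big distributions, using the
    formula sum over {i : p(i) < T} of (T - p(i)), valid when T <= 1/n."""
    return sum(T - x for x in p if x < T)

def matching_distribution(p, T):
    """Assumed construction: q(u_i) = p(i)/(1+nT) and q(v_i) = T/(1+nT)."""
    n = len(p)
    scale = 1 + n * T
    return [x / scale for x in p], [T / scale] * n

def is_monotone_on_matching(q_u, q_v):
    """Monotone on the matching means q(u_i) >= q(v_i) for every edge (v_i, u_i)."""
    return all(u >= v for u, v in zip(q_u, q_v))

def violation_mass(q_u, q_v):
    """Total amount by which the edges violate q(u_i) >= q(v_i)."""
    return sum(max(v - u, 0) for u, v in zip(q_u, q_v))

# Hypothetical instance with n = 4 and T = 1/8 <= 1/n.
T = Fraction(1, 8)
p = [Fraction(1, 2), Fraction(1, 4), Fraction(3, 16), Fraction(1, 16)]
q_u, q_v = matching_distribution(p, T)

eps = distance_to_bigness(p, T)              # 1/16
lower_bound = violation_mass(q_u, q_v) / 2   # eps / (2 (1 + nT))
print(eps, lower_bound, is_monotone_on_matching(q_u, q_v))
```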

Finally we show that the assumed algorithm $\mathcal{A}$ may be used to test the $T$ -bigness property of $p$ . Suppose we are given access to $\mathrm{\mathbf{Poi}}(s^{\prime})$ independent samples from the distribution $p$ for which we want to test the $T$ -bigness property. We construct a distribution $q_{p}$ as described above: to obtain $\mathrm{\mathbf{Poi}}(s^{\prime}(1+nT))$ samples from $q_{p}$ , for each $i\in[n]$ , we create $\mathrm{\mathbf{Poi}}(s^{\prime}\cdot p(i))$ and $\mathrm{\mathbf{Poi}}(s^{\prime}\cdot T)$ samples of $u_{i}$ and $v_{i}$ respectively. The $\mathrm{\mathbf{Poi}}(s^{\prime}\cdot p(i))$ samples of each $u_{i}$ may be obtained by substituting each element $i$ with $u_{i}$ in the $\mathrm{\mathbf{Poi}}(s^{\prime})$ samples from $p$ , whereas the $\mathrm{\mathbf{Poi}}(s^{\prime}\cdot T)$ samples of the $v_{i}$ ’s may be generated directly by drawing $v_{i}$ ’s uniformly at random. Thus, using $\mathrm{\mathbf{Poi}}(s^{\prime})$ samples from $p$ , one can construct $\mathrm{\mathbf{Poi}}(\Omega(s^{\prime}))$ samples from $q_{p}$ and use $\mathcal{A}$ for testing the monotonicity of $q_{p}$ over the matching poset, which corresponds to testing the $T$ -bigness of $p$ , yielding a contradiction by the fact that bigness testing requires $\Omega(s)$ samples by Theorem 3.1. ∎
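The sample conversion used in this reduction can be sketched as follows. This is an illustrative sketch rather than the authors' code: the helper names, the encoding of $u_{i}$ and $v_{i}$ as tagged pairs, and the toy parameters are all assumptions, and the Poisson sampler is Knuth's multiplication method, chosen only to keep the example free of external dependencies.

```python
import math
import random

def poisson(mean, rng):
    """Sample Poi(mean) with Knuth's multiplication method (adequate for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def matching_samples(samples_from_p, n, T, s_prime, rng):
    """Turn Poi(s') samples from p (indices in range(n)) into Poi(s'(1+nT)) samples
    from q_p, encoded as ('u', i) or ('v', i)."""
    out = [("u", i) for i in samples_from_p]                 # relabel element i as u_i
    for i in range(n):
        out.extend([("v", i)] * poisson(s_prime * T, rng))   # Poi(s'T) copies of v_i
    rng.shuffle(out)
    return out

# Hypothetical toy instance.
rng = random.Random(0)
n, T, s_prime = 4, 1 / 8, 40
p = [0.5, 0.25, 0.1875, 0.0625]
from_p = rng.choices(range(n), weights=p, k=poisson(s_prime, rng))
converted = matching_samples(from_p, n, T, s_prime, rng)
print(len(from_p), len(converted))
```

No sample from $p$ is discarded: every sample becomes a $u_{i}$ sample, and the $v_{i}$ samples are generated without touching $p$ at all.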

This result, applied with Theorem 3.1 using $L=\Theta(\log n)$ (where $s=\Omega\left(n\ln^{2}(1/\epsilon)/\log n\right)$ , $p_{\textrm{max}}=O((\log^{2}n)/(n\ln^{2}(1/\epsilon)))$ and $T=1/(\beta n)\in[\epsilon/n,1/n]$ ), immediately yields the following lower bound for testing monotonicity in a matching poset.

Corollary 5.2.

For sufficiently small parameter $\epsilon=\Omega(1/n)$ , any algorithm that can distinguish whether a distribution over a matching poset on $2n$ vertices is monotone, or $\epsilon$ -far from any monotone distribution, with probability $2/3$ requires $\Omega((n\ln^{2}(1/\epsilon))/\log n)$ samples. Moreover, the maximum probability mass of the distribution in the lower bound construction can be bounded above by $O((\log^{2}n)/(n\ln^{2}(1/\epsilon)))$ .

5.2 Monotonicity testing on a hypercube poset

Consider the Boolean hypercube poset $\{0,1\}^{d}$ with $N=2^{d}$ vertices. For convenience, let $\mathcal{C}$ and $\mathcal{S}$ denote the distribution of distributions implicitly constructed in the lower bound of Theorem 5.1, where distributions in $\mathcal{C}$ are monotone, and distributions in $\mathcal{S}$ are $\epsilon$ -far from any monotone distribution, respectively. Theorem 5.1 shows that randomly-drawn distributions from $\mathcal{C}$ and $\mathcal{S}$ generate statistically similar histograms over the matching poset. For simplicity, we do not distinguish the parameters $\epsilon$ , $s$ and $p_{\textrm{max}}$ in Theorem 3.1 and Theorem 5.1 as they are equivalent up to a constant factor.

5.2.1 General lower bound for monotonicity testing on a hypercube poset

We first establish the theorem that describes the result of the outlined embedding approach, then later apply this result to achieve interesting special cases.

Theorem 5.3.

Let an integer $\ell\geq 1$ be a parameter. Suppose that there exists a pair $(\mathcal{C},\mathcal{S})$ of distribution of distributions over a matching on $n={\binom{d-1}{\ell-1}}$ pairs of vertices, forming an instance for the monotonicity problem with distance $\epsilon$ , a maximum probability $p_{\textrm{max}}$ , and a lower bound of $s$ samples. Then, testing monotonicity on the Boolean hypercube of size $N=2^{d}$ with distance parameter $\epsilon/W$ requires $\Omega(sW)$ samples, where $s=\Omega((n\ln^{2}(1/\epsilon))/\log n)$ and $W=1+\Theta((\log^{2}n)/(n\ln^{2}(1/\epsilon)))\cdot\left(\sum_{i=\ell}^{d}{\binom{d}{i}}-{\binom{d-1}{\ell-1}}\right)$ .

Proof.

Consider two consecutive levels $\ell$ and $\ell-1$ of a hypercube, where the $\ell^{\textrm{th}}$ level consists of vertices whose coordinates contain exactly $\ell$ ones. Our approach is to embed our matching onto these levels in the hypercube, so that each edge of the matching has one endpoint in each of the two levels, and each endpoint is mutually incomparable to any endpoint of any other edge.

We choose our coordinates for the embedding as follows. We pick all the vertices such that there are exactly $\ell-1$ ones among the first $d-1$ coordinates. Let $M$ denote the set of these vertices. There are exactly $2\cdot{{d-1}\choose{\ell-1}}$ vertices in the set $M$ . Clearly, each vertex in $M$ is comparable with the vertex whose coordinate only differs at the last bit. Furthermore, it is incomparable with the rest of the vertices in $M$ , as other coordinates also have $\ell-1$ ones on the first $d-1$ bits.
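The embedding can be verified by brute force on a small hypercube. The sketch below is illustrative: the function names and the choice $d=5$, $\ell=3$ are assumptions. It enumerates the set $M$ as pairs of vertices that differ only in the last bit and checks that two vertices of $M$ are comparable exactly when they belong to the same pair.

```python
from itertools import combinations
from math import comb

def embedded_matching(d, ell):
    """Pairs (lower, upper) of hypercube vertices whose first d-1 coordinates hold
    exactly ell-1 ones; lower has last bit 0 and upper has last bit 1."""
    pairs = []
    for ones in combinations(range(d - 1), ell - 1):
        prefix = tuple(1 if i in ones else 0 for i in range(d - 1))
        pairs.append((prefix + (0,), prefix + (1,)))
    return pairs

def leq(x, y):
    """Coordinate-wise partial order on {0,1}^d."""
    return all(a <= b for a, b in zip(x, y))

def comparable(x, y):
    return leq(x, y) or leq(y, x)

d, ell = 5, 3
pairs = embedded_matching(d, ell)
vertices = [v for pair in pairs for v in pair]   # lower, upper, lower, upper, ...
print(len(pairs), len(vertices))
```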

Next we describe the probabilities assigned to each vertex on the hypercube, given $p$ , the distribution over a matching (drawn from $\mathcal{C}$ or $\mathcal{S}$ ). First we assign the probabilities to $M$ according to $p$ . Namely, the set of coordinates of $M$ with $\ell$ ones corresponds to $U$ and that with $\ell-1$ ones corresponds to $V$ , where $U$ and $V$ are as defined in the previous proof. Then, for the remaining vertices in level $\ell$ and above, assign the probability of $c\cdot((\log^{2}n)/(n\ln^{2}(1/\epsilon)))$ for a sufficiently large $c$ such that the quantity becomes at least $p_{\textrm{max}}$ . Let $W=1+\Theta((\log^{2}n)/(n\ln^{2}(1/\epsilon)))\cdot\left(\sum_{i=\ell}^{d}{d\choose i}-{{d-1}\choose{\ell-1}}\right)$ be the total probability assigned to all these vertices so far. We divide all assigned probabilities by $W$ to finally obtain a distribution over the hypercube. We denote the constructed distribution over the hypercube by $p_{H}$ .

Clearly, the proposed construction preserves the monotonicity due to the incomparability between distinct embedded matching edges. In particular, if the distribution over the matching is drawn from $\mathcal{C}$ , the distribution over the hypercube will still be monotone; if it is drawn from $\mathcal{S}$ , then the distance to monotonicity is now $\epsilon/W$ since, at the very least, the subposet restricted to the embedded matching must be modified to a monotone distribution over this matching.

Using Corollary 5.2, any algorithm that can test the monotonicity of $p_{H}$ requires $\Omega(s)$ samples from the matching vertices. Note that a sample drawn from $p_{H}$ falls in the matching with probability $1/W$ . Therefore, $\mathrm{\mathbf{Poi}}(sW)$ samples from the hypercube are required in order to obtain $\mathrm{\mathbf{Poi}}(s)$ samples from the matching. This yields the lower bound of $\Omega(sW)$ samples for testing monotonicity over the hypercube poset. ∎

5.2.2 Applications of Theorem 5.3

We extend Theorem 5.3 into the following two corollaries. Firstly, we consider embedding our matching to the largest possible levels of the hypercube, namely the middle ones, showing the lower bound of $\Omega(Nd)$ samples for $\epsilon=\Theta(1/d^{2.5})$ (Corollary 5.4). To complement this first corollary that only handles sub-constant $\epsilon$ , we secondly apply our construction to higher levels of the hypercube, and readjust the construction from Theorem 3.1 so that $L=\Theta(1)$ moments are matched (as opposed to $\Theta(\log n)$ ). This approach shows the lower bound of $\Omega(N^{1-\delta})$ for testing monotonicity on the hypercube poset with distance parameter $\epsilon$ , such that $\delta\rightarrow 0$ as $\epsilon\rightarrow 0$ (Corollary 5.5).

Corollary 5.4.

For sufficiently small $\epsilon=\Theta(1/d^{2.5})$ , any algorithm that can distinguish whether a distribution over a Boolean hypercube poset of size $N=2^{d}$ is monotone, or $\epsilon$ -far from any monotone distribution, with success probability $2/3$ requires $\Omega(Nd)$ samples.

Proof.

Let $\ell$ be $\lceil{d/2}\rceil$ . As we stated in the proof of Theorem 5.3, we embed a matching of size $n\coloneqq{d-1\choose\ell-1}$ onto the middle layer of the hypercube where $n$ is at least $\Omega(N/\sqrt{d})=\Omega(N/\sqrt{\log N})$ by Stirling’s approximation. We have

[TABLE]

Applying Theorem 5.3, we achieve our lower bound of $\Omega(Nd)$ for $\epsilon=\Theta(1/d^{2.5})$ by choosing a sufficiently small constant $\epsilon^{\prime}$ . ∎
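The claim that the middle-layer matching has size $n=\Omega(N/\sqrt{d})$ can be sanity-checked numerically. The following is a small sketch with a hypothetical helper name; it only evaluates the ratio $n\sqrt{d}/N$ for a few dimensions and does not replace the Stirling argument.

```python
from math import ceil, comb, sqrt

def middle_matching_size(d):
    """n = C(d-1, ell-1) for the middle level ell = ceil(d/2)."""
    ell = ceil(d / 2)
    return comb(d - 1, ell - 1)

for d in (10, 20, 40, 80):
    n, N = middle_matching_size(d), 2**d
    # The ratio n * sqrt(d) / N should stay bounded away from zero.
    print(d, n * sqrt(d) / N)
```

For the dimensions tried here the ratio stays roughly around 0.4, consistent with $n=\Theta(N/\sqrt{d})$.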

Corollary 5.5.

Any algorithm that can distinguish whether a distribution over a Boolean hypercube poset of size $N=2^{d}$ is monotone, or $\epsilon$ -far from any monotone distribution, with success probability $2/3$ requires $\Omega(N^{1-\delta})$ samples, where $\epsilon$ and $\delta=\Theta(\sqrt{\epsilon})+o(1)$ are constants. In particular, $\delta\rightarrow 0$ as $\epsilon\rightarrow 0$ .

Proof.

Without loss of generality assume $d$ is even. Otherwise, observe that when $d$ is odd, we may embed a hypercube of size $2^{d}$ in a hypercube of size $2^{d+1}$ and achieve the same lower bound up to a constant factor. Consider $\ell\geq d/2$ . Observe that

[TABLE]

This yields the inequality

[TABLE]

We pick $\ell=d/2+\alpha d$ for some constant $0.24>\alpha>0$ so that $\sum_{i=\ell}^{d}{d\choose i}=\Theta\left({d\choose\ell}\right)$ . The embedded matching is of size $n={{d-1}\choose{\ell-1}}=\frac{\ell}{d}{d\choose\ell}=\Theta\left({d\choose\ell}\right)$ , since $1/2\leq\ell/d\leq 1$ .
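Both facts used here can be checked directly with exact integer arithmetic. The sketch below uses hypothetical helper names and the illustrative choice $\alpha=0.1$. It verifies the identity ${{d-1}\choose{\ell-1}}=\frac{\ell}{d}{d\choose\ell}$ and evaluates the ratio of the upper tail $\sum_{i\geq\ell}{d\choose i}$ to its first term ${d\choose\ell}$. For $\ell=(1/2+\alpha)d$ consecutive terms shrink by a factor below $(1/2-\alpha)/(1/2+\alpha)$, so a geometric series bounds the ratio by $(1/2+\alpha)/(2\alpha)$, which is $3$ for $\alpha=0.1$.

```python
from math import comb

def identity_holds(d, ell):
    """Check C(d-1, ell-1) = (ell/d) * C(d, ell) without fractions."""
    return d * comb(d - 1, ell - 1) == ell * comb(d, ell)

def tail_ratio(d, ell):
    """(sum over i >= ell of C(d, i)) / C(d, ell)."""
    return sum(comb(d, i) for i in range(ell, d + 1)) / comb(d, ell)

alpha = 0.1   # illustrative choice inside (0, 0.24)
for d in (20, 40, 100):
    ell = d // 2 + round(alpha * d)
    print(d, ell, identity_holds(d, ell), tail_ratio(d, ell))
```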

Next, consider the application of Theorem 5.1 leveraging Theorem 3.1 with constant parameters $\epsilon$ and $L$ , yielding the lower bound of $s=\Omega(n^{1-1/L}/L)$ samples for $p_{\textrm{max}}=O(L^{2}/n)=O(L^{2}/{d\choose\ell})$ . We compute $W=1+\Theta\left(L^{2}/{d\choose\ell}\right)\cdot\Theta\left({d\choose\ell}\right)=\Theta(L^{2})$ . Applying Theorem 5.3, we achieve the lower bound of $\Omega(n^{1-\frac{1}{L}}L)$ for testing monotonicity over the hypercube with $\epsilon=\Theta(1/L^{2})$ .

Recall that $\ell=d/2+\alpha d$ . Using a similar argument as above, we can also bound

[TABLE]

establishing the lower bound of $\widetilde{\Omega}(N^{(1+\alpha\log(1-4\alpha))(1-\frac{1}{L})})=\Omega(N^{1-\delta})$ for testing monotonicity over the hypercube poset, where $\delta=1/L-\alpha\log(1-4\alpha)+o(1)$ . Since $\epsilon=\Theta(1/L^{2})$ , for sufficiently large $N$ , we may choose sufficiently small $\alpha$ and large $L$ , so that $\delta=\Theta(\sqrt{\epsilon})+o(1)$ , as desired. ∎

6 Reduction from General Posets to Bipartite Graphs

In this section, we show that the problem of monotonicity testing of distributions over the bipartite posets is essentially the “hardest” case of monotonicity testing in general poset domains. That is, we show that for any distribution $p$ over some poset domain of size $n$ , represented as a directed graph $G$ , there exists a distribution $p^{\prime}$ over a bipartite poset $G^{\prime}$ of size $2n$ such that (1) $p^{\prime}$ preserves the total variation distance of $p$ to monotonicity up to a small multiplicative constant factor, and (2) each sample for $p^{\prime}$ can be generated using one sample drawn from $p$ . These properties together imply the following main theorem of this section.

\generaltobipartite

Proof.

Consider an arbitrary poset described as a directed graph $G=(V,E)$ , and an associated probability distribution $p$ over $V$ . We construct a bipartite graph $G^{\prime}=(V^{\prime},E^{\prime})$ based on the transitive closure of $G$ , denoted by $TC(G)$ , and a distribution $p^{\prime}$ over $V^{\prime}$ such that testing the monotonicity of $p$ over $V$ is roughly equivalent to testing the monotonicity of $p^{\prime}$ over $V^{\prime}$ .

The construction of the bipartite $G^{\prime}=(V^{\prime},E^{\prime})$ is as follows: for each $v\in V$ , we add two vertices $v^{+}$ and $v^{-}$ to $V^{\prime}$ , so that $S\coloneqq\{v^{+}\}_{v\in V}$ and $T\coloneqq\{v^{-}\}_{v\in V}$ together form the bipartition $V^{\prime}\coloneqq S\cup T$ . Think of $S$ and $T$ as the set of top and bottom vertices respectively. Next, consider two vertices $u$ and $v$ such that there is a path from $u$ to $v$ in $G$ (i.e., $(u,v)$ is an edge in $TC(G)$ ). For every such pair, we add the directed edge $(u^{-},v^{+})$ to $E^{\prime}$ . Given the distribution $p$ over $V$ , we set $p^{\prime}(v^{+})=p^{\prime}(v^{-})=p(v)/2$ . Observe that we can generate a sample from $p^{\prime}$ using a sample from $p$ : if $v$ is drawn from $p$ , a sample for $p^{\prime}$ is obtained by picking either $v^{+}$ or $v^{-}$ , each with probability $1/2$ .
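The construction above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the encoding of $v^{+}$ and $v^{-}$ as pairs `(v, "+")` and `(v, "-")`, and the three-element chain are assumptions, and the transitive closure is taken over paths between distinct vertices.

```python
import random
from fractions import Fraction

def transitive_closure(vertices, edges):
    """All pairs (u, v) with u != v such that v is reachable from u."""
    succ = {v: set() for v in vertices}
    for u, v in edges:
        succ[u].add(v)
    closure = set()
    for start in vertices:
        stack, seen = list(succ[start]), set()
        while stack:
            w = stack.pop()
            if w not in seen:
                seen.add(w)
                stack.extend(succ[w])
        closure |= {(start, w) for w in seen if w != start}
    return closure

def bipartite_poset(vertices, edges, p):
    """Edges (u-, v+) for every (u, v) in TC(G), and p'(v+) = p'(v-) = p(v)/2."""
    new_edges = {((u, "-"), (v, "+")) for u, v in transitive_closure(vertices, edges)}
    p_new = {}
    for v in vertices:
        p_new[(v, "+")] = p[v] / 2
        p_new[(v, "-")] = p[v] / 2
    return new_edges, p_new

def sample_bipartite(sample_from_p, rng):
    """One sample from p' using exactly one sample from p."""
    return (sample_from_p(), rng.choice("+-"))

# Hypothetical chain a <= b <= c carrying a monotone distribution.
vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
p = {"a": Fraction(1, 6), "b": Fraction(1, 3), "c": Fraction(1, 2)}
new_edges, p_new = bipartite_poset(vertices, edges, p)
print(sorted(new_edges))
```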

Now, we prove that testing monotonicity of $p$ is equivalent to testing monotonicity of $p^{\prime}$ . If $p$ is monotone, then $p^{\prime}$ is also monotone: for each $(u^{-},v^{+})\in E^{\prime}$ , $p(u)\leq p(v)$ via the transitivity of monotonicity of $p$ along the $u$ - $v$ path on $G$ . So, $p^{\prime}(u^{-})=p(u)/2\leq p(v)/2=p^{\prime}(v^{+})$ .

Next, suppose $p$ is $\epsilon$ -far from monotone. By Lemma 6.1 (shown below), there exists a (directed) matching $M$ in $TC(G)$ , such that

[TABLE]

Then, the set of edges $(u^{-},v^{+})$ ’s corresponding to $(u,v)\in M$ also forms a matching, $M^{\prime}$ , on $G^{\prime}$ . Let $p^{\prime*}$ be the monotone distribution on $G^{\prime}$ closest to $p^{\prime}$ . Since $p^{\prime*}$ is a monotone distribution, for an edge $(u^{-},v^{+})$ , $p^{\prime*}(v^{+})$ is at least $p^{\prime*}(u^{-})$ . Then, by the triangle inequality, we obtain:

[TABLE]

Note that the second to last inequality is true since $p^{\prime*}$ is monotone, and $p^{\prime*}(v^{+})$ has to be at least $p^{\prime*}(u^{-})$ . Therefore, if $p$ is $\epsilon$ -far from monotone, then $p^{\prime}$ is $\epsilon/4$ -far from monotone.

Thus, to distinguish whether $p$ is monotone or $\epsilon$ -far from any monotone distribution on $G$ , it suffices to test if $p^{\prime}$ is monotone or $\epsilon/4$ -far from any monotone distribution on the bipartite poset $G^{\prime}$ . ∎

An interesting byproduct of Equation 4 is the following: if we take the violation of monotonicity on each edge to be the weight of that edge, then the weight of the maximum weighted matching is the distance of the distribution to monotonicity. We state this formally in the following theorem.

\distToMonMatching

Proof.

Let $W$ denote the weight of the maximum weighted matching. Fix a matching $M$ of $k$ edges $(u_{i},v_{i})$ . Assume $p^{\prime}$ is the closest monotone distribution to $p$ , so $p^{\prime}(u_{i})\leq p^{\prime}(v_{i})$ for every edge $(u_{i},v_{i})$ . One can show the following:

[TABLE]

where the last inequality is true, because the above is true for any matching $M$ . On the other hand by Lemma 6.1, there exists a (directed) matching $M_{0}$ in $TC(G)$ , such that

[TABLE]

Thus, the proof is complete. ∎
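For very small posets, the weight of the maximum weighted matching can be computed by brute force. The sketch below is illustrative only: the function name and the three-element chain are assumptions, the enumeration is exponential in the number of violated edges, and the example makes no claim beyond this single instance.

```python
from fractions import Fraction
from itertools import combinations

def max_violation_matching(tc_edges, p):
    """Brute force over all matchings in TC(G); an edge (u, v), meaning u below v,
    has weight p(u) - p(v). Only violated edges (positive weight) can help."""
    violated = [(u, v) for u, v in tc_edges if p[u] > p[v]]
    best = Fraction(0)
    for k in range(1, len(violated) + 1):
        for subset in combinations(violated, k):
            endpoints = [x for edge in subset for x in edge]
            if len(endpoints) == len(set(endpoints)):      # vertex-disjoint edges
                best = max(best, sum(p[u] - p[v] for u, v in subset))
    return best

# Hypothetical chain a <= b <= c with a decreasing (hence non-monotone) distribution.
p = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
tc_edges = [("a", "b"), ("b", "c"), ("a", "c")]
weight = max_violation_matching(tc_edges, p)

uniform = {v: Fraction(1, 3) for v in p}                   # a monotone distribution
l1_to_uniform = sum(abs(p[v] - uniform[v]) for v in p)
print(weight, l1_to_uniform)
```

In this example the best matching is the single edge $(a,c)$ with weight $1/3$, which coincides with the $\ell_{1}$ -distance from $p$ to the uniform distribution.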

6.1 Proof of auxiliary lemmas

Lemma 6.1.

Let $p$ be a probability distribution over the vertex set $V$ of an unweighted directed graph $G=(V,E)$ representing a poset. Then, there exists a matching $M$ on the transitive closure $TC(G)$ such that

[TABLE]

Proof.

Define $\epsilon$ to be the $\ell_{1}$ -distance of $p$ to monotonicity. We need to show the following:

[TABLE]

Let $f^{*}$ be the monotone function on $G$ closest to $p$ (in the $\ell_{1}$ -distance). Let $d$ denote $\|f^{*}-p\|_{1}$ : the $\ell_{1}$ -distance between $f^{*}$ and $p$ . Note that $f^{*}$ is not necessarily a probability distribution which implies that $d$ can be smaller than $\epsilon$ . To prove the above inequality, we will use $d$ as an intermediate variable which is in between the left hand side and the right hand side of the above inequality. Specifically, it suffices to prove the following:

(i) $d\geq\epsilon/2$ ;

(ii) there exists a matching $M$ on the transitive closure of $G$ such that $\sum_{(u,v)\in M}\left(p(u)-p(v)\right)=d$ .

Proof of Item (i): To show that $d$ is at least $\epsilon/2$ , we prove that the monotone distribution $p_{f^{*}}$ , obtained by normalizing $f^{*}$ , is at most $2d$ -far from $p$ . Since any monotone distribution is at least $\epsilon$ -far from $p$ in $\ell_{1}$ -distance , we will have $\epsilon\leq\|p-p_{f^{*}}\|_{1}\leq 2d$ , establishing the desired claim.

First, note that if $f^{*}(v)$ is zero for all $v$ , then by definition $d$ is at least $\epsilon/2$ :

$$d=\|p-f^{*}\|_{1}=\|p\|_{1}=1\geq\epsilon/2,$$

where the inequality holds since the $\ell_{1}$ -distance between two distributions is always at most $2$ , so $\epsilon$ is as well. Hence, assume $f^{*}$ is not a zero function for the rest of the proof.

Also, note that $f^{*}$ is a non-negative function. We prove the non-negativity of $f^{*}$ by contradiction: assume $f^{*}(v)$ is negative for some $v$ . Consider a non-negative function $f(v)=\max\{f^{*}(v),0\}$ . It is not hard to see that $f$ is monotone due to monotonicity of $f^{*}$ . For every $v$ for which $f^{*}(v)<0$ , we have

$$|p(v)-f(v)|=p(v)<p(v)-f^{*}(v)=|p(v)-f^{*}(v)|.$$

Since $f^{*}(v)=f(v)$ everywhere else, $\|p-f\|_{1}=\sum_{v\in V}|p(v)-f(v)|<\sum_{v\in V}|p(v)-f^{*}(v)|=\|p-{f^{*}}\|_{1}$ when $f^{*}$ contains some negative entry. This contradicts the fact that $f^{*}$ was the closest monotone function to $p$ , hence $f^{*}(v)$ has to be non-negative for all $v$ ’s.

Consider $p_{f^{*}}(v)=f^{*}(v)/\sum_{u}f^{*}(u)$ ; it follows that $p_{f^{*}}$ is a well-defined monotone distribution. Then,

[TABLE]

Thus, Item (i) is proved.

Proof of Item (ii): We leverage the duality theorem in linear programming. We write an LP that optimizes over all monotone functions $f$ to find the function $f^{*}$ closest to $p$ under the $\ell_{1}$ -distance. Let $x(v)$ be the variable that indicates the amount of perturbation at vertex $v$ that is needed to make $p$ monotone. For an edge $(u,v)$ , the monotonicity constraint requires that $f(v)=p(v)+x(v)$ is at least $f(u)=p(u)+x(u)$ , or equivalently,

[TABLE]

Given this inequality, we can find the monotone function closest to $p$ by solving the following linear program:

[TABLE]

We denote the optimal solution for $\mathrm{\mathbf{LP3}}$ by $x^{*}(v)\coloneqq f^{*}(v)-p(v)$ , and the corresponding optimal value of the objective function by $d\coloneqq\|p-{f^{*}}\|_{1}$ .

To obtain the dual of $\mathrm{\mathbf{LP3}}$ , we write down its standard form by substituting $x(v)$ by $x^{+}(v)-x^{-}(v)$ as follows:

[TABLE]

Then $\mathrm{\mathbf{LP4}}$ has the following dual:

[TABLE]

By strong duality, the optimal value of $\mathrm{\mathbf{LP5}}$ is equal to the optimal value of $\mathrm{\mathbf{LP3}}$ , namely $d$ . Moreover, an optimal solution of $\mathrm{\mathbf{LP5}}$ helps us find a matching that satisfies the property in Item (ii). The constraints of $\mathrm{\mathbf{LP5}}$ can be written in the form $Ay\leq b$ and $y\geq 0$ . Since $A$ is a totally unimodular matrix by Lemma 6.2 (proved below), the LP admits an optimal solution that is also integral.

Let $y^{*}$ denote an integral optimal solution of $\mathrm{\mathbf{LP5}}$ , and let $S$ be a multi-set of edges containing $y^{*}(u,v)$ copies of each edge $(u,v)$ . Define the weight of each edge $(u,v)$ as $w(u,v)\coloneqq p(u)-p(v)$ , and let the weight of a set $S$ be the sum of the weights of the edges in $S$ . Thus:

[TABLE]

We construct a matching $M$ with $w(M)=w(S)$ , which completes the proof of Item (ii). By the constraints of $\mathrm{\mathbf{LP5}}$ , $S$ forms a subgraph of $G$ (possibly with multi-edges) such that the absolute difference between the number of incoming edges and the number of outgoing edges at each vertex is at most one. Hence, we can decompose $S$ into paths and cycles.

Consider a path $P=\langle v_{1},v_{2},\ldots,v_{k}\rangle$ . Observe that the weight of a path only depends on its endpoints:

$$w(P)=\sum_{i=1}^{k-1}\left(p(v_{i})-p(v_{i+1})\right)=p(v_{1})-p(v_{k})=w(v_{1},v_{k}).$$

Note that the edge $(v_{1},v_{k})$ does not necessarily belong to $E$ , but since $v_{1}$ and $v_{k}$ are the endpoints of a path $P$ , the edge $(v_{1},v_{k})$ is contained in the transitive closure of $G$ .

By the above equation, if we replace the edges of $P$ in $S$ by a single edge $(v_{1},v_{k})$ , then $w(S)$ remains unchanged. We can also remove all cycles without changing $w(S)$ since the weight of a cycle is always zero. Lastly, we may also join paths so that their endpoints are all distinct (since the difference between the in-degree and the out-degree of any vertex is at most one). After this process, we eventually obtain a matching $M$ on the transitive closure of $G$ such that

[TABLE]

concluding the proof of Item (ii) and this lemma. ∎
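The two facts used in the decomposition step, namely that the weight of a path depends only on its endpoints and that a cycle has weight zero, can be checked numerically. The following is a minimal sketch using the edge weight $w(u,v)=p(u)-p(v)$ from the proof; the helper names `edge_weight` and `walk_weight` are hypothetical, and the example values are ours.

```python
from fractions import Fraction


def edge_weight(p, u, v):
    """Weight w(u, v) = p(u) - p(v) of a directed edge, as in the proof."""
    return p[u] - p[v]


def walk_weight(p, walk):
    """Total weight of the consecutive edges along a sequence of vertices."""
    return sum((edge_weight(p, a, b) for a, b in zip(walk, walk[1:])),
               Fraction(0))


p_walk = {1: Fraction(2, 5), 2: Fraction(1, 10), 3: Fraction(3, 10), 4: Fraction(1, 5)}

# A path <v1, v2, v3, v4>: its weight telescopes to p(v1) - p(v4).
path = [1, 2, 3, 4]
path_weight = walk_weight(p_walk, path)

# A closed walk returns to its start, so its weight telescopes to zero.
closed_walk = [1, 2, 3, 1]
closed_weight = walk_weight(p_walk, closed_walk)
```

Replacing the path by the single edge $(v_{1},v_{4})$ therefore leaves the total weight unchanged, which is exactly what the shortcutting argument needs.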

Lemma 6.2.

The matrix $A$ , namely the coefficient matrix of $\mathrm{\mathbf{LP5}}$ when the constraints are written in the form $Ay\leq b$ and $y\geq 0$ , is a totally unimodular matrix.

Proof.

We arrange the rows of $A$ so that the two constraints of each vertex $v_{i}$ occupy two consecutive rows $2i-1$ and $2i$ for $i=1,\ldots,n$ , and that each column $j$ corresponds to the edge $e_{j}=(u_{j},u^{\prime}_{j})$ for $j=1,\ldots,|E|$ . Then, each entry of $A$ can be described as follows:

[TABLE]

To prove that $A$ is a totally unimodular matrix, we make use of the following theorem.

Theorem 6.3 (Ghouila-Houri Characterization [GH62]).

An integral $m\times n$ matrix $A$ is a totally unimodular matrix if and only if, for any non-empty subset of rows, namely $R$ , there exists a disjoint partition of $R$ into $R_{1}$ and $R_{2}$ , such that the following is true.

[TABLE]

Here, for each non-empty subset $R\subseteq[2n]$ , we explicitly define $R_{1}$ and $R_{2}$ according to the following three conditions. (1) If both $2i-1$ and $2i$ are in $R$ , put both of them in $R_{1}$ . (2) If only $2i-1$ is in $R$ , then put $2i-1$ in $R_{1}$ . (3) If only $2i$ is in $R$ , then put $2i$ in $R_{2}$ .

Consider column $j$ corresponding to $e_{j}=(v_{r},v_{r^{\prime}})$ . This column has four non-zero entries:

[TABLE]

If both $2r-1$ and $2r$ appear in $R$ , or both of them are not in $R$ , clearly Equation 5 holds (similarly for $2r^{\prime}-1$ and $2r^{\prime}$ ). Thus, assume that exactly one of two rows $2r-1$ and $2r$ , and exactly one of the two rows $2r^{\prime}-1$ and $2r^{\prime}$ , are in $R$ . It is not hard to see that if the corresponding entries $A_{i,j}$ ’s in these rows have the same sign, then one row ends up in $R_{1}$ and the other row ends up in $R_{2}$ . If the entries have different signs, then both rows end up in the same set $R_{1}$ or $R_{2}$ . In both of these cases, the sum in Equation 5 becomes zero. Hence, the proof is complete. ∎
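For small matrices, the Ghouila-Houri condition can be verified exhaustively. The sketch below assumes the standard form of the criterion: every subset $R$ of rows admits a partition into $R_{1}$ and $R_{2}$ such that, in every column, the sum over $R_{1}$ minus the sum over $R_{2}$ lies in $\{-1,0,1\}$. The function name `ghouila_houri_holds` is hypothetical, and the example matrices are illustrative; they are not the coefficient matrix of $\mathrm{\mathbf{LP5}}$.

```python
from itertools import combinations, product


def ghouila_houri_holds(A):
    """Exhaustively check the Ghouila-Houri row-partition condition for A."""
    m = len(A)
    n = len(A[0]) if A else 0
    for size in range(1, m + 1):
        for R in combinations(range(m), size):
            # signs[k] = +1 places row R[k] in R1, and -1 places it in R2.
            if not any(
                all(abs(sum(s * A[i][j] for s, i in zip(signs, R))) <= 1
                    for j in range(n))
                for signs in product((1, -1), repeat=size)
            ):
                return False
    return True


# Vertex-edge incidence matrix of the directed path a -> b -> c.
incidence = [[1, 0],
             [-1, 1],
             [0, -1]]

# A matrix with determinant 2, hence not totally unimodular.
not_tu = [[1, 1],
          [-1, 1]]
```

The check enumerates all row subsets and all sign patterns, so it is only practical for a handful of rows; the proof above avoids this by giving the partition explicitly.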

7 Algorithms with Sublinear Sample Complexity

In this section, we provide sublinear sample complexity algorithms for testing bigness, and testing monotonicity of distributions over different poset domains. See Section 3.4 for proof overviews.

7.1 An Algorithm for Bigness Testing

We give an algorithm for the bigness testing problem that requires a sublinear number of samples. For a distribution to be $T$ -big, the probability of every domain element must be at least the threshold $T$ . The high-level idea is to learn the histogram of the distribution using a result from [VV17]. Given the histogram, if the weight of the elements that are below the threshold is less than $\Theta(\epsilon)$ , we accept the distribution; otherwise, we reject.

First, we define the histogram of a distribution.

Definition 7.1.

For a distribution $p$ , we define $h_{p}:(0,1]\rightarrow\mathbb{N}\cup\{0\}$ to be the histogram of $p$ if and only if for all $x\in(0,1]$ , $h_{p}(x)$ is the number of domain elements $i$ such that $p(i)$ is equal to $x$ .

Let $\pi:[n]\rightarrow[n]$ be a permutation of the domain elements. We define $p^{(\pi)}$ to be the permutation of $p$ according to $\pi$ such that for all domain elements $i$ , $p^{(\pi)}(i)$ is equal to $p(\pi(i))$ . From the definition, it is not hard to see that a permutation does not change the number of domain elements with a given probability, so $h_{p}$ and $h_{p^{(\pi)}}$ are the same. Hence, when we learn the histogram of $p$ , we can claim that we learn $p$ up to a permutation.
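The permutation invariance of the histogram is easy to see on a small example. The following minimal sketch represents a distribution as a list of exact fractions over a 0-indexed domain; the helper names `histogram` and `permute` are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def histogram(p):
    """h_p(x): number of domain elements whose probability is exactly x > 0."""
    return Counter(x for x in p if x > 0)


def permute(p, pi):
    """Return p^(pi), where p^(pi)(i) = p(pi(i)) and pi is a list of indices."""
    return [p[pi[i]] for i in range(len(p))]


p_hist = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4), Fraction(0)]
pi_example = [2, 0, 3, 1]  # an arbitrary permutation of {0, 1, 2, 3}

h_original = histogram(p_hist)
h_permuted = histogram(permute(p_hist, pi_example))
```

Both histograms record one element of probability $1/2$ and two elements of probability $1/4$, regardless of which labels carry those probabilities.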

For learning, we use a result from [VV17] for learning discrete distributions up to a permutation of the domain elements. Combining Theorem 1.11 of [VV17] with Fact 1 of [VV16] yields the following theorem:

Theorem 7.2 ([VV17, VV16]).

There exists an algorithm that, given $O\left(\frac{n}{\epsilon^{2}\log n}\right)$ i.i.d. samples from an unknown distribution $p$ , outputs an explicit description of a distribution, namely $q$ , such that there exists a permutation $\pi:[n]\rightarrow[n]$ where $\sum_{i\in[n]}|p(i)-q(\pi(i))|\leq\epsilon$ with success probability $2/3$ .

This theorem implies the following upper bound for bigness testing.

Corollary 7.3.

For bigness threshold $T\leq 1/n$ , there exists an algorithm that distinguishes whether a distribution $p$ is $T$ -big or $\epsilon$ -far from $T$ -big with success probability $2/3$ using $O(\frac{n}{\epsilon^{2}\log n})$ i.i.d. samples from $p$ .

Proof.

We refer to Algorithm 1 for the outline of our procedure. Let $q$ denote the distribution output by the “learner” as promised by Theorem 7.2 with distance parameter $\epsilon^{\prime}=\epsilon/3$ . Let $\pi$ be the permutation guaranteed by Theorem 7.2. We define $q^{\prime}$ to be the distribution obtained by permuting the elements of $q$ according to $\pi$ , that is, $q^{\prime}(i)=q(\pi(i))$ for each domain element $i$ . Hence, with probability at least 2/3, $d_{TV}(p,q^{\prime})$ is at most $\epsilon^{\prime}$ . Note that $\pi$ is not known to the algorithm; it is used only in the analysis.

Now, we have the following two cases: If $p$ is $T$ -big, then

[TABLE]

On the other hand, if $p$ is $\epsilon$ -far from $T$ -big, then

[TABLE]

That is, $q$ offers us a condition for $T$ -bigness testing by simply measuring its distance to $T$ -bigness (the if condition of Algorithm 1). Therefore, Algorithm 1 outputs the correct answer with probability at least 2/3. Note that learning $p$ using parameter $\epsilon^{\prime}=\Theta(\epsilon)$ does not change the asymptotic sample complexity, so the proof is complete. ∎
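As an illustration of the final step, the sketch below computes the total mass by which a learned distribution $q$ falls short of the threshold $T$. When $T\leq 1/n$ , this deficit equals the total variation distance from $q$ to the nearest $T$ -big distribution, since the elements above $T$ carry enough excess mass to cover it. Whether Algorithm 1 thresholds exactly this quantity, and at which constant multiple of $\epsilon$ , is an assumption of the sketch; the name `bigness_deficit` is hypothetical.

```python
from fractions import Fraction


def bigness_deficit(q, T):
    """Total mass that must be added to the elements below T to reach T."""
    return sum((T - x for x in q if x < T), Fraction(0))


T_example = Fraction(1, 4)  # satisfies T <= 1/n for n = 4

q_far = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
q_big = [Fraction(1, 4)] * 4

deficit_far = bigness_deficit(q_far, T_example)
deficit_big = bigness_deficit(q_big, T_example)
```

The deficit depends only on the multiset of probabilities, so it can be computed from a description of $q$ that is correct only up to a permutation of the domain.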

7.2 An Algorithm for Testing Monotonicity on Matchings

We give an algorithm with sublinear sample complexity for testing monotonicity on matchings. As in the previous section, we use a result from [VV17], this time for learning the distribution histogram of a pair of distributions. First, we need the following definitions (see also Definition 5.2 and Definition 5.4 of [VV17]). A distribution histogram of a pair of distributions is a function that counts the number of elements with a given probability mass $x$ in the distribution $p_{1}$ and $y$ in the distribution $p_{2}$ . More formally, we have the following definition:

Definition 7.4 ([VV17]).

For a pair of distributions $p_{1}$ and $p_{2}$ , we say $h_{p_{1},p_{2}}:[0,1]^{2}\setminus\{(0,0)\}\rightarrow\mathbb{N}\cup\{0\}$ is the distribution histogram of $p_{1}$ and $p_{2}$ if and only if for any $(x,y)$ in the domain: $h_{p_{1},p_{2}}(x,y)=|\{a:p_{1}(a)=x,p_{2}(a)=y\}|$ .

We will use this two-dimensional histogram to represent the histogram of a distribution over a matching of size $n$ : Let $p_{1}$ and $p_{2}$ be the two distributions that $p$ imposes on the top and the bottom vertices in the matching, respectively. Without loss of generality, assume the edges in the matching connect the $i$ -th vertex in the bottom to the $i$ -th vertex in the top. Note that $h_{p_{1},p_{2}}(x,y)$ counts the number of domain elements $a\in[n]$ such that $p_{1}(a)=x$ and $p_{2}(a)=y$ . Hence, $\int_{x=0}^{1}\int_{y=0}^{1}h_{p_{1},p_{2}}(x,y)dy\,dx$ is the number of matched pairs of vertices with at least one non-zero probability vertex. Since the sum of probabilities according to $p_{1}$ is one, we have $\int_{x=0}^{1}\int_{y=0}^{1}x\cdot h_{p_{1},p_{2}}(x,y)dy\,dx=1$ . The same holds for $p_{2}$ : $\int_{x=0}^{1}\int_{y=0}^{1}y\cdot h_{p_{1},p_{2}}(x,y)dy\,dx=1$ .

Now, we define the distance between the histograms $h$ and $g$ of two distributions. At a high level, the distance between two histograms is the minimum cost one needs to pay to “transform” $h$ into $g$ . In particular, we transform one histogram into another by moving mass from one point to another: By moving mass $c$ from $(x,y)$ to $(x^{\prime},y^{\prime})$ , we obtain another histogram $h^{\prime}$ , such that $h^{\prime}(x,y)=h(x,y)-c$ , $h^{\prime}(x^{\prime},y^{\prime})=h(x^{\prime},y^{\prime})+c$ and for all other points in $[0,1]^{2}$ , $h$ and $h^{\prime}$ are equal. The cost of this move is $c\cdot(|x-x^{\prime}|+|y-y^{\prime}|)$ . More formally, we have the following definition.

Definition 7.5 ([VV17]).

For a pair of functions $h,g:[0,1]^{2}\setminus\{(0,0)\}\rightarrow\mathbb{N}\cup\{0\}$ , we define the distance notation $W(h,g)$ as the minimum cost over all mass moving schemes with finitely many steps for turning $h$ into $g$ , where the cost for moving value $c>0$ from point $(x,y)$ to $(x^{\prime},y^{\prime})$ is $c(|x-x^{\prime}|+|y-y^{\prime}|)$ . Note that we assume that $\sum_{x,y}h(x,y)=\sum_{x,y}g(x,y)$ , where extra value at point $(0,0)$ on $h$ or $g$ may be added to ensure this equality.
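A single step of a mass-moving scheme can be sketched as follows. A histogram is stored as a `Counter` keyed by points $(x,y)$ ; the function name `apply_move` is hypothetical, and the sketch only evaluates the cost of a given step. It does not search for the minimum-cost scheme that defines $W$ .

```python
from collections import Counter
from fractions import Fraction


def apply_move(h, c, src, dst):
    """Move value c from point src to point dst; return (new histogram, cost)."""
    (x, y), (x2, y2) = src, dst
    if h[src] < c:
        raise ValueError("not enough mass at the source point")
    new_h = Counter(h)
    new_h[src] -= c
    new_h[dst] += c
    cost = c * (abs(x - x2) + abs(y - y2))
    # Unary plus drops points whose count has become zero.
    return +new_h, cost


half, quarter = Fraction(1, 2), Fraction(1, 4)

# Histogram of the pair p1 = (1/2, 1/2), p2 = (1/4, 3/4).
h_start = Counter({(half, quarter): 1, (half, 3 * quarter): 1})

# Move one unit from (1/2, 1/4) to (1/2, 1/2); only the y-coordinate changes.
h_after, move_cost = apply_move(h_start, 1, (half, quarter), (half, half))
```

Intermediate histograms such as `h_after` need not correspond to a pair of distributions; only the endpoints of a scheme are required to.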

Let $p^{(\pi)}$ be the permuted distribution of $p$ according to the permutation $\pi$ of $[n]$ such that for each domain element $i$ , $p^{(\pi)}(i)=p(\pi(i))$ . Note that as long as we permute $p_{1}$ and $p_{2}$ with the same permutation, the distribution histograms $h_{p_{1},p_{2}}$ and $h_{p_{1}^{(\pi)},p_{2}^{(\pi)}}$ are the same. Moreover, given $h_{p_{1},p_{2}}$ one can construct $q_{1}$ and $q_{2}$ such that there exists a permutation $\pi$ for which $q_{1}$ and $q_{2}$ are the permuted versions of $p_{1}$ and $p_{2}$ according to $\pi$ .

We relate the distance $W$ to the total variation distance in the following Lemma. In particular, the distance $W$ between two distribution histograms $h_{p_{1},p_{2}}$ , $h_{p^{\prime}_{1},p^{\prime}_{2}}$ defined according to two pairs of distributions $(p_{1},p_{2})$ , $(p^{\prime}_{1},p^{\prime}_{2})$ upper bounds the $\ell_{1}$ -distance up to a permutation of the labels of the domain elements.

Lemma 7.6.

Let functions $h_{p_{1},p_{2}}$ , $h_{p^{\prime}_{1},p^{\prime}_{2}}$ be defined according to two pairs of probability vectors $(p_{1},p_{2})$ , $(p^{\prime}_{1},p^{\prime}_{2})$ . There exists a permutation $\pi$ of $[n]$ such that

[TABLE]

Proof.

According to the definition of the distance $W$ , there exists a moving scheme consisting of a sequence of $R$ steps, denoted by $\langle(c_{r},(x_{r},y_{r}),(x^{\prime}_{r},y^{\prime}_{r}))\rangle_{r\in[R]}$ (with $c_{r}>0$ ), describing the changes that eventually turn $h_{p_{1},p_{2}}$ into $h_{p^{\prime}_{1},p^{\prime}_{2}}$ , where at step $r$ we move mass $c_{r}$ from the source $(x_{r},y_{r})$ to the sink $(x^{\prime}_{r},y^{\prime}_{r})$ . We claim that if the scheme has minimum cost, $W(h_{p_{1},p_{2}},h_{p^{\prime}_{1},p^{\prime}_{2}})$ , then without loss of generality we may make the following assumptions about the scheme: (1) There are no two steps $r_{1}$ and $r_{2}$ such that $(x^{\prime}_{r_{1}},y^{\prime}_{r_{1}})$ is the same as $(x_{r_{2}},y_{r_{2}})$ . (2) All the $c_{r}$ ’s are positive integers.

To see why (1) is true, assume otherwise; if $r_{1}=r_{2}$ , then $(x^{\prime}_{r_{1}},y^{\prime}_{r_{1}})=(x_{r_{2}},y_{r_{2}})$ means that the source and the sink in step $r_{1}$ are the same, so no mass is actually moved. Hence, we can just remove this step without changing the scheme. If $r_{1}\neq r_{2}$ , then $(x^{\prime}_{r_{1}},y^{\prime}_{r_{1}})=(x_{r_{2}},y_{r_{2}})$ means that mass of quantity $\min(c_{r_{1}},c_{r_{2}})$ is first moved from $(x_{r_{1}},y_{r_{1}})$ to $(x^{\prime}_{r_{1}},y^{\prime}_{r_{1}})$ , and then moved from $(x^{\prime}_{r_{1}},y^{\prime}_{r_{1}})$ to $(x^{\prime}_{r_{2}},y^{\prime}_{r_{2}})$ . Clearly, one can move the same quantity of mass from $(x_{r_{1}},y_{r_{1}})$ to $(x^{\prime}_{r_{2}},y^{\prime}_{r_{2}})$ directly with no larger cost, making one of the steps $r_{1}$ or $r_{2}$ vacuous.

Given (1), we now show that (2) also holds: Note that given (1), each point $(x,y)$ may appear in the steps as either a source or a sink, but not both. Moreover, the order of the steps does not matter, since the source always has the capacity for providing the mass. If there are several steps that move mass between the same source and the same sink, one can replace all of them with one step moving the total quantity of mass moved between them. Now, we can assume that between each source and each sink there is a well-defined quantity indicating how much mass we moved from the source to the sink. This fact helps us form a graph whose vertices are the sources and the sinks that appear in the scheme. We put a directed edge from a source to a sink if we moved a non-integer mass from the source to the sink. We assign to the edge a weight equal to the fractional part of the mass we moved from the source to the sink. We propose the following process for changing the steps, in which each change removes at least one edge from the graph. We repeat the process until no edge remains, which ensures that all $c_{r}$ ’s are integers.

Remove sources or sinks with no edge. Clearly, the graph is bipartite, and all the edges are from sources to sinks. Since $h_{p_{1},p_{2}}$ and $h_{p^{\prime}_{1},p^{\prime}_{2}}$ are integer-valued, the final mass at each source and sink will eventually be an integer. Hence, each source has an out-degree of at least two and each sink has an in-degree of at least two. Therefore, the graph has an undirected cycle of even length. Let $S$ and $T$ be the sets of the sources and the sinks involved in the cycle, respectively. Let $E_{1}$ and $E_{2}$ be a partition of the edges in the cycle such that every other edge is in the same set. Clearly, each source (and sink) has exactly one edge in $E_{1}$ and one edge in $E_{2}$ . As defined before, the cost of moving one unit of mass via an edge from $(x,y)$ to $(x^{\prime},y^{\prime})$ is $|x-x^{\prime}|+|y-y^{\prime}|$ . We define the cost of $E_{1}$ (and $E_{2}$ ) to be the total cost of edges in $E_{1}$ (and $E_{2}$ ). Without loss of generality, assume the cost of $E_{1}$ is not greater than the cost of $E_{2}$ . Let $c^{*}$ be the minimum weight of edges in $E_{2}$ . We modify the steps such that each step with a corresponding edge in $E_{2}$ moves $c^{*}$ less mass, and each step with a corresponding edge in $E_{1}$ moves $c^{*}$ more mass. Clearly, this process does not increase the total cost of the scheme. However, it makes the fractional part of at least one step equal to zero. We repeat this process until no such step exists, which concludes the proof of (2).

Let $h^{(0)},h^{(1)},\ldots,h^{(R)}$ be the series of the distribution histograms which is generated during the mass moving scheme after each step. $h^{(0)}$ is the distribution histogram we start with, $h_{p_{1},p_{2}}$ , and $h^{(R)}$ is the final distribution histogram $h_{p^{\prime}_{1},p^{\prime}_{2}}$ . Now, we create a sequence of pairs of vectors $p^{(r)}_{1},p^{(r)}_{2}:[n]\rightarrow[0,1]$ such that $h^{(r)}=h_{p^{(r)}_{1},p^{(r)}_{2}}$ (under the same definition of distribution histogram, relaxed to allow non-distributions $p^{(r)}_{1},p^{(r)}_{2}$ ). We start off with $p^{(0)}_{1}$ and $p^{(0)}_{2}$ being $p_{1}$ and $p_{2}$ . Given $p^{(r-1)}_{1},p^{(r-1)}_{2}$ , we obtain $p^{(r)}_{1},p^{(r)}_{2}$ as follows.

Consider step $r$ described as $(c_{r},(x_{r},y_{r}),(x^{\prime}_{r},y^{\prime}_{r}))$ with an integer $c_{r}$ . Inductively, assume $h^{(r-1)}=h_{p^{(r-1)}_{1},p^{(r-1)}_{2}}$ which implies that $p^{(r-1)}_{1}$ and $p^{(r-1)}_{2}$ contain at least $c_{r}\leq h^{(r-1)}(x_{r},y_{r})$ entries $i$ with $p^{(r-1)}_{1}(i)=x_{r}$ and $p^{(r-1)}_{2}(i)=y_{r}$ . To apply step $r$ , we pick an arbitrary set $I_{r}$ of $c_{r}$ many such entries, then modify the entries $p^{(r-1)}_{1}(i)$ and $p^{(r-1)}_{2}(i)$ from $x_{r}$ and $y_{r}$ to $x^{\prime}_{r}$ and $y^{\prime}_{r}$ respectively for each $i\in I_{r}$ . That is, $p^{(r)}_{1}(i)=x^{\prime}_{r}$ and $p^{(r)}_{2}(i)=y^{\prime}_{r}$ for $i\in I_{r}$ , and $p^{(r)}_{1}(i)=p^{(r-1)}_{1}(i)$ and $p^{(r)}_{2}(i)=p^{(r-1)}_{2}(i)$ for $i\notin I_{r}$ . Hence, the $\ell_{1}$ -distance incurred by step $r$ becomes:

[TABLE]

By summing over all $R$ steps, and applying the triangle inequality, we have:

[TABLE]

Now it remains to show that there exists a permutation $\pi$ that maps the labels of the given distribution $p^{\prime}_{1},p^{\prime}_{2}$ to our constructed vectors $p^{(R)}_{1},p^{(R)}_{2}$ ; namely, $p^{\prime(\pi)}_{1}=p^{(R)}_{1}$ and $p^{\prime(\pi)}_{2}=p^{(R)}_{2}$ . Indeed, $h_{p^{\prime}_{1},p^{\prime}_{2}}$ is the distribution histogram that counts the number of indices $i$ with $p^{\prime}_{1}(i)=x$ and $p^{\prime}_{2}(i)=y$ , so $h_{p^{\prime}_{1},p^{\prime}_{2}}=h_{p^{(R)}_{1},p^{(R)}_{2}}$ implies that for every $(x,y)$ , there are equally many indices $i^{\prime}$ with $p^{(R)}_{1}(i^{\prime})=x$ and $p^{(R)}_{2}(i^{\prime})=y$ . Hence, there exists a bijection between their indices that maps $i^{\prime}$ ’s to $i$ ’s and vice versa, concluding the lemma. ∎

Next, we state the result of [VV17] for learning the distribution histogram of a pair of distributions.

Theorem 7.7 (Theorem 5.6 of [VV17]).

There exists an algorithm that, given $O\left(\frac{n}{\epsilon^{2}\log n}\right)$ i.i.d. samples each from a pair of unknown distributions $p_{1}$ and $p_{2}$ , outputs a function $g$ such that $W(h_{p_{1},p_{2}},g)\leq\epsilon$ with success probability $2/3$ .

We now prove the upper bound for the monotonicity testing problem over the matching poset.

Theorem 7.8.

For sufficiently small positive constant $\epsilon$ , there exists an algorithm that distinguishes whether a distribution $p$ over the vertex set $V=S\cup T$ of a directed matching $M_{n}$ on $2n$ vertices is monotone or $\epsilon$ -far from monotone with success probability $2/3$ using $O(\frac{n}{\epsilon^{2}\log n})$ i.i.d. samples from $p$ .

Proof.

For clarity, denote the graph by $G=(V,E)$ with edge set $E=\{(u_{i},v_{i})\}_{i\in[n]}$ and vertex set $V=S\cup T$ , where $S=\{u_{i}\}_{i\in[n]}$ and $T=\{v_{i}\}_{i\in[n]}$ . For a distribution $p$ over $V=S\cup T$ , let $p_{S}$ and $p_{T}$ denote the probability mass $p$ places on elements of $S$ and $T$ ; note that $p_{S}$ and $p_{T}$ are functions on domain $S$ and $T$ , but generally not probability distributions.

The outline of our algorithm is given as Procedure $\textsc{Monotonicity-Testing-over-}M_{n}$ in Algorithm 2. In our algorithm, we hope to invoke Theorem 7.7 by considering the (normalized) $p_{S}$ and $p_{T}$ as our $p_{1}$ and $p_{2}$ , respectively. However, Theorem 7.7 requires roughly the same number of samples from both $p_{1}$ and $p_{2}$ , while $p_{S}$ and $p_{T}$ may have vastly different total probability masses; for instance, it may be costly to try to obtain many samples from $S$ .

Before we proceed, by Theorem 3.3, it is straightforward to see:

[TABLE]

In order to make the probability of the top and the bottom vertices at least a constant, we define an auxiliary probability distribution $p^{\prime}$ obtained by averaging $p$ with a monotone distribution: $p^{\prime}(w)=p(w)/2+1/(4n)$ where $w\in V$ . Clearly, if $p$ is monotone, then $p^{\prime}$ is monotone too. Also, if $p$ is $\epsilon$ -far from monotone, then observe that the distance of $p^{\prime}$ to monotone is

[TABLE]

which preserves the distance to monotonicity up to a factor of $4$ . We can generate samples from $p^{\prime}$ using asymptotically the same number of samples from $p$ : A sample from $p^{\prime}$ is obtained by drawing a sample from $p$ or drawing a uniform random vertex with probability $1/2$ each (Procedure $\textsc{Sample-from-}p^{\prime}$ in Algorithm 2); henceforth, we consider the problem of testing $p^{\prime}$ for monotonicity with distance $\epsilon/4$ instead.

The main benefit for considering the monotonicity testing on $p^{\prime}$ instead of $p$ is that the total amount of probability masses placed on $S$ and on $T$ are at least $1/4=\Omega(1)$ each. Hence, it takes $\Theta(s)$ samples from $p$ according to the procedure above to obtain at least $s$ samples from each of $S$ and $T$ with good constant probability; that is, we can create our input for the algorithm in Theorem 7.7 using $\Theta(s)$ i.i.d. samples from $p$ .
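The mixing step can be sketched as follows. The code builds $p^{\prime}(w)=p(w)/2+1/(4n)$ explicitly and gives a sampler in the spirit of Procedure $\textsc{Sample-from-}p^{\prime}$ ; Algorithm 2 itself is not reproduced here, so the function names `mix_with_uniform` and `sample_from_p_prime`, the dictionary representation of $p$ , and the callback `draw_from_p` are all assumptions made for illustration.

```python
import random
from fractions import Fraction


def mix_with_uniform(p):
    """Return p'(w) = p(w)/2 + 1/(4n) for a distribution p on 2n vertices."""
    n = len(p) // 2
    return {w: Fraction(mass) / 2 + Fraction(1, 4 * n) for w, mass in p.items()}


def sample_from_p_prime(draw_from_p, vertices, rng):
    """With probability 1/2 return a sample from p, else a uniform vertex."""
    if rng.random() < 0.5:
        return draw_from_p()
    return rng.choice(vertices)


# Extreme case on a matching with n = 2: all mass sits on one top vertex.
p_extreme = {("u", 0): 0, ("u", 1): 0, ("v", 0): 1, ("v", 1): 0}
p_mixed = mix_with_uniform(p_extreme)
mass_on_S = sum(m for w, m in p_mixed.items() if w[0] == "u")
mass_on_T = sum(m for w, m in p_mixed.items() if w[0] == "v")
```

Even in this extreme case the bottom side $S$ retains mass exactly $1/4$ under $p^{\prime}$ , which is what makes it cheap to collect enough samples from both sides.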

Denote by $w_{S},w_{T}$ the total probability masses that $p^{\prime}$ places on $S$ and $T$ , respectively. Let $p^{\prime}_{S}$ and $p^{\prime}_{T}$ be the probability functions that $p^{\prime}$ assigns to the vertices of $S$ and $T$ , respectively. Let $\widetilde{p^{\prime}_{S}}$ and $\widetilde{p^{\prime}_{T}}$ be the distributions over $S$ and $T$ that are obtained by normalizing $p^{\prime}_{S}$ and $p^{\prime}_{T}$ (separately). More precisely, we have

[TABLE]

Let $\epsilon^{\prime}=\Theta(\epsilon)$ (to be determined exactly later). Invoking Theorem 7.7 with this parameter, we obtain a function $\widetilde{g}$ where $W(h_{\widetilde{p^{\prime}_{S}},\widetilde{p^{\prime}_{T}}},\widetilde{g})\leq\epsilon^{\prime}$ using $O(\frac{n}{\epsilon^{2}\log n})$ samples from $p$ .

Next, we rescale each dimension of $\widetilde{g}$ back by $w_{S}$ and $w_{T}$ , thereby obtaining our estimate of $h_{p^{\prime}_{S},p^{\prime}_{T}}$ . If we knew $w_{S}$ and $w_{T}$ exactly, we would define $g(w_{S}\cdot x,w_{T}\cdot y)=\widetilde{g}(x,y)$ , and we would have $W(h_{p^{\prime}_{S},p^{\prime}_{T}},g)\leq\epsilon^{\prime}$ . However, we can only estimate $w_{S}$ and $w_{T}$ up to an additive error $\epsilon^{\prime}$ with high constant probability using $O(1/\epsilon^{2})$ samples. To this end, let $\hat{w}_{S}$ be the estimate of $w_{S}$ , and let $\hat{w}_{T}=1-\hat{w}_{S}$ . We define $\hat{g}$ for which $\hat{g}(\hat{w}_{S}\cdot x,\hat{w}_{T}\cdot y)=\widetilde{g}(x,y)$ . Below, we show that $\hat{g}$ is a good estimation of $h_{p^{\prime}_{S},p^{\prime}_{T}}$ .
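The rescaling that defines $\hat{g}$ can be sketched directly from the identity $\hat{g}(\hat{w}_{S}\cdot x,\hat{w}_{T}\cdot y)=\widetilde{g}(x,y)$ with $\hat{w}_{T}=1-\hat{w}_{S}$ . The function name `rescale_histogram` and the `Counter` representation of a histogram are assumptions of this sketch, and the input below is an exact histogram chosen by hand rather than the output of the learner.

```python
from collections import Counter
from fractions import Fraction


def rescale_histogram(g_tilde, w_s_hat):
    """Return g_hat with g_hat(w_s_hat * x, (1 - w_s_hat) * y) = g_tilde(x, y)."""
    w_t_hat = 1 - w_s_hat
    g_hat = Counter()
    for (x, y), count in g_tilde.items():
        g_hat[(w_s_hat * x, w_t_hat * y)] += count
    return g_hat


# Histogram of the normalized pair (1/2, 1/2) on S and (1/4, 3/4) on T.
g_tilde_example = Counter({
    (Fraction(1, 2), Fraction(1, 4)): 1,
    (Fraction(1, 2), Fraction(3, 4)): 1,
})
w_s_estimate = Fraction(1, 4)
g_hat_example = rescale_histogram(g_tilde_example, w_s_estimate)

# After rescaling, the first coordinate carries total mass w_s_hat
# and the second coordinate carries total mass 1 - w_s_hat.
total_x = sum(x * c for (x, _), c in g_hat_example.items())
total_y = sum(y * c for (_, y), c in g_hat_example.items())
```

The rescaled histogram therefore describes a function on $S\cup T$ whose total mass is one, as required of an estimate of $h_{p^{\prime}_{S},p^{\prime}_{T}}$ .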

Recall that $W(h_{\widetilde{p^{\prime}_{S}},\widetilde{p^{\prime}_{T}}},\widetilde{g})\leq\epsilon^{\prime}$ . By definition, there exists a minimum-cost sequence of steps $\langle(c_{r},(x_{r},y_{r}),(x^{\prime}_{r},y^{\prime}_{r}))\rangle_{r\in[R]}$ for turning $\widetilde{g}$ into $h_{\widetilde{p^{\prime}_{S}},\widetilde{p^{\prime}_{T}}}$ :

[TABLE]

Observe that under the cost function in Definition 7.5, we may assume without loss of generality that there are no $r,r^{\prime}$ such that $(x,y)=(x^{\prime}_{r},y^{\prime}_{r})=(x_{r^{\prime}},y_{r^{\prime}})$ in the moving scheme. Namely, we can instead “shortcut” this scheme by moving the value $\min\{c_{r},c_{r^{\prime}}\}$ from $(x_{r},y_{r})$ to $(x^{\prime}_{r^{\prime}},y^{\prime}_{r^{\prime}})$ directly without leaving any extra amount at $(x,y)$ (during step $r$ ) to pick up later (during step $r^{\prime}$ ). In this moving scheme, the value of $h_{p^{r}_{S},p^{r}_{T}}$ on any $(x,y)$ must be non-increasing or non-decreasing throughout the steps $r\in[R]$ (since values are only being moved in, or only being moved out, but not a mixture of both). In particular, this condition implies that the total value of $c_{r}$ ’s moving into $(x^{\prime},y^{\prime})$ never exceeds the value of $h_{\widetilde{p^{\prime}_{S}},\widetilde{p^{\prime}_{T}}}(x^{\prime},y^{\prime})$ . More formally,

[TABLE]

Now, we are ready to bound $W(\hat{g},h_{p^{\prime}_{S},p^{\prime}_{T}})$ . By definition, we have $h_{p^{\prime}_{S},p^{\prime}_{T}}(w_{S}\cdot x,w_{T}\cdot y)=h_{\widetilde{p^{\prime}_{S}},\widetilde{p^{\prime}_{T}}}(x,y)$ and $\hat{g}(\hat{w}_{S}\cdot x,\hat{w}_{T}\cdot y)=\tilde{g}(x,y)$ . Thus, any moving scheme that turns $\tilde{g}$ into $h_{\widetilde{p^{\prime}_{S}},\widetilde{p^{\prime}_{T}}}$ will, after rescaling its points, also turn $\hat{g}$ into $h_{p^{\prime}_{S},p^{\prime}_{T}}$ . Hence, we can use the same sequence (up to scaling) that moves the mass from $\tilde{g}$ to $h_{\widetilde{p^{\prime}_{S}},\widetilde{p^{\prime}_{T}}}$ to show a bound for $W(\hat{g},h_{p^{\prime}_{S},p^{\prime}_{T}})$ : at step $r\in[R]$ , we move the value $c_{r}$ from the point $(\hat{w}_{S}\cdot x_{r},\hat{w}_{T}\cdot y_{r})$ of $\hat{g}$ to the point $(w_{S}\cdot x^{\prime}_{r},w_{T}\cdot y^{\prime}_{r})$ of $h_{p^{\prime}_{S},p^{\prime}_{T}}$ . We establish our bound as follows.

[TABLE]

Going back to our algorithm, we compute $g^{*}$ : the function minimizing $W(\hat{g},g^{*})$ that is also defined according to an actual monotone probability distribution $q^{*}$ over $V$ . Observe that if $p^{\prime}$ is monotone, then

[TABLE]

due to the optimality assumption above. On the other hand, if $p^{\prime}$ is $\epsilon/4$ -far from monotone, then by choosing $\epsilon^{\prime}=\epsilon/14$ ,

[TABLE]

for some permutation $\pi$ over $[n]$ , where $q^{*(\pi)}(u_{i})=q^{*}(u_{\pi(i)})$ and $q^{*(\pi)}(v_{i})=q^{*}(v_{\pi(i)})$ , making use of Lemma 7.6 above. Hence, $\hat{g}$ provides us with a condition for testing monotonicity over the matching poset $M_{n}$ , as desired. ∎

7.3 An Algorithm for Testing Monotonicity on Bounded Degree Bipartite Graphs with Sub-linear Sample Complexity

We give an algorithm which tests monotonicity of a distribution $p$ on a bipartite poset $G$ with sample complexity $O\left(\frac{\Delta^{3}n}{\epsilon^{2}\log n}\right)$ where $\Delta$ denotes an upper bound for the maximum degree over all vertices in $G$ . Given sample access to the distribution $p$ , we implement a sampling oracle for a certain distribution $p^{\prime}$ on a matching poset $G^{\prime}$ with $O(\Delta n)$ vertices. This distribution $p^{\prime}$ is monotone on $G^{\prime}$ if $p$ is monotone on $G$ , and $p^{\prime}$ is $\epsilon/(2\Delta)$ -far from monotone on $G^{\prime}$ if $p$ is $\epsilon$ -far on $G$ . Hence, we apply the algorithm for testing monotonicity on the matching poset $G^{\prime}$ to test the monotonicity of $p^{\prime}$ , immediately obtaining the desired sample complexity. We describe the construction of $G^{\prime}$ and the distribution $p^{\prime}$ below and show the correctness of our approach in Theorem 7.10.

More formally, let $p$ be a distribution over a directed bipartite poset $G=(V=V_{B}\cup V_{T},E)$ where $V_{B}=\{u_{i}\}_{i\in[n]}$ and $V_{T}=\{v_{i}\}_{i\in[n]}$ are the sets of the bottom and the top vertices, and $E\subseteq V_{B}\times V_{T}$ is the set of edges. Let $\Delta$ be an upper bound on the degree of $G$ .

The matching poset $G^{\prime}$ .

Based on $G$ , we create a matching $G^{\prime}=(V^{\prime}=V^{\prime}_{b}\cup V^{\prime}_{t},E^{\prime})$ over $n^{\prime}=O(\Delta n)$ vertices according to the following procedure. Similar to $G$ , $V^{\prime}_{b}$ is the set of bottom vertices, $V^{\prime}_{t}$ is the set of top vertices, and $E^{\prime}$ is the set of edges.

• Create $\Delta$ copy vertices $w^{1},\ldots,w^{\Delta}$ for each vertex $w\in V$ .

• For each edge $e=(u,v)\in E$ , match an unmatched pair of vertices $u^{i},v^{j}$ via the copy edge $e^{\prime}=(u^{i},v^{j})$ ; place $u^{i}\in V^{\prime}_{b}$ , $v^{j}\in V^{\prime}_{t}$ and $e^{\prime}\in E^{\prime}$ .

• For all remaining unmatched vertices $w^{i}$ , create a dummy vertex $\bar{w}^{i}$ , then match it to $w^{i}$ via the dummy edge $\bar{e}_{w^{i}}=(\bar{w}^{i},w^{i})$ ; place $\bar{w}^{i}\in V^{\prime}_{b}$ , $w^{i}\in V^{\prime}_{t}$ and $\bar{e}_{w^{i}}\in E^{\prime}$ . Note that the dummy vertex is always put in the bottom set.

Note that the second step above is always possible since there are at most $\Delta$ edges incident to a vertex.

Distribution $p^{\prime}$ over $G^{\prime}$ .

The distribution $p^{\prime}$ over the poset $G^{\prime}$ is defined as follows. For each copy vertex $w^{i}$ , set $p^{\prime}(w^{i})=p(w)/\Delta$ . For each dummy vertex $\bar{w}^{i}$ , set $p^{\prime}(\bar{w}^{i})=0$ . One can generate a sample from $p^{\prime}$ , by drawing a sample $w$ in $V$ according to $p$ , and drawing $i$ uniformly at random from $[\Delta]$ : The $i$ -th copy of $w$ , $w^{i}$ , is a sample drawn from $p^{\prime}$ .
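The construction of $G^{\prime}$ and $p^{\prime}$ can be sketched as follows. Copies are represented as pairs `(w, i)` and dummy vertices as triples `("dummy", w, i)`; these encodings, the function name `build_matching_poset`, and the rule of always consuming the lowest-numbered free copy are choices made for the sketch, since the text allows any unmatched pair of copies.

```python
from fractions import Fraction


def build_matching_poset(bottom, top, edges, delta, p):
    """Return (E', p') for the matching poset G' built from the bipartite G."""
    next_copy = {w: 1 for w in list(bottom) + list(top)}
    matched_edges = []
    p_prime = {}
    # Every vertex w gets delta copies (w, 1..delta), each with mass p(w)/delta.
    for w in next_copy:
        for i in range(1, delta + 1):
            p_prime[(w, i)] = Fraction(p[w]) / delta
    # Each edge (u, v) consumes one fresh copy of u and one fresh copy of v.
    for u, v in edges:
        if next_copy[u] > delta or next_copy[v] > delta:
            raise ValueError("delta is not an upper bound on the degree")
        matched_edges.append(((u, next_copy[u]), (v, next_copy[v])))
        next_copy[u] += 1
        next_copy[v] += 1
    # Each unmatched copy is paired with a zero-mass dummy placed at the bottom.
    for w, first_free in next_copy.items():
        for i in range(first_free, delta + 1):
            dummy = ("dummy", w, i)
            p_prime[dummy] = Fraction(0)
            matched_edges.append((dummy, (w, i)))
    return matched_edges, p_prime


bottom_example = ["u1", "u2"]
top_example = ["v1", "v2"]
edges_example = [("u1", "v1"), ("u1", "v2"), ("u2", "v2")]
p_bipartite = {"u1": Fraction(1, 8), "u2": Fraction(1, 8),
               "v1": Fraction(1, 4), "v2": Fraction(1, 2)}
E_prime, p_prime_example = build_matching_poset(
    bottom_example, top_example, edges_example, 2, p_bipartite)
```

In this example $p$ is monotone on $G$ , and every edge of $G^{\prime}$ , copy or dummy, has at most as much mass at its bottom endpoint as at its top endpoint.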

In the following lemma, we show that the distance of $p^{\prime}$ to being monotone is closely related to the distance of $p$ to monotonicity.

Lemma 7.9.

Let $p$ and $p^{\prime}$ be two distributions over $G$ and $G^{\prime}$ as described above. If $p$ is monotone, then $p^{\prime}$ is monotone. If $p$ is $\epsilon$-far from being monotone, then $p^{\prime}$ is $\epsilon/(2\Delta)$-far from being monotone.

Proof.

Observe that for each copy edge $e^{\prime}=(u^{i},v^{j})$ , the probabilities at the endpoints are $p^{\prime}(u^{i})=p(u)/\Delta$ and $p^{\prime}(v^{j})=p(v)/\Delta$ , respectively. Thus, if $p(u)$ is at most $p(v)$ , then $p^{\prime}(u^{i})$ will remain at most $p^{\prime}(v^{j})$ . Furthermore, for each dummy edge $\bar{e}_{w^{i}}=(\bar{w}^{i},w^{i})$ , the probability of the bottom vertex, $p^{\prime}(\bar{w}^{i})$ , is zero, so this edge never violates the monotonicity of $G^{\prime}$ . Hence it follows immediately that if $p$ is monotone on $G$ , then $p^{\prime}$ is monotone on $G^{\prime}$ as well.

On the other hand, assume $p$ is $\epsilon$ -far from being monotone. We define a weighted graph on the transitive closure of $G$ , $TC(G)$ , where the weight of each edge $(u,v)$ is $\max(p(u)-p(v),0)$ . By the proof of Theorem 3.3, $TC(G)$ has a weighted matching, namely $M$ , of weight $W$ such that

[TABLE]

Since $G$ is a bipartite poset and the edges are all from $V_{B}$ to $V_{T}$, $TC(G)$ is the same as $G$. Hence, each edge $e=(u,v)$ in $M$ is in $E$ as well. Also, by the construction of $G^{\prime}$, there exists a copy edge $e^{\prime}=(u^{i},v^{j})\in E^{\prime}$ that corresponds to $e$. Let $M^{\prime}$ be the set of copy edges $e^{\prime}=(u^{i},v^{j})$ where $e=(u,v)$ is in $M$. Since $E^{\prime}$ itself is a matching, $M^{\prime}\subseteq E^{\prime}$ is a matching in $G^{\prime}$ as well.

Observe that by the above construction, the weight of $e^{\prime}=(u^{i},v^{j})$ is $\max(p^{\prime}(u^{i})-p^{\prime}(v^{j}),0)=\max(p(u)-p(v),0)/\Delta$. Hence, $G^{\prime}$ contains a matching, $M^{\prime}$, of weight $W/\Delta$. Let $W^{\prime}$ be the weight of the maximum matching in $G^{\prime}$, so that $W^{\prime}\geq W/\Delta$. By Theorem 3.3 and Equation 6, we obtain:

[TABLE]

Thus, if $p$ is $\epsilon$ -far from being monotone, then $p^{\prime}$ is $\epsilon/(2\Delta)$ -far from monotone as well, concluding the lemma. ∎

Given the above lemma, it is sufficient to test monotonicity of $p^{\prime}$ with proximity parameter $\epsilon^{\prime}=\epsilon/(2\Delta)$ . See Algorithm 3 for the steps. Below, we show the correctness of the algorithm.

Corollary 7.10.

There exists an algorithm that tests whether a distribution $p$ over a bipartite poset $G$ of $n$ vertices and maximum degree $\Delta$ , is monotone or $\epsilon$ -far from monotone with success probability $2/3$ , using $O(\frac{\Delta^{3}n}{\epsilon^{2}\log n})$ i.i.d. samples from $p$ .

Proof.

Given Lemma 7.9, it suffices to test the monotonicity of $p^{\prime}$ on $G^{\prime}$ with parameter $\epsilon^{\prime}=\epsilon/(2\Delta)$. Using Theorem 7.8 and since $G^{\prime}$ is a matching of size $n^{\prime}=O(\Delta n)$, one can test monotonicity of $p^{\prime}$ with high probability using $O(n^{\prime}/(\epsilon^{\prime 2}\log n^{\prime}))=O({\Delta^{3}n}/({\epsilon^{2}\log n}))$ samples as desired. Therefore, the proof is complete. ∎
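To make the arithmetic behind the $\Delta^{3}$ factor explicit, the following sketch evaluates the parameters of the reduction numerically. It is only an illustration: the function name `reduction_parameters` is hypothetical, the hidden constant of the $O(\cdot)$ bound is taken to be 1, and $n^{\prime}$ is replaced by the upper bound $2\Delta|V|$ ($\Delta$ copies per vertex and at most one dummy per copy).

```python
import math

def reduction_parameters(num_vertices, delta, eps):
    """Return (n', eps', n' / (eps'^2 log n')) for the matching reduction.

    Assumptions: constant factors are dropped, and n' is bounded by
    2 * delta * num_vertices (copies plus at most one dummy per copy).
    """
    n_prime = 2 * delta * num_vertices
    eps_prime = eps / (2 * delta)
    bound = n_prime / (eps_prime ** 2 * math.log(n_prime))
    return n_prime, eps_prime, bound
```

Substituting gives $n^{\prime}/\epsilon^{\prime 2}=2\Delta|V|\cdot 4\Delta^{2}/\epsilon^{2}=8\Delta^{3}|V|/\epsilon^{2}$: one factor of $\Delta$ comes from the size of $G^{\prime}$ and two from the smaller proximity parameter.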

7.4 Testing monotonicity of distributions that are uniform on a subset of the domain

In this section, we give an algorithm for testing monotonicity on a specific yet broad class of instances. More specifically, suppose that we are given a directed bipartite graph $G(V=V_{T}\cup V_{B},E\subseteq V_{B}\times V_{T})$, along with a probability distribution on the set $V$. Note that all the directed edges go from a vertex in the “bottom” set $V_{B}$ to a vertex in the “top” set $V_{T}$. We additionally assume that all distributions which we sample from are uniform on a subset of $V$ whose size is known to the algorithm. That is, for every vertex $u\in V$ either $p_{u}=0$ or $p_{u}=1/|R|$, where $R$ is the support of the distribution $p$.

We will show the following result:

Theorem 7.11.

Let $G$ be a directed bipartite graph as described above and $p$ be a probability distribution on $V$ which is uniform on a subset of $V$, namely $R$. Given the size of $R$, there exists an algorithm with sample complexity $O(\frac{n^{2/3}}{\epsilon}+\frac{1}{\epsilon^{2}})$ that can test, with success probability $2/3$, whether $p$ is monotone on $G$, or $p$ is $\epsilon$-far from any monotone distribution on $G$.

At a high level, our tester works as follows: We draw an initial set $\mathcal{S}_{1}$ of $s_{1}$ samples from $p$. We define $B=\mathcal{S}_{1}\cap V_{B}$ to be the set of bottom vertices that we see in the sample set. Then, we look at the set $T\subseteq V_{T}$ containing all out-neighbors of the vertices in $B$. We show the following structural property of distributions that are $\epsilon$-far from being monotone: in expectation, the constructed set $T$ contains at least $\epsilon s_{1}$ endpoints of violating edges, so $|T|$ cannot be too small. Thus, if $|T|$ is much smaller than $\epsilon s_{1}$, we can immediately conclude that the distribution is close in total variation distance to some monotone distribution. However, if $T$ is sufficiently large in cardinality, we draw more samples in order to estimate the amount of probability mass on $T$. Note that if $p$ is monotone, then all the elements in $T$ must be in the support of the distribution, namely $R$, so every single element of $T$ has probability mass $\frac{1}{|R|}$. The tester rejects if there is sufficient evidence that this is not the case. More specifically, the proposed tester is given in Algorithm 4.

Proof of Theorem 7.11: As given in the algorithm, let $s_{1}=O(\frac{n^{2/3}}{\epsilon})$ and $s_{2}=O(n^{2/3})$ denote the sample sizes of the two steps described earlier. We consider the following two cases.

Completeness case: Assume $p$ is a monotone distribution. Clearly, each sample we draw has non-zero probability. Since $T$ is the set of out-neighbors of the sampled bottom vertices, monotonicity implies that every element of $T$ has non-zero probability as well. By the uniformity assumption, each such element has probability $1/|R|$, so the total probability mass of $T$ is $|T|/|R|$. Thus, when we draw $s_{2}$ samples from the distribution, we expect a $|T|/|R|$ fraction of them to fall into $T$. So, the expected value of $|Y|$, the number of these samples that fall into $T$, is $s_{2}\cdot|T|/|R|$. We defer the asymptotic complexity analysis of this case to the end of our proof.

Soundness case: Assume $p$ is $\epsilon$ -far from being a monotone distribution. Consider all the violating edges $(u,v)$ in $E$ for which $p(u)$ is greater than $p(v)$ . By Lemma 6.1, there exists a set of edges, namely $M$ , that form a matching, and we have:

$$\sum_{(u,v)\in M}\left(p(u)-p(v)\right)\geq\epsilon.$$

Note that without loss of generality one can assume $M$ only has violating edges, since removing non-violating edges only makes the left hand side larger. By the uniformity assumption for $p$ , $p(u)-p(v)$ is exactly $1/|R|$ . Thus, by the above inequality, we have $|M|/|R|$ is at least $\epsilon$ .

Since there are $|M|$ vertices in $V_{B}$ that belong to the matching, $|B\cap M|$ is a random variable distributed according to the binomial distribution $\mathrm{\mathbf{Bin}}(s_{1},{|M|}/{|R|})$. We have that

$$\mathrm{\mathbf{E}}\left[|B\cap M|\right]=s_{1}\cdot\frac{|M|}{|R|}\geq\epsilon s_{1}.$$

Using Chebyshev’s inequality and the fact that $|B\cap M|$ is binomially distributed, we have

[TABLE]

Thus, with high probability, $B$ contains at least $\epsilon s_{1}/2$ endpoints in $M$. Note that the neighbor set of $B$ contains the other endpoints of the edges in the matching $M$. Thus, $T$ contains at least $|B\cap M|$ vertices of zero probability, which implies that the size of $T$ has to be at least $\epsilon s_{1}/2$. Hence, for sufficiently large $n$, the probability that $p$ gets accepted due to the condition $|T|\leq\epsilon s_{1}/2$ is negligible.

Consider the second set of samples drawn in the algorithm, $\mathcal{S}_{2}$. Clearly, the size of $Y\coloneqq T\cap\mathcal{S}_{2}$ is a binomial random variable drawn from $\mathrm{\mathbf{Bin}}(s_{2},|T\cap R|/|R|)$. Moreover, as shown above, at least an $\epsilon^{\prime}\coloneqq\epsilon s_{1}/(2|T|)$ fraction of the elements in $T$ have zero probability. Thus, $|T\cap R|/|R|$ is at most $(1-\epsilon^{\prime})|T|/|R|$, while in the completeness case it is $|T|/|R|$. So, we only need to estimate the bias of a Bernoulli random variable up to an additive error of $\epsilon^{\prime\prime}\coloneqq\epsilon^{\prime}|T|/(2|R|)$. By the Hoeffding bound, we only need to draw $O(1/{\epsilon^{\prime\prime}}^{2})$ samples to distinguish the two cases with high probability, which implies:

[TABLE]

Thus, with high probability, we distinguish them correctly.
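The two stages analyzed above can be sketched as follows. Algorithm 4 itself is not reproduced in this section, so the sketch is a reading of the proof rather than the authors' pseudocode: the function name `uniform_subset_tester`, the decision to accept when $|T|\leq\epsilon s_{1}/2$, and the midpoint threshold $|T|/|R|-\epsilon^{\prime\prime}$ in the second stage are assumptions derived from the quantities $\epsilon^{\prime}$ and $\epsilon^{\prime\prime}$ defined in the proof.

```python
def uniform_subset_tester(sample_p, bottom, out_neighbors, support_size,
                          eps, s1, s2):
    """Two-stage tester sketch; returns "accept" or "reject".

    sample_p: zero-argument callable returning one vertex drawn from p.
    out_neighbors: dict mapping each bottom vertex to its out-neighbors.
    support_size: |R|, assumed known to the tester.
    """
    # Stage 1: bottom vertices seen in s1 samples and their out-neighbors.
    seen = {sample_p() for _ in range(s1)}
    B = seen & set(bottom)
    T = set()
    for u in B:
        T.update(out_neighbors.get(u, ()))
    if len(T) <= eps * s1 / 2:
        return "accept"  # assumed decision: too few candidate violations
    # Stage 2: estimate the mass of T from s2 fresh samples.
    hits = sum(1 for _ in range(s2) if sample_p() in T)
    mass_if_monotone = len(T) / support_size        # |T| / |R|
    eps1 = eps * s1 / (2 * len(T))                  # epsilon'
    eps2 = eps1 * len(T) / (2 * support_size)       # epsilon''
    # Assumed threshold: halfway between the two cases of the proof.
    return "accept" if hits / s2 >= mass_if_monotone - eps2 else "reject"
```

For instance, if all mass sits on a single bottom vertex whose out-neighbor has probability zero, the second stage sees no samples in $T$ and the sketch rejects.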

7.5 Upper bound via trying all matchings

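In this section we present a simple upper bound for the problem of monotonicity testing on bipartite graphs. Let $M$ be the number of pairs of subsets $(S_{t},S_{b})$ of top and bottom elements respectively for which there exists a perfect matching between them. The algorithm estimates, for each such pair, the difference between the probability mass of $S_{b}$ and that of $S_{t}$, and rejects if some estimate exceeds $\epsilon/2$. We analyze it in the following theorem.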

Theorem 7.12.

We can test whether a distribution $p$ over a bipartite graph $G$ with $n$ vertices is monotone or $\epsilon$ -far from any monotone distribution with success probability $2/3$ , using $O((\log M)/\epsilon^{2})$ samples, where $M$ is the number of pairs of subsets of top and bottom elements respectively for which there exists a perfect matching between them. That is, $O(n/\epsilon^{2})$ samples for a worst case graph $G$ .

Proof.

Let $w_{t}$ and $w_{b}$ denote the probability mass of $S_{t}$ and $S_{b}$, respectively. Using $O(1/\epsilon^{2})$ samples, we can estimate each of $w_{t}$ and $w_{b}$ within an additive error of $\epsilon/8$, and hence their difference within an error of $\epsilon/4$, with constant probability. We amplify the success probability by repeating the estimation $O(\log M)$ times and taking the median of the estimates, for a total of $O((\log M)/\epsilon^{2})$ samples. Then, for each pair of subsets, the probability that the algorithm fails to estimate the difference of $w_{b}$ and $w_{t}$ within an error of $\epsilon/4$ is at most $O(\frac{1}{M})$. By the union bound over all $M$ pairs, with constant probability every estimate is accurate, so comparing $\hat{w}_{b}-\hat{w}_{t}$ with $\epsilon/2$ distinguishes whether $w_{b}-w_{t}$ is at least $\epsilon$ or at most zero.

Now, if $p$ is $\epsilon$-far from being monotone with respect to the graph $G$, then by Lemma 6.1 there exists a matching such that the total difference between the probabilities of the bottom and the top elements, $w_{b}-w_{t}$, is at least $\epsilon$. Thus, in one of the iterations, we will consider this matching and output reject. Also, if $p$ is monotone with respect to the graph $G$, there is no violating edge. Therefore, for each pair $S_{t}$ and $S_{b}$, we have $w_{b}-w_{t}\leq 0$. Thus, we output reject in no iteration, and the distribution is accepted at the end.

Lastly, since there are at most $2^{n_{t}}\cdot 2^{n_{b}}=2^{n_{t}+n_{b}}=2^{n}$ pairs of subsets, where $n_{t}$ and $n_{b}$ are the total numbers of top and bottom elements respectively, we have $\log M\leq n$, and we conclude that the sample complexity is $O(n/\epsilon^{2})$. ∎

Remark:

Note that in order to execute the above algorithm, it is not required to know the quantity $M$ in advance. We can instead draw more samples and update all our estimates at the same time to sufficiently reduce the error probability for each estimate for the union bound to work.
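For very small graphs, the tester and the remark above can be sketched as follows. This is an illustration under assumptions, not the paper's algorithm verbatim: the names `matchings` and `all_matchings_tester` are hypothetical, every pair $(S_{t},S_{b})$ is reached by enumerating the matchings of $G$ and taking their endpoint sets, a single batch of samples is reused for all estimates (as the remark suggests) in place of median amplification, and vertex names are assumed distinct across the two sides. The enumeration takes exponential time.

```python
from collections import Counter

def matchings(edges):
    """Yield every matching of a small edge list as a tuple of edges."""
    def extend(start, used, current):
        yield tuple(current)
        for k in range(start, len(edges)):
            u, v = edges[k]
            if u not in used and v not in used:
                yield from extend(k + 1, used | {u, v}, current + [(u, v)])
    yield from extend(0, frozenset(), [])

def all_matchings_tester(samples, edges, eps):
    """Reject if some matching has empirical w_b - w_t above eps / 2.

    edges are (bottom, top) pairs; samples is a list of observed vertices.
    """
    freq = Counter(samples)
    m = len(samples)
    for matching in matchings(edges):
        w_b = sum(freq[u] for u, _ in matching) / m  # mass estimate of S_b
        w_t = sum(freq[v] for _, v in matching) / m  # mass estimate of S_t
        if w_b - w_t > eps / 2:
            return "reject"
    return "accept"
```

The empty matching is included in the enumeration and always yields a difference of zero, so it never causes a rejection.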

Bibliography

[ACS 10] Michal Adamaszek, Artur Czumaj, and Christian Sohler. Testing monotone continuous distributions on high-dimensional real cubes. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 56–65, 2010.
[ADK 15] Jayadev Acharya, Constantinos Daskalakis, and Gautam Kamath. Optimal testing for properties of distributions. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3591–3599, 2015.
[AJOS 13] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. A competitive test for uniformity of monotone distributions. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2013, Scottsdale, AZ, USA, April 29 - May 1, 2013, pages 57–65, 2013.
[BB 16] Aleksandrs Belovs and Eric Blais. A polynomial lower bound for testing monotonicity. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1021–1032, 2016.
[BCS 18] Hadley Black, Deeparnab Chakrabarty, and C. Seshadhri. A $o(d)\cdot\mathrm{polylog}\,n$ monotonicity tester for Boolean functions over the hypergrid $[n]^{d}$. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2133–2151, 2018.
[BDKR 05] Tugkan Batu, Sanjoy Dasgupta, Ravi Kumar, and Ronitt Rubinfeld. The complexity of approximating the entropy. SIAM J. Comput., 35(1):132–150, 2005.
[BFRV 10] Arnab Bhattacharyya, Eldar Fischer, Ronitt Rubinfeld, and Paul Valiant. Testing monotonicity of distributions over general partial orders. Electronic Colloquium on Computational Complexity (ECCC), 17:27, 2010.
[BKR 04] Tugkan Batu, Ravi Kumar, and Ronitt Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, STOC ’04, pages 381–390, New York, NY, USA, 2004. ACM.
