Distributed Detection with Empirically Observed Statistics

Haiyun He; Lin Zhou; Vincent Y. F. Tan

arXiv:1903.05819·cs.IT·February 13, 2020


TL;DR

This paper investigates distributed detection when the underlying distributions are unknown, using noisy empirical statistics, deriving optimal error exponents, and showing that a single channel becomes optimal as training length increases.

Contribution

It introduces a framework for distributed detection with empirically observed statistics and derives the optimal error exponents, extending classical detection results to unknown distributions.

Findings

01

Optimal type-II error exponent derived for binary detection.

02

Using one channel is optimal as training length ratio tends to infinity.

03

Numerical evidence suggests one channel remains optimal for finite training lengths.



Equations

$$a_{j}^{(n)}:=\frac{1}{n}\sum_{i\in[n]}\mathbbm{1}\{h(i)=j\},\qquad b_{j}^{(n)}:=\frac{1}{\alpha n}\sum_{i\in[\alpha n]}\mathbbm{1}\{g(i)=j\}.$$

$$a_{j}:=\lim_{n\to\infty}a_{j}^{(n)},\qquad b_{j}:=\lim_{n\to\infty}b_{j}^{(n)},\qquad\forall\,j\in[K].$$

$$\beta_{\nu}(\gamma,P_{1},P_{2}|\mathbf{a},\mathbf{b},\mathcal{W})=\mathbb{P}_{\nu}\big\{\gamma(Z^{n},\tilde{Y}_{1}^{N},\tilde{Y}_{2}^{N})\neq\mathrm{H}_{\nu}\big\},$$

$$E^{*}(n,\alpha,P_{1},P_{2},\lambda|\mathbf{a},\mathbf{b},\mathcal{W}):=\sup\Big\{E\in\mathbb{R}_{+}:\exists\,\gamma\ \text{s.t.}\ \beta_{2}(\gamma,P_{1},P_{2})\leq\exp(-nE)\ \text{and}\ \beta_{1}(\gamma,\tilde{P}_{1},\tilde{P}_{2})\leq\exp(-n\lambda),\ \forall\,(\tilde{P}_{1},\tilde{P}_{2})\in\mathcal{P}(\mathcal{X})^{2}\Big\}.$$

$$\mathrm{GJS}(\tilde{Q},Q,\alpha):=D\Big(Q\Big\|\frac{Q+\alpha\tilde{Q}}{1+\alpha}\Big)+\alpha D\Big(\tilde{Q}\Big\|\frac{Q+\alpha\tilde{Q}}{1+\alpha}\Big).$$

$$\mathrm{LD}(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2},P,\tilde{P}_{1},\tilde{P}_{2}|\alpha,\mathbf{a},\mathbf{b},\mathcal{W}):=\sum_{k\in[K]}\Big(a_{k}D(Q_{k}\|PW_{k})+\sum_{i\in[2]}\alpha b_{k}D(\tilde{Q}_{i,k}\|\tilde{P}_{i}W_{k})\Big),$$

$$\mathcal{Q}_{\lambda}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W}):=\Big\{(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2})\in\mathcal{P}([L])^{3K}:\min_{(\tilde{P},P)\in\mathcal{P}(\mathcal{X})^{2}}\mathrm{LD}(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2},\tilde{P},\tilde{P},P|\alpha,\mathbf{a},\mathbf{b},\mathcal{W})\leq\lambda\Big\}.$$

$$f_{\alpha}(P_{1},P_{2}|\mathbf{a},\mathbf{b},\mathcal{W}):=\min_{(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2})\in\mathcal{Q}_{\lambda}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})}\mathrm{LD}(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2},P_{2},P_{1},P_{2}|\alpha,\mathbf{a},\mathbf{b},\mathcal{W}).$$

$$\lim_{n\to\infty}E^{*}(n,\alpha,P_{1},P_{2},\lambda|\mathbf{a},\mathbf{b},\mathcal{W})=f_{\alpha}(P_{1},P_{2}|\mathbf{a},\mathbf{b},\mathcal{W}).$$

$$\gamma(Z^{n},\tilde{Y}_{1}^{N},\tilde{Y}_{2}^{N})=\begin{cases}\mathrm{H}_{1}&\text{if }\min_{\tilde{P},P}\mathrm{LD}\big(\mathbf{T}_{\mathbf{Z}^{n\mathbf{a}}},\mathbf{T}_{\tilde{\mathbf{Y}}_{1}^{N\mathbf{b}}},\mathbf{T}_{\tilde{\mathbf{Y}}_{2}^{N\mathbf{b}}},\tilde{P},\tilde{P},P\big)\leq\lambda,\\ \mathrm{H}_{2}&\text{otherwise},\end{cases}$$

$$\min_{\tilde{P},P}\mathrm{LD}\big(\mathbf{T}_{\mathbf{Z}^{n\mathbf{a}}},\mathbf{T}_{\tilde{\mathbf{Y}}_{1}^{N\mathbf{b}}},\mathbf{T}_{\tilde{\mathbf{Y}}_{2}^{N\mathbf{b}}},\tilde{P},\tilde{P},P\,\big|\,\alpha,\mathbf{a},\mathbf{b},\{I_{L}\}\big)=\min_{\tilde{P},P}\Big(D(T_{Z^{n}}\|\tilde{P})+\alpha D(T_{\tilde{Y}_{1}^{N}}\|\tilde{P})+\alpha D(T_{\tilde{Y}_{2}^{N}}\|P)\Big)=\mathrm{GJS}(T_{\tilde{Y}_{1}^{N}},T_{Z^{n}},\alpha),$$

$$\gamma(Z^{n},\tilde{Y}_{1}^{N},\tilde{Y}_{2}^{N})=\begin{cases}\mathrm{H}_{1}&\text{if }\mathrm{GJS}(T_{\tilde{Y}_{1}^{N}},T_{Z^{n}},\alpha)\leq\lambda,\\ \mathrm{H}_{2}&\text{otherwise},\end{cases}$$

$$\lim_{n\to\infty}E^{*}(n,\alpha,P_{1},P_{2},\lambda|\mathbf{a},\mathbf{b},\mathcal{W})=\min_{(Q,\tilde{Q})\in\mathcal{P}(\mathcal{Z})^{2}:\,\mathrm{GJS}(\tilde{Q},Q,\alpha)\leq\lambda}D(Q\|P_{2})+\alpha D(\tilde{Q}\|P_{1}).$$

$$f_{\alpha}(\mathbf{a},\mathbf{b},\lambda):=f_{\alpha}(P_{1},P_{2}|\mathbf{a},\mathbf{b},\mathcal{W}).$$

$$f_{\alpha}^{*}(\lambda):=\max_{(\mathbf{a},\mathbf{b})\in\mathcal{P}([K])^{2}}f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)$$

$$\mathcal{P}(\tilde{P}|\mathbf{v},\mathcal{W}):=\big\{P\in\mathcal{P}(\mathcal{X}):\ \forall\,k\in[K],\ v_{k}\|PW_{k}-\tilde{P}W_{k}\|_{\infty}=0\big\}.$$

$$\kappa(\mathbf{Q},P_{1}|\mathbf{a},\mathbf{b},\mathcal{W}):=\min_{\tilde{P}\in\mathcal{P}(P_{1}|\mathbf{b},\mathcal{W})}\sum_{k\in[K]}a_{k}D(Q_{k}\|\tilde{P}W_{k}).$$

$$\lim_{\alpha\to\infty}f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)=f_{\infty}(\mathbf{a},\mathbf{b},\lambda):=\min_{\mathbf{Q}\in\mathcal{P}([L])^{K}:\,\kappa(\mathbf{Q},P_{1}|\mathbf{a},\mathbf{b},\mathcal{W})\leq\lambda}\sum_{k\in[K]}a_{k}D(Q_{k}\|P_{2}W_{k}).$$

$$\sup_{(\mathbf{a},\mathbf{b})\in\mathcal{P}([K])^{2}}f_{\infty}(\mathbf{a},\mathbf{b},\lambda)=\max_{k\in[K]}f_{\infty}(\mathbf{e}_{k},\mathbf{e}_{k},\lambda),$$

$$\begin{aligned}\mathrm{G}_{\alpha}(\mathbf{a},\mathbf{b})&:=\min_{(\tilde{P},P)\in\mathcal{P}(\mathcal{X})^{2}}\sum_{k\in[K]}\Big(a_{k}D(P_{2}W_{k}\|\tilde{P}W_{k})+\alpha b_{k}D(P_{1}W_{k}\|\tilde{P}W_{k})+\alpha b_{k}D(P_{2}W_{k}\|PW_{k})\Big)\\ &=\min_{\tilde{P}\in\mathcal{P}(\mathcal{X})}\sum_{k\in[K]}\Big(a_{k}D(P_{2}W_{k}\|\tilde{P}W_{k})+\alpha b_{k}D(P_{1}W_{k}\|\tilde{P}W_{k})\Big).\end{aligned}$$

$$\lambda=\mathrm{G}_{\alpha}(\mathbf{a},\mathbf{b}).$$

$$f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)=0.$$

$$\lim_{n\to\infty}E_{\mathcal{V}_{\mathrm{I}}}^{*}(n,\alpha,P_{1},P_{2},\lambda|\mathbf{a},\mathbf{b},\mathcal{W})$$


Full text

Distributed Detection with Empirically Observed Statistics

Haiyun He, Student Member, IEEE, Lin Zhou, Member, IEEE, Vincent Y. F. Tan, Senior Member, IEEE

This work is funded by a Singapore National Research Foundation Fellowship (R-263-000-D02-281) and the Research Scholar Budget (RSB) from NUS (C-261-000-207-532 and C-261-000-005-001). This paper was presented in part at the IEEE Information Theory Workshop in Visby, Gotland, Sweden, 2019. H. He and V. Y. F. Tan are with the Department of Electrical and Computer Engineering, National University of Singapore (NUS) (Emails: [email protected] and [email protected]). V. Y. F. Tan is also with the Department of Mathematics, NUS. L. Zhou is with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor (Email: [email protected]). Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

Abstract

Consider a distributed detection problem in which the underlying distributions of the observations are unknown; instead of these distributions, noisy versions of empirically observed statistics are available to the fusion center. These empirically observed statistics, together with source (test) sequences, are transmitted through different channels to the fusion center. The fusion center decides which distribution the source sequence is sampled from based on these data. For the binary case, we derive the optimal type-II error exponent given that the type-I error decays exponentially fast. The type-II error exponent is maximized over the proportions of channels for both source and training sequences. We conclude that as the ratio of the lengths of training to test sequences $\alpha$ tends to infinity, using only one channel is optimal. By calculating the derived exponents numerically, we conjecture that the same is true when $\alpha$ is finite under certain conditions. We relate our results to the classical distributed detection problem studied by Tsitsiklis, in which the underlying distributions are known. Finally, our results are extended to the case of $m$ -ary distributed detection with a rejection option.

Index Terms:

Distributed detection, Error exponents, Training samples, Hypothesis testing

I Introduction

The problem of distributed detection [1, 2] has a plethora of applications, such as in distributed radar and sensor networks; see [3] and references therein for an overview. In these examples, the observed information at local sensors (processors) needs to be quantized before being sent to a fusion center. The fusion center then performs a specific inference task such as hypothesis testing.

In the traditional distributed detection problem as studied in [1, 2, 3, 4], the underlying generating distributions are available at the fusion center and one is tasked to design a test based on observations as well as the known distributions. However, in practical applications, the fusion center has no knowledge of the underlying distributions and may only be given quantized or noisy observations and labelled training sequences (in place of the generating distributions). This leads to new challenges in designing optimal tests.

Motivated by these practical issues and inspired by [5, 2], we adopt a statistical learning approach and consider the distributed detection problem shown, for the binary case, in Figure 1, in which the distributions of the sensor observations are unknown. We term this problem distributed detection with empirically observed statistics. We assume that the sensor observations are transmitted to the fusion center via different channels, which can also be regarded as compressors. Labelled training sequences generated from the different underlying distributions are pre-processed and then provided to the fusion center. Our aim is to derive fundamental performance limits of this classification problem and to determine whether the conclusion of Tsitsiklis in [2] carries over, i.e., whether a small number of distinct channels or local decision rules suffices to attain the optimal error exponent.

I-A Main Contributions

In this paper, our main contributions are as follows.

Firstly, for the binary distributed detection problem, we derive the asymptotically optimal type-II error exponent when the type-I error exponent is lower bounded by a positive constant. In the achievability proof, we introduce a generalized version of Gutman’s test in [5] and prove that the so-designed test is asymptotically optimal.

Secondly, again restricting ourselves to the binary case, we discuss the optimal proportions of different channels that serve as pre-processors of the training and source sequences. Let $\alpha$ , a constant, denote the ratio between the length of the training sequence and the length of the source sequence. When $\alpha\to\infty$ , we provide a closed-form expression for the type-II error exponent and prove that using only one identical channel for both training and source sequences is asymptotically optimal. This mirrors Tsitsiklis’ result [2]. On the other hand, if $\alpha$ is sufficiently small, the type-II error exponent is identically equal to zero. When $\alpha$ does not take extreme values, by calculating the derived exponent numerically, we conjecture that using one channel for the training sequence and another (possibly the same one) for the source sequence is optimal under certain conditions.

Thirdly, we relate our results to the classical distributed detection problem in Tsitsiklis’ paper [2]. When $\alpha\to\infty$ , the true distributions can be estimated to arbitrary accuracy and we naturally recover the results in [2] for both the Neyman-Pearson and Bayesian settings.

Finally, we extend our analyses to consider an $m$-ary distributed detection problem with rejection. We derive the asymptotically optimal type-$j$ rejection exponent for each $j\in[m]$ under the condition that all (undetected) error exponents are lower bounded by a positive constant $\lambda$. In the achievability proof, we introduce a generalized version of Unnikrishnan’s test [6] by identifying an appropriate test statistic.

I-B Related Works

The distributed detection literature is vast, so we review only the most closely related works. This paper is mainly inspired by [5] and [2]. In [5], Gutman proposed an asymptotically optimal type-based test for the binary classification problem. In [2], Tsitsiklis showed that using $\frac{1}{2}m(m-1)$ distinct local decision rules is optimal for $m$-ary hypothesis testing in standard Bayesian and Neyman-Pearson distributed detection settings. Ziv [7] proposed a discriminant function related to universal data compression in the binary classification problem with empirically observed statistics. Chamberland and Veeravalli [8] considered classical distributed detection in a sensor network with a multiple access channel, capacity constraint and additive noise. Liu and Sayeed [9] extended type-based distributed detection to wireless networks. Chen and Wang [10] studied the anonymous heterogeneous distributed detection problem and quantified the price of anonymity. Tay, Tsitsiklis and Win studied tree-based variations of the distributed detection problem in the Bayesian [4] and Neyman-Pearson settings [11]. The authors also studied Bayesian distributed detection in a tandem sensor network [12]. The aforementioned works assume that the distributions are known.

Nguyen, Wainwright and Jordan [13] proposed a kernel-based algorithm for the nonparametric distributed detection problem with communication constraints. Similarly, Sun and Tay [14] studied nonparametric distributed detection networks using kernel methods in the presence of privacy constraints. While the problem settings in [13] and [14] involve training samples, the questions posed there are algorithmic in nature and hence different. In particular, they do not involve fundamental limits in the spirit of this paper.

I-C Paper Outline

The rest of this paper is organized as follows. In Section II, we formulate the distributed detection problem with empirically observed statistics. We also present the optimal type-II error exponent and analyze the optimal proportion of channels and recover analogues of the results in [2] both for Neyman-Pearson and Bayesian settings. In Section III, we extend our results to the case in which there are $m\geq 2$ hypotheses and the rejection option is present. We conclude our discussion and present avenues for future work in Section IV. The proofs of our results are provided in the appendices.

I-D Notation

Random variables and their realizations are in upper (e.g., $X$ ) and lower case (e.g., $x$ ) respectively. All sets are denoted in calligraphic font (e.g., $\mathcal{X}$ ). We use $\mathcal{X}^{\mathrm{c}}$ to denote the complement of $\mathcal{X}$ . Let $X^{n}:=(X_{1},\ldots,X_{n})$ be a random vector of length $n$ . All logarithms are base $e$ . Given any two integers $(a,b)\in\mathbb{N}^{2}$ , we use $[a:b]$ to denote the set of integers $\{a,a+1,\ldots,b\}$ and use $[a]$ to denote $[1:a]$ . The set of all probability distributions on a finite set $\mathcal{X}$ is denoted as $\mathcal{P}(\mathcal{X})$ and the set of all conditional probability distributions from $\mathcal{X}$ to $\mathcal{Y}$ is denoted as $\mathcal{P}(\mathcal{Y}|\mathcal{X})$ . Given $P\in\mathcal{P}(\mathcal{X})$ and $V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})$ , we use $PV$ to denote the marginal distribution on $\mathcal{Y}$ induced by $P$ and $V$ . We denote the support of $P$ as $\operatorname{supp}(P)$ . Given a vector $x^{n}=(x_{1},x_{2},\ldots,x_{n})\in\mathcal{X}^{n}$ , the type or empirical distribution [15] is denoted as $T_{x^{n}}(a)=\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}\{x_{i}=a\}$ where $a\in\mathcal{X}$ . We interchangeably use $\mathcal{T}_{T_{x^{n}}}^{n}$ and $\mathcal{T}_{x^{n}}:=\{\tilde{x}^{n}\in\mathcal{X}^{n}:T_{\tilde{x}^{n}}(a)=T_{x^{n}}(a),~{}\forall\,a\in\mathcal{X}\}$ to denote the type class of $T_{x^{n}}$ . Let $\mathcal{P}_{n}(\mathcal{X})$ denote the set of types with denominator $n$ . For two positive sequences $\{a_{n}\}$ and $\{b_{n}\}$ , we write $a_{n}\stackrel{{\scriptstyle.}}{{\leq}}b_{n}$ if $\limsup_{n\to\infty}\frac{1}{n}\log\frac{a_{n}}{b_{n}}\leq 0$ . The notations $\stackrel{{\scriptstyle.}}{{\geq}}$ and $\doteq$ are defined similarly. For a given vector $\mathbf{a}\in\mathbb{R}^{d}$ , we let $\mathrm{supp}(\mathbf{a}):=\{i\in[d]:a_{i}\neq 0\}$ denote the support of $\mathbf{a}$ .
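To make the notion of a type concrete, the following minimal sketch computes the empirical distribution $T_{x^{n}}$ of a sequence and checks whether two sequences lie in the same type class. The function names `empirical_type` and `same_type_class` are illustrative choices and do not come from the paper.

```python
from collections import Counter


def empirical_type(seq, alphabet):
    """Return T_{x^n}(a) = (1/n) * #{i : x_i = a} for every a in the alphabet."""
    n = len(seq)
    counts = Counter(seq)
    # Counter returns 0 for symbols that never occur in seq.
    return {a: counts[a] / n for a in alphabet}


def same_type_class(x, y):
    """Two sequences share a type class iff one is a permutation of the other."""
    return sorted(x) == sorted(y)
```

For instance, the sequences `[1, 2, 2]` and `[2, 1, 2]` have the same type and therefore belong to the same type class.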

II Binary Distributed Detection with Training Samples

In this section, we formulate the problem in which there are two hypotheses and instead of distributions, only training samples are available.

II-A Problem Formulation

We assume that there are $K$ fixed compressors or channels (these are called local decision rules in [2]), where for each $j\in[K]$, the $j$-th channel is $W_{j}\in\mathcal{P}(\mathcal{Z}|\mathcal{X})$. This channel has input alphabet $\mathcal{X}=[M]$ and output alphabet $\mathcal{Z}=[L]$. For notational simplicity, we assume that $|\mathcal{X}|=M<\infty$, but our results go through for uncountably infinite $\mathcal{X}$ as well. We let $\mathcal{W}:=\{W_{j}\}_{j\in[K]}$ be a fixed set of channels. Furthermore, let $h:[n]\to[K]$ and $g:[N]\to[K]$ be functions that map the index of the test/training sample to the channel index.

The system model is as follows (see Figure 1). There are $n$ sensors and a source/test sequence $X^{n}$ generated i.i.d. according to some unknown distribution defined on $\mathcal{X}$. For each $i\in[n]$, the $i$-th sensor observes $X_{i}\in\mathcal{X}$ and maps it to $Z_{i}$ using the channel $W_{h(i)}$. The $Z_{i}$’s from all local sensors are transmitted to a fusion center. In addition to the $Z_{i}$’s, the fusion center is given noisy versions of two training sequences $(Y_{1}^{N},Y_{2}^{N})\in\mathcal{X}^{2N}$, which are generated i.i.d. according to some unknown but fixed distributions $(P_{1},P_{2})\in\mathcal{P}(\mathcal{X})^{2}$. Specifically, it observes $(\tilde{Y}_{1}^{N},\tilde{Y}_{2}^{N})$, where $\tilde{Y}_{1,i}\sim W_{g(i)}(\cdot|Y_{1,i})$ and $\tilde{Y}_{2,i}\sim W_{g(i)}(\cdot|Y_{2,i})$ for all $i\in[N]$. With $(\tilde{Y}_{1}^{N},\tilde{Y}_{2}^{N})$ and $Z^{n}$, the fusion center uses a decision rule $\gamma:[L]^{2N+n}\to\{\mathrm{H}_{1},\mathrm{H}_{2}\}$ to discriminate between the following two hypotheses:

• $\mathrm{H}_{1}$: the source sequence $X^{n}$ and the training sequence $Y_{1}^{N}$ are generated according to the same distribution;

• $\mathrm{H}_{2}$: the source sequence $X^{n}$ and the training sequence $Y_{2}^{N}$ are generated according to the same distribution.
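The data available to the fusion center under either hypothesis can be simulated as follows. This is only a sketch under assumptions of our own: symbols and channel indices are 0-based integers, each channel is stored as a row-stochastic list of lists with `channels[k][x]` playing the role of $W_{k}(\cdot|x)$, and the names `draw` and `simulate` are hypothetical.

```python
import random


def draw(dist, rng):
    """Draw one symbol from a probability vector over {0, ..., len(dist)-1}."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def simulate(P1, P2, channels, h, g, hypothesis, seed=0):
    """Generate (Z^n, Y~_1^N, Y~_2^N) as seen by the fusion center.

    h[i] and g[i] give the (0-based) channel index for the i-th test
    and training sample; hypothesis is 1 or 2.
    """
    rng = random.Random(seed)
    # Under H1 the test sequence shares the law of Y_1, under H2 that of Y_2.
    P_src = P1 if hypothesis == 1 else P2
    z = [draw(channels[h[i]][draw(P_src, rng)], rng) for i in range(len(h))]
    y1_t = [draw(channels[g[i]][draw(P1, rng)], rng) for i in range(len(g))]
    y2_t = [draw(channels[g[i]][draw(P2, rng)], rng) for i in range(len(g))]
    return z, y1_t, y2_t
```

Note that both training sequences pass through the same assignment `g`, matching the model in which $\tilde{Y}_{1,i}$ and $\tilde{Y}_{2,i}$ are both outputs of $W_{g(i)}$.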

We assume $N=\lceil\alpha n\rceil$ for some $\alpha\in\mathbb{R}_{+}$; for simplicity, we ignore the integer constraints on $(n,N)$ and write $N=\alpha n$. For each $j\in[K]$, we use $a_{j}^{(n)}$ and $b_{j}^{(n)}$ to denote the proportions of $[n]$ and $[N]$ in which the channel $W_{j}$ is used to process the source and training sequences respectively, i.e.,

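$$a_{j}^{(n)}:=\frac{1}{n}\sum_{i\in[n]}\mathbbm{1}\{h(i)=j\},\qquad b_{j}^{(n)}:=\frac{1}{\alpha n}\sum_{i\in[\alpha n]}\mathbbm{1}\{g(i)=j\}.$$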

An example is given in Figure 2. Furthermore, we let $\mathbf{a}^{(n)}=(a_{1}^{(n)},\ldots,a_{K}^{(n)})$ and $\mathbf{b}^{(n)}=(b_{1}^{(n)},\ldots,b_{K}^{(n)})$ . We assume that the following limits exist:

$$a_{j}:=\lim_{n\to\infty}a_{j}^{(n)},\qquad b_{j}:=\lim_{n\to\infty}b_{j}^{(n)},\qquad\forall\,j\in[K].$$

To avoid clutter in subsequent mathematical expressions, we abuse notation subsequently and drop the superscript $(n)$ in $\mathbf{a}^{(n)}$ and $\mathbf{b}^{(n)}$ in all non-asymptotic expressions, with the understanding that $\mathbf{a}$ (resp. $a_{j}$ ) appearing in a non-asymptotic expression should be interpreted as $\mathbf{a}^{(n)}$ (resp. $a_{j}^{(n)}$ ).
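The non-asymptotic proportions $a_{j}^{(n)}$ and $b_{j}^{(n)}$ are simply the fractions of sample indices assigned to each channel. A minimal sketch, assuming the assignment map $h$ (or $g$) is stored as a Python list of channel indices in $\{1,\ldots,K\}$; the function name is illustrative.

```python
def channel_proportions(assignment, K):
    """Fraction of samples routed through each of the channels 1..K.

    assignment[i] is the channel index h(i+1) (or g(i+1)) in {1, ..., K}.
    """
    n = len(assignment)
    return [sum(1 for j in assignment if j == k) / n for k in range(1, K + 1)]
```

With $K=3$ and the assignment `[1, 1, 2, 3]`, half of the samples use channel 1 and a quarter each use channels 2 and 3.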

Given any decision rule $\gamma$ at the fusion center and any pair of distributions $(P_{1},P_{2})$ according to which the training sequences $(Y_{1}^{N},Y_{2}^{N})$ are generated, the performance metrics we consider are the type-I and type-II error probabilities

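$$\beta_{\nu}(\gamma,P_{1},P_{2}|\mathbf{a},\mathbf{b},\mathcal{W})=\mathbb{P}_{\nu}\big\{\gamma(Z^{n},\tilde{Y}_{1}^{N},\tilde{Y}_{2}^{N})\neq\mathrm{H}_{\nu}\big\},$$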

where for $\nu\in[2]$, we use $\mathbb{P}_{\nu}:=\Pr\{\cdot|\mathrm{H}_{\nu}\}$ to denote the joint distribution of $Z^{n}$ and $(\tilde{Y}_{1}^{N},\tilde{Y}_{2}^{N})$ under hypothesis $\mathrm{H}_{\nu}$. In the remainder of this paper, we use $\beta_{\nu}(\gamma,P_{1},P_{2})$ to denote $\beta_{\nu}(\gamma,P_{1},P_{2}|\mathbf{a},\mathbf{b},\mathcal{W})$ if there is no risk of confusion.

Inspired by [5], in this paper, we are interested in the maximal type-II error exponent with respect to a pair of target distributions for any decision rule at the fusion center whose type-I error probability decays exponentially fast with a certain fixed exponential rate for all pairs of distributions, i.e., given any $\lambda\in\mathbb{R}_{+}$ , the optimal non-asymptotic type-II error exponent is

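$$E^{*}(n,\alpha,P_{1},P_{2},\lambda|\mathbf{a},\mathbf{b},\mathcal{W}):=\sup\Big\{E\in\mathbb{R}_{+}:\exists\,\gamma\ \text{s.t.}\ \beta_{2}(\gamma,P_{1},P_{2})\leq\exp(-nE)\ \text{and}\ \beta_{1}(\gamma,\tilde{P}_{1},\tilde{P}_{2})\leq\exp(-n\lambda),\ \forall\,(\tilde{P}_{1},\tilde{P}_{2})\in\mathcal{P}(\mathcal{X})^{2}\Big\}.$$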

II-B Definitions

To state our results succinctly, we first introduce some non-standard definitions. Given any pair of distributions $(Q,\tilde{Q})\in\mathcal{P}([L])^{2}$ and any $\alpha\in\mathbb{R}_{+}$, the generalized Jensen-Shannon divergence [16, Eqn. (3)] is defined as

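$$\mathrm{GJS}(\tilde{Q},Q,\alpha):=D\Big(Q\Big\|\frac{Q+\alpha\tilde{Q}}{1+\alpha}\Big)+\alpha D\Big(\tilde{Q}\Big\|\frac{Q+\alpha\tilde{Q}}{1+\alpha}\Big).$$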
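The generalized Jensen-Shannon divergence is straightforward to evaluate numerically: both arguments are compared, via KL divergence, to the mixture $(Q+\alpha\tilde{Q})/(1+\alpha)$. The sketch below assumes distributions given as lists of probabilities over a common finite alphabet and $\alpha>0$; the helper names `kl` and `gjs` are ours.

```python
import math


def kl(p, q):
    """KL divergence D(p||q) in nats, with the convention 0 log 0 = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # p is not absolutely continuous w.r.t. q
            total += pi * math.log(pi / qi)
    return total


def gjs(q_tilde, q, alpha):
    """GJS(q_tilde, q, alpha) = D(q||mix) + alpha * D(q_tilde||mix), alpha > 0."""
    mix = [(qi + alpha * qti) / (1 + alpha) for qi, qti in zip(q, q_tilde)]
    return kl(q, mix) + alpha * kl(q_tilde, mix)
```

Two quick checks on this sketch: the divergence vanishes when the two distributions coincide, and for very large `alpha` its value approaches $D(Q\|\tilde{Q})$, since the mixture then concentrates on $\tilde{Q}$.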

Let $\mathbf{Q}=(Q_{1},\ldots,Q_{K})\in\mathcal{P}([L])^{K}$ and $\tilde{\mathbf{Q}}_{i}=(\tilde{Q}_{i,1},\ldots,\tilde{Q}_{i,K})\in\mathcal{P}([L])^{K}$ where $i\in[2]$ be three collections of distributions. Given any $(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2})\in\mathcal{P}([L])^{3K}$ , any $(P,\tilde{P}_{1},\tilde{P}_{2})\in\mathcal{P}(\mathcal{X})^{3}$ , any $\alpha\in\mathbb{R}_{+}$ , any pair $(\mathbf{a},\mathbf{b})\in\mathcal{P}([K])^{2}$ , define the following linear combination of divergences

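$$\mathrm{LD}(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2},P,\tilde{P}_{1},\tilde{P}_{2}|\alpha,\mathbf{a},\mathbf{b},\mathcal{W}):=\sum_{k\in[K]}\Big(a_{k}D(Q_{k}\|PW_{k})+\sum_{i\in[2]}\alpha b_{k}D(\tilde{Q}_{i,k}\|\tilde{P}_{i}W_{k})\Big),$$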
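The linear combination of divergences $\mathrm{LD}$ compares, channel by channel, each output-side distribution with the push-forward of an input distribution through $W_{k}$, weighting the terms by $a_{k}$ and $\alpha b_{k}$. A minimal sketch, assuming each channel is a row-stochastic list of lists `W[x][z]` and every collection is a list of $K$ probability vectors; all function names are hypothetical, and terms with zero weight are skipped so that unused channels do not contribute.

```python
import math


def kl(p, q):
    """KL divergence D(p||q) in nats, with the convention 0 log 0 = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def push_forward(P, W):
    """Output distribution (PW)(z) = sum_x P(x) W(z|x)."""
    return [sum(P[x] * W[x][z] for x in range(len(P))) for z in range(len(W[0]))]


def ld(Q, Qt1, Qt2, P, Pt1, Pt2, alpha, a, b, channels):
    """Weighted sum over channels k of a_k D(Q_k||P W_k)
    plus alpha b_k [D(Qt1_k||Pt1 W_k) + D(Qt2_k||Pt2 W_k)]."""
    total = 0.0
    for k, W in enumerate(channels):
        if a[k] > 0:
            total += a[k] * kl(Q[k], push_forward(P, W))
        if alpha * b[k] > 0:
            total += alpha * b[k] * (
                kl(Qt1[k], push_forward(Pt1, W)) + kl(Qt2[k], push_forward(Pt2, W))
            )
    return total
```

Evaluating the constraint set $\mathcal{Q}_{\lambda}$ or the test statistic additionally requires minimizing this quantity over $(\tilde{P},P)$, which is not attempted here.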

and furthermore, given any $\lambda\in\mathbb{R}_{+}$ , define the following set of collections of distributions

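$$\mathcal{Q}_{\lambda}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W}):=\Big\{(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2})\in\mathcal{P}([L])^{3K}:\min_{(\tilde{P},P)\in\mathcal{P}(\mathcal{X})^{2}}\mathrm{LD}(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2},\tilde{P},\tilde{P},P|\alpha,\mathbf{a},\mathbf{b},\mathcal{W})\leq\lambda\Big\}.$$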

Finally, define the following minimum linear combination of divergences over the collections of distributions in $\mathcal{Q}_{\lambda}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ as

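$$f_{\alpha}(P_{1},P_{2}|\mathbf{a},\mathbf{b},\mathcal{W}):=\min_{(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2})\in\mathcal{Q}_{\lambda}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})}\mathrm{LD}(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2},P_{2},P_{1},P_{2}|\alpha,\mathbf{a},\mathbf{b},\mathcal{W}).$$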

II-C Main Results

The following theorem is our main result and presents a single-letter expression for the optimal type-II exponent.

Theorem 1.

Given any $(\lambda,\alpha)\in\mathbb{R}_{+}^{2}$ , any pair of target distributions $(P_{1},P_{2})\in\mathcal{P}(\mathcal{X})^{2}$ ,

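$$\lim_{n\to\infty}E^{*}(n,\alpha,P_{1},P_{2},\lambda|\mathbf{a},\mathbf{b},\mathcal{W})=f_{\alpha}(P_{1},P_{2}|\mathbf{a},\mathbf{b},\mathcal{W}).$$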

The proof of Theorem 1 is given in Appendix A. Several remarks are in order.

Firstly, in the achievability proof of Theorem 1, we make use of the following test at the fusion center

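$$\gamma(Z^{n},\tilde{Y}_{1}^{N},\tilde{Y}_{2}^{N})=\begin{cases}\mathrm{H}_{1}&\text{if }\min_{\tilde{P},P}\mathrm{LD}\big(\mathbf{T}_{\mathbf{Z}^{n\mathbf{a}}},\mathbf{T}_{\tilde{\mathbf{Y}}_{1}^{N\mathbf{b}}},\mathbf{T}_{\tilde{\mathbf{Y}}_{2}^{N\mathbf{b}}},\tilde{P},\tilde{P},P\big)\leq\lambda,\\ \mathrm{H}_{2}&\text{otherwise},\end{cases}$$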

where we suppressed the dependence of $\mathrm{LD}$ on $(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ and for each $k\in[K]$ , we use $Z^{na_{k}}$ to denote the collection of $Z_{i}$ where $i\in[n]$ satisfies $h(i)=k$ and similarly for $\tilde{Y}_{j}^{Nb_{k}}$ for $j\in[2]$ . Furthermore, we use $\mathbf{T}_{\mathbf{z}^{n\mathbf{a}}}$ to denote the vector of types $(T_{z^{na_{1}}},\ldots,T_{z^{na_{K}}})$ and use $\mathbf{T}_{\tilde{\mathbf{y}}_{j}^{N\mathbf{b}}}$ for $j\in[2]$ similarly. Theorem 1 indicates that the test in (II-C) is asymptotically optimal. The test in (II-C) basically compares a certain distance between $\mathbf{T}_{\mathbf{z}^{n\mathbf{a}}}$ and $\mathbf{T}_{\tilde{\mathbf{y}}_{1}^{N\mathbf{b}}}$ plus a bias term related to $\mathbf{T}_{\tilde{\mathbf{y}}_{2}^{N\mathbf{b}}}$ to a threshold $\lambda$ . When the distance is small enough, we declare that $\mathbf{T}_{\mathbf{z}^{n\mathbf{a}}}$ and $\mathbf{T}_{\tilde{\mathbf{y}}_{1}^{N\mathbf{b}}}$ are generated according to the same distribution; otherwise, we declare that they are not.

Secondly, the test in (II-C) is a generalization of Gutman’s test in [5]. To see this, we note that if we let $K=1$ , $M=L$ and consider the deterministic channel denoted as $W=I_{L}$ , the test in (II-C) reduces to Gutman’s test using $(Z^{n},\tilde{Y}_{1}^{N},\tilde{Y}_{2}^{N})$ since

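$$\min_{\tilde{P},P}\mathrm{LD}\big(\mathbf{T}_{\mathbf{Z}^{n\mathbf{a}}},\mathbf{T}_{\tilde{\mathbf{Y}}_{1}^{N\mathbf{b}}},\mathbf{T}_{\tilde{\mathbf{Y}}_{2}^{N\mathbf{b}}},\tilde{P},\tilde{P},P\,\big|\,\alpha,\mathbf{a},\mathbf{b},\{I_{L}\}\big)=\min_{\tilde{P},P}\Big(D(T_{Z^{n}}\|\tilde{P})+\alpha D(T_{\tilde{Y}_{1}^{N}}\|\tilde{P})+\alpha D(T_{\tilde{Y}_{2}^{N}}\|P)\Big)=\mathrm{GJS}(T_{\tilde{Y}_{1}^{N}},T_{Z^{n}},\alpha),$$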

and the exponent in Theorem 1 reduces to the type-II exponent for binary classification [5, Thm. 3], i.e.,

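$$\gamma(Z^{n},\tilde{Y}_{1}^{N},\tilde{Y}_{2}^{N})=\begin{cases}\mathrm{H}_{1}&\text{if }\mathrm{GJS}(T_{\tilde{Y}_{1}^{N}},T_{Z^{n}},\alpha)\leq\lambda,\\ \mathrm{H}_{2}&\text{otherwise},\end{cases}$$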

and

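$$\lim_{n\to\infty}E^{*}(n,\alpha,P_{1},P_{2},\lambda|\mathbf{a},\mathbf{b},\mathcal{W})=\min_{(Q,\tilde{Q})\in\mathcal{P}(\mathcal{Z})^{2}:\,\mathrm{GJS}(\tilde{Q},Q,\alpha)\leq\lambda}D(Q\|P_{2})+\alpha D(\tilde{Q}\|P_{1}).$$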
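In the special case $K=1$, $W=I_{L}$ discussed above, the test depends only on the types of the test sequence and the first training sequence, which makes it easy to sketch in code. The snippet below is an illustration under our own assumptions rather than the authors' implementation: symbols are integers in $\{0,\ldots,L-1\}$, $\alpha$ is taken to be `len(y1) / len(z)`, and all function names are hypothetical.

```python
import math
from collections import Counter


def empirical_type(seq, L):
    """Type of seq over the alphabet {0, ..., L-1} as a list of frequencies."""
    counts = Counter(seq)
    return [counts[s] / len(seq) for s in range(L)]


def kl(p, q):
    """KL divergence D(p||q) in nats, with the convention 0 log 0 = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def gjs(q_tilde, q, alpha):
    """Generalized Jensen-Shannon divergence GJS(q_tilde, q, alpha), alpha > 0."""
    mix = [(qi + alpha * qti) / (1 + alpha) for qi, qti in zip(q, q_tilde)]
    return kl(q, mix) + alpha * kl(q_tilde, mix)


def gutman_test(z, y1, lam, L):
    """Declare H1 iff GJS(T_{y1}, T_z, alpha) <= lam, with alpha = N / n."""
    alpha = len(y1) / len(z)
    stat = gjs(empirical_type(y1, L), empirical_type(z, L), alpha)
    return "H1" if stat <= lam else "H2"
```

The second training sequence does not appear in the statistic because the minimization over $P$ sets $P=T_{\tilde{Y}_{2}^{N}}$ and makes the corresponding divergence vanish.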

Finally, to better understand the effect of not knowing the true distributions, we plot the optimal type-II exponent $f_{\alpha}(P_{1},P_{2}|\mathbf{a},\mathbf{b},\mathcal{W})$ (defined in (8)) in Figure 3. As shown in Figure 3, the exponent increases as $\alpha=\frac{N}{n}$ increases and converges to a limit as $\alpha\to\infty$. This limit is the optimal error exponent when the true distributions are known [2, Theorem 2]. The gap $f_{\infty}-f_{\alpha}$ thus quantifies the loss incurred because the generating distributions are unknown and only training samples are available to the learner.

II-D Further Discussions on the Impact of the Proportions of Local Decision Rules $(\mathbf{a},\mathbf{b})$ on the Exponent

In this subsection, we discuss the choices of the proportion of local decision rules, denoted by $(\mathbf{a},\mathbf{b})$ , to achieve the optimal type-II exponent $f_{\alpha}(P_{1},P_{2}|\mathbf{a},\mathbf{b},\mathcal{W})$ . Throughout the section, we fix a pair of target distributions $(P_{1},P_{2})$ . For brevity, we define

$$f_{\alpha}(\mathbf{a},\mathbf{b},\lambda):=f_{\alpha}(P_{1},P_{2}|\mathbf{a},\mathbf{b},\mathcal{W}).$$

The type-II error exponent depends on $(\mathbf{a},\mathbf{b})$. Inspired by the result in [2], which states that one local decision rule is optimal for binary hypothesis testing (in the Neyman-Pearson and Bayesian settings), we can further optimize the type-II error exponent with respect to the design of the proportion of channels (encoded in $\mathbf{a}$ and $\mathbf{b}$) and thus study

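$$f_{\alpha}^{*}(\lambda):=\max_{(\mathbf{a},\mathbf{b})\in\mathcal{P}([K])^{2}}f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)$$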

and the corresponding optimizers $\mathbf{a}^{*}$ and $\mathbf{b}^{*}$ for different values of $\alpha$ . For this purpose, given any vector $\mathbf{v}\in\mathcal{P}([K])$ and any distribution $\tilde{P}\in\mathcal{P}(\mathcal{X})$ , define

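$$\mathcal{P}(\tilde{P}|\mathbf{v},\mathcal{W}):=\big\{P\in\mathcal{P}(\mathcal{X}):\ \forall\,k\in[K],\ v_{k}\|PW_{k}-\tilde{P}W_{k}\|_{\infty}=0\big\}.$$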

Note that $\tilde{P}\in\mathcal{P}(\tilde{P}|\mathbf{v},\mathcal{W})$ and that if $\mathrm{supp}(\mathbf{v})=[K]$, then every $P\in\mathcal{P}(\tilde{P}|\mathbf{v},\mathcal{W})$ satisfies $PW_{k}=\tilde{P}W_{k}$ for all $k\in[K]$.

Furthermore, given any $\mathbf{Q}\in\mathcal{P}([L])^{K}$ , any $P_{1}\in\mathcal{P}(\mathcal{X})$ and any pair $(\mathbf{a},\mathbf{b})\in\mathcal{P}([K])^{2}$ , let

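$$\kappa(\mathbf{Q},P_{1}|\mathbf{a},\mathbf{b},\mathcal{W}):=\min_{\tilde{P}\in\mathcal{P}(P_{1}|\mathbf{b},\mathcal{W})}\sum_{k\in[K]}a_{k}D(Q_{k}\|\tilde{P}W_{k}).$$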

Lemma 2.

The function $f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)$ satisfies

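$$\lim_{\alpha\to\infty}f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)=f_{\infty}(\mathbf{a},\mathbf{b},\lambda):=\min_{\mathbf{Q}\in\mathcal{P}([L])^{K}:\,\kappa(\mathbf{Q},P_{1}|\mathbf{a},\mathbf{b},\mathcal{W})\leq\lambda}\sum_{k\in[K]}a_{k}D(Q_{k}\|P_{2}W_{k}).$$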

The proof of Lemma 2 is provided in Appendix B.

We say that $\mathbf{a}\in\mathcal{P}([K])$ is deterministic if there exists a $j\in[K]$ such that $a_{j}=1$. Let $\mathbf{e}_{j}$ be the $j$-th standard basis vector in $\mathbb{R}^{K}$, i.e., the vector $\mathbf{e}_{j}$ equals $1$ in the $j$-th location and $0$ in other locations.

Corollary 3.

Given any $\lambda\in\mathbb{R}_{+}$, we have

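$$\sup_{(\mathbf{a},\mathbf{b})\in\mathcal{P}([K])^{2}}f_{\infty}(\mathbf{a},\mathbf{b},\lambda)=\max_{k\in[K]}f_{\infty}(\mathbf{e}_{k},\mathbf{e}_{k},\lambda),$$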

and thus the maximizers $(\mathbf{a}^{*},\mathbf{b}^{*})$ for $f_{\infty}(\mathbf{a},\mathbf{b},\lambda)$ satisfy that $(\mathbf{a}^{*},\mathbf{b}^{*})$ are both deterministic and $\mathbf{a}^{*}=\mathbf{b}^{*}$ .

The proof of Corollary 3 is provided in Appendix C. Corollary 3 says that when the training sequence is much longer than the test sequence, it is optimal to use a single local decision rule or channel to pre-process both the training data and the source sequence; this is analogous to [2, Theorem 1].

Given any $\alpha\in\mathbb{R}_{+}$ and any $(\mathbf{a},\mathbf{b})\in\mathcal{P}([K])^{2}$ , let

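$$\begin{aligned}\mathrm{G}_{\alpha}(\mathbf{a},\mathbf{b})&:=\min_{(\tilde{P},P)\in\mathcal{P}(\mathcal{X})^{2}}\sum_{k\in[K]}\Big(a_{k}D(P_{2}W_{k}\|\tilde{P}W_{k})+\alpha b_{k}D(P_{1}W_{k}\|\tilde{P}W_{k})+\alpha b_{k}D(P_{2}W_{k}\|PW_{k})\Big)\\ &=\min_{\tilde{P}\in\mathcal{P}(\mathcal{X})}\sum_{k\in[K]}\Big(a_{k}D(P_{2}W_{k}\|\tilde{P}W_{k})+\alpha b_{k}D(P_{1}W_{k}\|\tilde{P}W_{k})\Big).\end{aligned}$$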

Given any $\lambda\in\mathbb{R}_{+}$ , let $\alpha_{0}(\mathbf{a},\mathbf{b},\lambda)$ be the solution (in $\alpha$ ) to the following equation

$$\lambda=\mathrm{G}_{\alpha}(\mathbf{a},\mathbf{b}).$$

Since $\mathrm{G}_{\alpha}(\mathbf{a},\mathbf{b})$ is an increasing function of $\alpha$ and $\mathrm{G}_{0}(\mathbf{a},\mathbf{b})=0$ , for any $\lambda\in\mathbb{R}_{+}$ , we have $\alpha_{0}(\mathbf{a},\mathbf{b},\lambda)>0$ unless $\lambda=0$ .
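As a sanity check under simplifying assumptions that are ours rather than part of the general setting, take $K=1$, $M=L$ and $W_{1}=I_{L}$, so that $\mathbf{a}=\mathbf{b}=(1)$. The minimization over $\tilde{P}$ of $D(P_{2}\|\tilde{P})+\alpha D(P_{1}\|\tilde{P})$ is then solved by the mixture $\tilde{P}^{*}=(P_{2}+\alpha P_{1})/(1+\alpha)$, and the threshold quantity reduces to a generalized Jensen-Shannon divergence:

$$\mathrm{G}_{\alpha}(1,1)=D\Big(P_{2}\Big\|\frac{P_{2}+\alpha P_{1}}{1+\alpha}\Big)+\alpha D\Big(P_{1}\Big\|\frac{P_{2}+\alpha P_{1}}{1+\alpha}\Big)=\mathrm{GJS}(P_{1},P_{2},\alpha).$$

In this special case $\alpha_{0}$ is the value of $\alpha$ at which $\mathrm{GJS}(P_{1},P_{2},\alpha)=\lambda$. One can check that this quantity equals $0$ at $\alpha=0$ and approaches $D(P_{2}\|P_{1})$ as $\alpha\to\infty$, so such a solution should be expected only when $\lambda<D(P_{2}\|P_{1})$.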

Lemma 4.

Given any $(\mathbf{a},\mathbf{b})\in\mathcal{P}([K])^{2}$ and any $\lambda\in\mathbb{R}_{+}$ , if $\alpha\in[0,\alpha_{0}(\mathbf{a},\mathbf{b},\lambda)]$ , then

$$f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)=0.$$

We verified Lemma 4 numerically by plotting $f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)$ as a function of $\alpha$ when $\alpha$ is small for certain values of $(\mathbf{a},\mathbf{b},\lambda)$ in Figure 4. The proof of Lemma 4 is straightforward since when $\alpha\leq\alpha_{0}(\mathbf{a},\mathbf{b},\lambda)$, $(\mathbf{Q}^{*},\tilde{\mathbf{Q}}_{1}^{*},\tilde{\mathbf{Q}}_{2}^{*})\in\mathcal{Q}_{\lambda}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ where $Q_{k}^{*}=P_{2}W_{k}$, $\tilde{Q}_{1,k}^{*}=P_{1}W_{k}$ and $\tilde{Q}_{2,k}^{*}=P_{2}W_{k}$ and thus $\mathrm{LD}(\mathbf{Q}^{*},\tilde{\mathbf{Q}}_{1}^{*},\tilde{\mathbf{Q}}_{2}^{*},P_{2},P_{1},P_{2}|\alpha,\mathbf{a},\mathbf{b},\mathcal{W})=0$. The intuition is that when $\alpha$ is small enough, for any $\lambda>0$, the decision rule $\gamma$ in (II-C) always declares $\mathrm{H}_{1}$, which means that $\beta_{2}(\gamma,P_{1},P_{2})=1$, so the corresponding exponent is identically $0$.

II-E Numerical Study on Optimal Proportions of Local Decision Rules

In the following, we present numerical results to illustrate the properties of the optimal proportions of local decision rules $(\mathbf{a}_{\alpha}^{*},\mathbf{b}_{\alpha}^{*}):=\operatorname*{arg\,max}_{\mathbf{a},\mathbf{b}}f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)$ when $\alpha\in(\alpha_{0}(\mathbf{a},\mathbf{b},\lambda),\infty)$ ; that is, $\alpha$ is moderate.

When $K=2$ , regardless of the stochasticity of the channels in $\mathcal{W}$ , we find that the maximal value of $f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)$ always lies at a corner point of the feasible set of $(\mathbf{a},\mathbf{b})$ . See Figure 5 for numerical examples. When $K\geq 3$ , we find that the results are more involved and it is not necessarily optimal to use only one local decision rule. To clarify our observations, we first describe cases where our numerical calculations of the exponent suggest that it is optimal to use only one local decision rule to achieve the optimal type-II error exponent; this is analogous to [2, Theorem 2]. We then consider other cases and briefly discuss why it is not always optimal to use one local decision rule.

II-E1 When One Local Decision Rule is Optimal

In most practical distributed detection systems, the local decision rule at each sensor is a deterministic compressor or quantizer. However, under certain conditions, randomized local decision rules can be used to provide privacy [17, 18, 19] or to satisfy power constraints [20, Sec. IV]. We now describe a class of local decision rules for which the exponent can be simplified and numerical calculations of the exponents suggest that full diversity of local decision rules is unnecessary.

Let $\mathcal{V}_{\mathrm{I}}$ be the set of stochastic matrices (channels) with $M=|\mathcal{X}|$ rows and $L=|\mathcal{Z}|$ columns whose rows contain a permutation of the rows of $I_{L}$ , the $L\times L$ identity matrix. The set $\mathcal{V}_{\mathrm{I}}$ includes all deterministic mappings (e.g., Figure 6(b)) and a subset of stochastic mappings as long as for each $z\in\mathcal{Z}$ , there exists an $x_{z}\in\mathcal{X}$ that maps directly to it, as illustrated in Figure 6(a). Note that Tsitsiklis [2] considers only deterministic local decision rules, which certainly fall into the class $\mathcal{V}_{\mathrm{I}}$ . The definition is extended in the obvious way if $M=\infty$ (i.e., for all $z\in\mathcal{Z}$ , there exists $x_{z}\in\mathcal{X}$ such that $V(z|x_{z})=1$ ).
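As an illustration of the membership condition for $\mathcal{V}_{\mathrm{I}}$ , the following minimal sketch checks whether a finite channel matrix is row-stochastic and whether every output $z$ has some input $x_{z}$ with $V(z|x_{z})=1$ . The function names and the three example matrices are hypothetical and are not taken from the paper. A deterministic map passes this check only if it is onto $\mathcal{Z}$ , which is how the "for each $z$ " condition is read here.

```python
def is_row_stochastic(V, tol=1e-9):
    """Each row V[x] must be a probability vector over the output alphabet."""
    return all(
        all(p >= -tol for p in row) and abs(sum(row) - 1.0) <= tol
        for row in V
    )


def in_class_VI(V, tol=1e-9):
    """Return True if the M x L matrix V (rows = inputs x, columns = outputs z)
    is stochastic and every output z has an input x_z with V(z|x_z) = 1."""
    if not V:
        return False
    L = len(V[0])
    if any(len(row) != L for row in V):
        return False
    if not is_row_stochastic(V, tol):
        return False
    return all(
        any(abs(row[z] - 1.0) <= tol for row in V)
        for z in range(L)
    )


# Hypothetical examples (assumptions for illustration only).
deterministic = [[1, 0], [1, 0], [0, 1]]       # a 3-to-2 quantizer
partly_noisy = [[1, 0], [0.3, 0.7], [0, 1]]    # only the middle input is randomized
fully_noisy = [[0.9, 0.1], [0.2, 0.8]]         # no input maps to an output with probability 1
```

Under this reading, `deterministic` and `partly_noisy` belong to the class while `fully_noisy` does not.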

We assume that the second training sequence $Y_{2}^{N}$ is pre-processed by one local decision rule $V\in\mathcal{V}_{I}$ , i.e., $\tilde{Y}_{2,i}\sim V(\cdot|Y_{2,i})$ for all $i\in[N]$ . The channels $\{W_{j}\}_{j\in[K]}$ that are used to pre-process the test sequence $X^{n}$ and the first training sequence $Y_{1}^{N}$ are arbitrary. Under such a setting, we can simplify the asymptotically optimal error exponent and test (cf. (9) and (II-C)) as follows:

[TABLE]

where $\mathcal{Q}_{\lambda,{\mathcal{V}_{\mathrm{I}}}}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W}):=\big{\{}(\mathbf{Q},\tilde{\mathbf{Q}})\in\mathcal{P}([L])^{K}:\min_{\tilde{P}\in\mathcal{P}(\mathcal{X})}\sum_{k\in[K]}\big{(}a_{k}D(Q_{k}\|\tilde{P}W_{k})+\alpha b_{k}D(\tilde{Q}_{k}\|\tilde{P}W_{k})\big{)}\leq\lambda\big{\}}$ , and $\gamma_{\mathcal{V}_{\mathrm{I}}}$ is given in (30).

By calculating $f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)$ for various $(P_{1},P_{2})$ and $\mathcal{W}$ , we find that when $\alpha$ is moderate (i.e., neither $\leq\alpha_{0}$ nor $\infty$ ), the maximal value of $f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)$ always lies at a corner point of the feasible set of $(\mathbf{a},\mathbf{b})$ , as shown in Figure 7. Additional numerical results are shown in Appendix -D. Inspired by these numerical results, we present the following conjecture:

Conjecture 5.

For all $\alpha,\lambda\in\mathbb{R}_{+}$ , the vectors $\mathbf{a}_{\alpha}^{*}$ and $\mathbf{b}_{\alpha}^{*}$ that maximize $f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)$ are deterministic if the second training sequence $Y_{2}^{N}$ is pre-processed by a single channel $V\in\mathcal{V}_{I}$ .
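The corner-point observations above come from evaluating the exponent over the feasible set of $(\mathbf{a},\mathbf{b})$ . The sketch below shows one way such a search could be organized: it enumerates a grid on $\mathcal{P}([K])^{2}$ , maximizes a user-supplied objective, and reports whether the maximizer is a corner point. The objective `toy_objective` is a convex stand-in chosen only so that the script runs; it is not the paper's $f_{\alpha}$ , and all names here are assumptions.

```python
from itertools import product


def simplex_grid(K, steps):
    """Yield all probability vectors on [K] with entries in multiples of 1/steps."""
    for combo in product(range(steps + 1), repeat=K - 1):
        used = sum(combo)
        if used <= steps:
            yield tuple(c / steps for c in combo) + ((steps - used) / steps,)


def is_corner(v, tol=1e-12):
    """A corner point of the simplex puts all mass on one channel."""
    return any(abs(x - 1.0) <= tol for x in v)


def argmax_on_grid(f, K, steps=10):
    """Brute-force maximization of f(a, b) over the product grid."""
    best, best_val = None, float("-inf")
    for a in simplex_grid(K, steps):
        for b in simplex_grid(K, steps):
            val = f(a, b)
            if val > best_val:
                best, best_val = (a, b), val
    return best, best_val


def toy_objective(a, b):
    # Stand-in objective (NOT the paper's f_alpha): convex in (a, b),
    # so its maximum over the simplex product is attained at a corner.
    weights = (1.0, 0.5, 0.25)
    return sum(x * x for x in a) + sum(w * y * y for w, y in zip(weights, b))


(best_a, best_b), best_val = argmax_on_grid(toy_objective, K=3, steps=10)
print(best_a, best_b, is_corner(best_a) and is_corner(best_b))
```

Replacing `toy_objective` with a numerical evaluation of the actual exponent would turn this into the kind of check that motivates Conjecture 5, at the cost of solving the inner minimization for every grid point.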

II-E2 Results without Assumptions on Channels

In this subsection, we discuss the general case where there is no assumption (e.g., deterministic or membership in $\mathcal{V}_{\mathrm{I}}$ ) on local decision rules used to pre-process $X^{n}$ , $Y_{1}^{N}$ and $Y_{2}^{N}$ .

By calculating $f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)$ for various $(P_{1},P_{2})$ and $\mathcal{W}$ , we find that when $K=3$ and $\alpha$ is moderate, the maximal value of $f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)$ does not always lie at a corner point; instead, it may occur at non-corner points within the feasible set of $(\mathbf{a},\mathbf{b})$ . We illustrate this in Figure 8 for the case of stochastic $W_{1},W_{2},W_{3}$ and for some $\mathbf{b}=(b_{1},b_{2},b_{3})$ .

Our numerical results suggest that without the knowledge of true distributions, it is not always optimal to use only one identical channel to process the test and training samples, which differs from the result in Tsitsiklis’ paper [2]. The difference can be explained intuitively as follows. First, in [2], deterministic decision rules are considered while in our setting, we allow the channels to be stochastic. Second, when $\alpha$ is moderate, we are not able to estimate the true distributions with arbitrary accuracy using the training samples. Thus, to combat the randomness induced by the channels and to compensate for the loss of (full) knowledge of the true distributions, the fusion center may require more information; hence the need for more diversity in the local decision rules.

II-F Connections to Results in Distributed Detection

We discuss the connections between Theorem 1 and [2], which concerns distributed detection when the underlying distributions are known. Throughout this subsection, to emphasize the dependence of error probabilities on $(\mathbf{a},\mathbf{b})$ , we use $\beta_{\nu}(\gamma,P_{1},P_{2}|\mathbf{a},\mathbf{b})$ to denote the type- $\nu$ error probability with respect to distributions $(P_{1},P_{2})$ when test $\gamma$ is used at the fusion center.

We first consider the Neyman-Pearson setting [21, Sec. 11.8]. Given any $\varepsilon\in[0,1]$ , let $\Gamma_{\varepsilon}(\mathbf{a},\mathbf{b})$ be the set of tests satisfying that for all $(\tilde{P}_{1},\tilde{P}_{2})\in\mathcal{P}(\mathcal{X})^{2}$ ,

[TABLE]

Let the optimal type-II error probability subject to (31) be

[TABLE]

Note that $\beta_{2}^{*}(P_{1},P_{2})$ depends on $n$ , $\alpha$ and $\varepsilon$ but this dependence is suppressed for the sake of brevity.

Corollary 6.

Given any $V\in\mathcal{V}_{\mathrm{I}}$ and any $(P_{1},P_{2})\in\mathcal{P}(\mathcal{X})^{2}$ ,

[TABLE]

Proof sketch of Corollary 6.

The direct parts of the result are corollaries of Theorem 1, Lemma 2 and Corollary 3 by letting $\lambda=\frac{1}{n}\log\frac{1}{\varepsilon}\downarrow 0$ . The converse parts follow from [2, Theorem 2] where the distributions are known since one can never obtain better (larger) exponents with unknown distributions than with known distributions. Since the justifications are straightforward, we omit the details for the sake of brevity. ∎

We also consider the Bayesian setting. Assume the prior probabilities for $\mathrm{H}_{1}$ and $\mathrm{H}_{2}$ are $\pi_{1}$ and $\pi_{2}$ respectively. Clearly, $\pi_{1}+\pi_{2}=1$ . Given any $(\mathbf{a},\mathbf{b})\in\mathcal{P}_{n}([K])\times\mathcal{P}_{\alpha n}([K])$ and any $\mathcal{W}$ , let the Bayesian error probability be

[TABLE]

Furthermore, let the maximum Chernoff information between $P_{2}W_{k}$ and $P_{1}W_{k}$ be

[TABLE]

and let $\Gamma_{\rm{Bayes}}(\mathbf{a},\mathbf{b})$ be the set of tests at the fusion center satisfying that for all $(\tilde{P}_{1},\tilde{P}_{2})\in\mathcal{P}(\mathcal{X})^{2}$ ,

[TABLE]

Finally, let the optimal Bayesian error probability be

[TABLE]

Again $\mathrm{P}_{\mathrm{e}}^{*}(P_{1},P_{2})$ depends on both $n$ and $\alpha$ .

Corollary 7.

Given any $V\in\mathcal{V}_{\mathrm{I}}$ and any $(P_{1},P_{2})\in\mathcal{P}(\mathcal{X})^{2}$ ,

[TABLE]

Proof sketch of Corollary 7.

The direct parts of the results are corollaries of Theorem 1, Lemma 2 and Corollary 3 by solving $\max_{k\in[K]}f_{\infty}(\mathbf{e}_{k},\mathbf{e}_{k},\lambda)=\lambda$ . The (strong) converse parts follow from [2]. ∎

Under the Bayesian setting, the exponents of the type-I and type-II error probabilities are equal [21, Thm. 11.9.1].
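To make the quantity appearing in the Bayesian setting concrete, the sketch below computes the Chernoff information between the output distributions $P_{2}W_{k}$ and $P_{1}W_{k}$ and maximizes it over $k$ . It assumes the standard definition $C(Q_{1},Q_{2})=-\min_{s\in[0,1]}\log\sum_{z}Q_{1}(z)^{s}Q_{2}(z)^{1-s}$ with natural logarithms, approximated by a grid search over $s$ ; the function names and the example distributions and channels are hypothetical.

```python
import math


def push_through(P, W):
    """Output distribution PW, i.e. (PW)(z) = sum_x P(x) W(z|x)."""
    L = len(W[0])
    return [sum(P[x] * W[x][z] for x in range(len(P))) for z in range(L)]


def chernoff_information(Q1, Q2, grid=1000):
    """Grid-search approximation of the Chernoff information (in nats)."""
    best = math.inf
    for i in range(grid + 1):
        s = i / grid
        total = sum(
            (q1 ** s) * (q2 ** (1.0 - s))
            for q1, q2 in zip(Q1, Q2)
            if q1 > 0 and q2 > 0          # restrict to the common support
        )
        if total <= 0.0:
            return math.inf               # disjoint supports
        best = min(best, math.log(total))
    return -best


def max_chernoff(P1, P2, channels):
    """Maximum over k of the Chernoff information between P2 W_k and P1 W_k."""
    return max(
        chernoff_information(push_through(P2, W), push_through(P1, W))
        for W in channels
    )


# Hypothetical example: a noiseless channel and a useless channel.
P1, P2 = [0.1, 0.9], [0.9, 0.1]
identity = [[1, 0], [0, 1]]
useless = [[0.5, 0.5], [0.5, 0.5]]
print(max_chernoff(P1, P2, [identity, useless]))
```

In this example the maximum is attained by the noiseless channel, since the useless channel maps both hypotheses to the same output distribution.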

Note that Corollaries 6 and 7 are analogous to the results on distributed detection in [2] for the binary case under the Neyman-Pearson and Bayesian settings respectively, where the true distributions $(P_{1},P_{2})$ are known. The intuition is that when the training sequences are much longer than the source sequence (i.e., $\alpha\to\infty$ ), we can estimate the true distributions to arbitrary precision.

III $m$ -ary Distributed Detection with the Rejection Option and Training Samples

In this section, we generalize the binary distributed detection problem to the scenario in which we desire to discriminate between $m\geq 2$ hypotheses with the rejection option. Our main contribution here is the identification of an appropriate test statistic and test that achieves the optimal rejection exponent for a fixed lower bound on all error exponents.

III-A Problem Formulation

In the $m$ -ary distributed detection problem, there are $m$ training sequences $\{Y_{i}^{N}\}_{i\in[m]}$ , each generated i.i.d. according to an unknown distribution $P_{i}\in\mathcal{P}(\mathcal{X})$ . There are $n$ sensors. Each sensor observes a source symbol $X_{i}$ and compresses/processes it into a noisy version $Z_{i}$ just as in the binary distributed detection problem. Given noisy training sequences $\{\tilde{Y}_{i}^{N}\}_{i\in[m]}$ and the compressed source sequence $Z^{n}$ , in which $\tilde{Y}_{i,j}\sim W_{g(j)}(\cdot|Y_{i,j})$ for all $i\in[m]$ and $j\in[N]$ , the fusion center uses a decision rule $\gamma:[L]^{mN+n}\mapsto\{\mathrm{H}_{1},\ldots,\mathrm{H}_{m},\mathrm{H}_{\mathrm{r}}\}$ to discriminate among the following $m+1$ hypotheses:

•

$\mathrm{H}_{j},\,\forall\,j\in[m]$ : the source sequence $X^{n}$ and the $j$ -th training sequence $Y_{j}^{N}$ are generated according to the same distribution;

•

$\mathrm{H}_{\mathrm{r}}$ : the source sequence $X^{n}$ is generated according to a distribution different from those which the training sequences are generated from and hence we reject all $\mathrm{H}_{j},\,\forall\,j\in[m]$ .

Thus, the decision rule $\gamma$ partitions the sample space $[L]^{mN+n}$ into $m+1$ disjoint regions: $m$ acceptance regions $\{\Lambda_{j}(\gamma)\}_{j\in[m]}$ , where $\Lambda_{j}(\gamma)$ favors hypothesis $\mathrm{H}_{j}$ , i.e.,

[TABLE]

and one rejection region $\Lambda_{\mathrm{r}}(\gamma):=(\cup_{j\in[m]}\Lambda_{j}(\gamma))^{\mathrm{c}}$ which favors hypothesis $\mathrm{H}_{\mathrm{r}}$ . Note that here we assume that all $m$ training sequences are processed with channels in $\mathcal{W}$ using the same index mapping function $g$ . That is, all the first components $Y_{1,1},\ldots,Y_{m,1}$ are passed through the same channel, which is one element from $\mathcal{W}$ . The same is true for all the other $N-1$ components.
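The data model described above can be simulated with a short script. The sketch below draws the source sequence under a chosen hypothesis, passes it through the channels selected by the index map $h$ , and passes all $m$ training sequences through the channels selected by the same index map $g$ . Indices are 0-based in the code, and the function names, distributions and channels are assumptions made for illustration.

```python
import random


def apply_channel(W, x, rng):
    """Draw z ~ W(.|x), where W[x] is a probability vector over [L]."""
    return rng.choices(range(len(W[x])), weights=W[x], k=1)[0]


def proportions(index_map, K):
    """Fraction of positions assigned to each of the K channels."""
    n = len(index_map)
    return [sum(1 for k in index_map if k == j) / n for j in range(K)]


def simulate(P_list, true_j, channels, h, g, rng):
    """Return (Z^n, [Ytilde_1^N, ..., Ytilde_m^N]) when the source sequence
    is generated from P_list[true_j]."""
    n, N = len(h), len(g)
    P_src = P_list[true_j]
    X = rng.choices(range(len(P_src)), weights=P_src, k=n)
    Z = [apply_channel(channels[h[i]], X[i], rng) for i in range(n)]
    Y_tilde = []
    for P in P_list:
        Y = rng.choices(range(len(P)), weights=P, k=N)
        # The same index map g is used for every training sequence, so the
        # t-th symbols of all m sequences go through the same channel.
        Y_tilde.append([apply_channel(channels[g[t]], Y[t], rng) for t in range(N)])
    return Z, Y_tilde


# Hypothetical example with m = 3 hypotheses, K = 2 channels and alpha = 2.
rng = random.Random(0)
P_list = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]
channels = [[[1, 0], [0, 1]], [[0.9, 0.1], [0.1, 0.9]]]
h = [0, 1, 0, 1]
g = [0, 0, 1, 1, 0, 1, 0, 1]
Z, Y_tilde = simulate(P_list, 0, channels, h, g, rng)
print(proportions(h, 2), proportions(g, 2))
```

The `proportions` helper returns the vectors $\mathbf{a}$ and $\mathbf{b}$ induced by $h$ and $g$ .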

For conciseness, we set $\mathbf{Y}^{N}=(Y_{1}^{N},\ldots,Y_{m}^{N})$ and use $\tilde{\mathbf{Y}}^{N}$ similarly. Furthermore, we set $\mathbf{P}=(P_{1},\ldots,P_{m})$ and use $\tilde{\mathbf{P}}$ and $\mathbf{Q}$ similarly. Recall the definition of $\mathbf{a},\mathbf{b}$ and the assumption that $N=\lceil\alpha n\rceil$ . Given any decision rule $\gamma$ at the fusion center and any tuple of distributions $\mathbf{P}$ , the performance metrics we consider are the error probabilities and the rejection probabilities for each $j\in[m]$ , i.e.,

[TABLE]

We use $\beta_{j}(\gamma,\mathbf{P})$ and $\beta_{\mathrm{r},j}(\gamma,\mathbf{P})$ in place of $\beta_{j}(\gamma,\mathbf{P}|\mathbf{a},\mathbf{b},\mathcal{W})$ and $\beta_{\mathrm{r},j}(\gamma,\mathbf{P}|\mathbf{a},\mathbf{b},\mathcal{W})$ respectively if there is no risk of confusion. For this setting, we are interested in tests that can simultaneously ensure exponential decay of the error probabilities under any hypothesis for any tuple of distributions and exponential decay of the rejection probabilities under each hypothesis for a particular tuple of distributions. To be concrete, given any tuple of distributions $\mathbf{P}$ and any $\lambda\in\mathbb{R}_{+}$ , we are interested in the following optimal exponent of the rejection probability under hypothesis $\mathrm{H}_{j}$ :

[TABLE]

We emphasize that in this formulation, under each hypothesis, the error exponent is at least $\lambda$ for all tuples of distributions.

III-B Main Results

Before presenting the main result, we present some preliminary definitions. Recall that for each $k\in[K]$ , we use $z^{na_{k}}$ to denote the collection of $z_{i}$ satisfying $h(i)=k$ , use $\mathbf{z}^{n\mathbf{a}}$ to denote $(z^{na_{1}},\ldots,z^{na_{K}})$ and use $\mathbf{T}_{\mathbf{z}^{n\mathbf{a}}}$ to denote the vector of types $(T_{z^{na_{1}}},\ldots,T_{z^{na_{K}}})$ . Similarly, for each $k\in[K]$ and $j\in[m]$ , we use the notations $\tilde{y}_{j}^{Nb_{k}}$ , $\tilde{\mathbf{y}}_{j}^{N\mathbf{b}}$ and $\mathbf{T}_{\tilde{\mathbf{y}}_{j}^{N\mathbf{b}}}$ . Given any tuple of distributions $\mathbf{P}=(P_{1},\ldots,P_{m})\in\mathcal{P}(\mathcal{X})^{m}$ and any $j\in[m]$ , define the following linear combination of divergences

[TABLE]

and furthermore, let

[TABLE]

Note that $\mathrm{LD}_{j}^{[m]}(\cdot)$ is a slight generalization of $\mathrm{LD}(\cdot)$ in (6).
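The partial types $\mathbf{T}_{\mathbf{z}^{n\mathbf{a}}}$ recalled at the start of this subsection are simply the empirical distributions of the symbols grouped by the channel that produced them. A minimal sketch, with 0-based indices and a hypothetical function name, is given below; returning an all-zero vector for a channel that is never used is a convention adopted here and is not taken from the paper.

```python
from collections import Counter


def partial_types(z, h, K, L):
    """Return [T_1, ..., T_K], where T_k is the type (empirical distribution
    over [L]) of the symbols z_i whose index map value h(i) equals k."""
    types = []
    for k in range(K):
        sub = [z[i] for i in range(len(z)) if h[i] == k]
        counts = Counter(sub)
        if sub:
            types.append([counts[s] / len(sub) for s in range(L)])
        else:
            # Convention for this sketch only: unused channel -> zero vector.
            types.append([0.0] * L)
    return types


# Hypothetical example: n = 6 symbols, K = 2 channels, output alphabet size L = 2.
z = [0, 1, 1, 0, 1, 1]
h = [0, 0, 0, 1, 1, 1]
print(partial_types(z, h, K=2, L=2))
```

The same function applies to each noisy training sequence $\tilde{y}_{j}^{N}$ with the index map $g$ in place of $h$ .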

In the following, we will see that $\widetilde{\mathrm{LD}}_{j}(\cdot)$ is an appropriate test statistic that will be used in the achievability proof and an optimized version of $\mathrm{LD}_{j}^{[m]}(\cdot)$ is the corresponding exponent. Finally, given any $(i,l)\in[m]^{2}$ satisfying $i\neq l$ , we define the following set of the collections of distributions:

[TABLE]

Note that if we choose $(\mathbf{Q},\{\tilde{\mathbf{Q}}_{i}\}_{i\in[m]})\in\mathcal{P}(\mathcal{X})^{(m+1)K}$ such that $\mathbf{Q}=\tilde{\mathbf{Q}}_{1}=\ldots=\tilde{\mathbf{Q}}_{m}$ and $Q_{1}=P_{0}W_{1},~{}Q_{2}=P_{0}W_{2},\ldots,Q_{K}=P_{0}W_{K}$ for any $P_{0}\in\mathcal{P}(\mathcal{X})$ , then for all $j\in[m]$ , we have

[TABLE]

and

[TABLE]

which implies that $\widetilde{\mathrm{LD}}_{i}(\mathbf{Q},\{\tilde{\mathbf{Q}}_{i}\}_{i\in[m]})=0$ for all $i\in[m]$ . Thus, $\mathcal{Q}_{\lambda,i,l}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ is non-empty for any $(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ .

Our main result in this section is the following asymptotic characterization of $E_{j}^{*}(n,\alpha,\mathbf{P},\lambda|\mathbf{a},\mathbf{b},\mathcal{W})$ .

Theorem 8.

Given any $(\lambda,\alpha)\in\mathbb{R}_{+}^{2}$ and any tuple of target distributions $\mathbf{P}\in\mathcal{P}(\mathcal{X})^{m}$ , for all $j\in[m]$ , we have

[TABLE]

The proof of Theorem 8 is given in Appendix -E.

First, as shown in Fig. 9, there exists $\lambda\in\mathbb{R}_{+}$ such that $\lim_{n\to\infty}E_{j}^{*}(n,\alpha,\mathbf{P},\lambda|\mathbf{a},\mathbf{b},\mathcal{W})<\lambda$ , which implies that the type- $j$ rejection exponents can be designed to be smaller than all the error exponents with an appropriate choice of $\lambda$ . This scenario is reminiscent of practical communication scenarios [22, 23] (automatic repeat request or ARQ) where the rejection probability is designed to be much larger than the (undetected) error probability as declaring a rejection is typically much less costly than a genuine mistake being made.

Second, let us describe the test that is used in the achievability proof of Theorem 8. This test is a generalized version of Unnikrishnan’s test [6]. Given any tuple of types $(\mathbf{T}_{\mathbf{Z}^{n\mathbf{a}}},\{\mathbf{T}_{\tilde{\mathbf{Y}}_{i}^{N\mathbf{b}}}\}_{i\in[m]})$ , we define the indices of the minimum and the second minimum of $\widetilde{\mathrm{LD}}_{i}(\mathbf{T}_{\mathbf{Z}^{n\mathbf{a}}},\{\mathbf{T}_{\tilde{\mathbf{Y}}_{i}^{N\mathbf{b}}}\}_{i\in[m]})$ over all $i\in[m]$ as

[TABLE]

and

[TABLE]

respectively. If the index of the right hand side of (50) is not unique, we define $i_{2}(\mathbf{Z}^{n},\tilde{\mathbf{Y}}^{N})$ as the smallest index of all $i\in[m]$ such that the value of $\widetilde{\mathrm{LD}}_{i}(\mathbf{T}_{\mathbf{Z}^{n\mathbf{a}}},\{\mathbf{T}_{\tilde{\mathbf{Y}}_{i}^{N\mathbf{b}}}\}_{i\in[m]})$ is second smallest. In the achievability proof of Theorem 8, we make use of the following test

[TABLE]

In words, we declare that $\mathrm{H}_{j}$ is true if the minimum of $\widetilde{\mathrm{LD}}_{i}$ occurs when $i=j$ and the second smallest $\widetilde{\mathrm{LD}}_{i}$ exceeds a certain threshold $\lambda$ . The latter condition intuitively indicates that our decision that the true hypothesis is $\mathrm{H}_{j}$ is made with high enough confidence. If there are at least two test statistics $\widetilde{\mathrm{LD}}_{i}$ that are no larger than $\lambda$ (i.e., $|\{i\in[m]:\widetilde{\mathrm{LD}}_{i}\leq\lambda\}|\geq 2$ ), then either our confidence in the quality of the training and test data is low, or our confidence that $X^{n}$ is generated from one of the distributions that generated $Y_{i}^{N},i\in[m]$ is low, and as such, a rejection event should be declared.
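The decision logic of this test can be sketched directly from the verbal description: accept the hypothesis with the smallest statistic when the statistic at index $i_{2}$ (the second minimum) exceeds $\lambda$ , and reject otherwise. In the sketch below the statistics $\widetilde{\mathrm{LD}}_{i}$ are assumed to have been computed already and are passed in as plain numbers; the function name and the `REJECT` marker are hypothetical. Ties are broken toward the smallest index, following the convention stated for $i_{2}$ .

```python
REJECT = "reject"


def decide(stats, lam):
    """stats[i] plays the role of the statistic for hypothesis H_{i+1}.
    Return the 1-based index of the accepted hypothesis, or REJECT."""
    if len(stats) < 2:
        raise ValueError("need at least two hypotheses")
    # Sort indices by (value, index) so that ties go to the smallest index.
    order = sorted(range(len(stats)), key=lambda i: (stats[i], i))
    i1, i2 = order[0], order[1]
    if stats[i2] > lam:
        return i1 + 1          # confident decision in favor of H_{i1+1}
    return REJECT              # at least two statistics are no larger than lam


print(decide([0.1, 0.9, 1.2], lam=0.5))
print(decide([0.1, 0.3, 1.2], lam=0.5))
```

The first call accepts $\mathrm{H}_{1}$ because only one statistic is below the threshold; the second call rejects because two statistics are no larger than $\lambda$ .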

Third, when $m=2$ , the test in (III-B) specializes to the test given in (54) and the type- $j$ rejection exponent in (8) simplifies to

[TABLE]

Note that here we consider binary classification with rejection, which is in contrast to the case of binary classification without a rejection option (cf. (9) and (II-C)).

Fourth, let us compare the exponents obtained for $m=2$ in Theorems 1 and 8. For the binary distributed detection problem without rejection (Theorem 1), the acceptance region for hypothesis $\mathrm{H}_{1}$ is $\mathcal{Q}_{\lambda}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ (cf. (7)). In this section, when $m=2$ , the rejection region is $\mathcal{Q}_{\lambda,1,2}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ (cf. (III-B)). For any $(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2})\in\mathcal{Q}_{\lambda,1,2}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ , we have

[TABLE]

which implies that $(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2})\in\mathcal{Q}_{\lambda}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ . With this observation, we see that

[TABLE]

which indicates that the type-II rejection exponent in (8) is not smaller than the type-II error exponent in (9) when restricted to the binary setting. The rough intuition here is that for the same $\lambda$ , if it happens that the optimal test for binary distributed detection with rejection in (54) decides on rejecting the two hypotheses, this implies that the optimal test for binary distributed detection without rejection in (II-C) necessarily declares that $\mathrm{H}_{1}$ is true, thus resulting in a type-II error. The reverse implication, however, is not true.

Finally, if we let $K=1$ and consider all channels to be deterministic, the test in (54) reduces to the one presented in (60).

For the $m$ -ary hypothesis testing problem with rejection in [5], Gutman used the test $\gamma_{m}^{\mathrm{Gut}}(Z^{n},\tilde{\mathbf{Y}}^{N})$ presented in (61).

It can be seen that the rejection regions for both the tests are the same. However, the acceptance regions for $\gamma_{m}^{\mathrm{Gut}}$ are not symmetric for different hypotheses. In contrast, Unnikrishnan’s test in (60) is symmetric in the $m$ hypotheses. Thus, it is more convenient to use the generalized Unnikrishnan’s test $\gamma$ in (III-B) to analyze the error and rejection exponents.

III-C Further discussions on $(\mathbf{a},\mathbf{b})$

We have the following corollary of Theorem 8.

Corollary 9.

For each $j\in[m]$ , the type- $j$ rejection exponent satisfies

[TABLE]

where $\kappa(\mathbf{Q},P|\mathbf{a},\mathbf{b},\mathcal{W})$ was defined in (20).

The proof of Corollary 9 is similar to that of Lemma 2 and hence is omitted.

We remark that when $\lambda$ is smaller than a certain threshold $\lambda_{0}=\lambda_{0}(\mathbf{P},\mathbf{a},\mathbf{b})$ , the type- $j$ rejection exponent in (9) is infinite. This is because if $\lambda\leq\lambda_{0}$ , the two constraint sets defined by $\kappa(\mathbf{Q},P_{i}|\mathbf{a},\mathbf{b},\mathcal{W})\leq\lambda$ and $\kappa(\mathbf{Q},P_{l}|\mathbf{a},\mathbf{b},\mathcal{W})\leq\lambda$ are disjoint for all distinct pairs of $i$ and $l$ , and thus, the minimization in (9) is infeasible.

Let $f_{\infty,j}(\mathbf{a},\mathbf{b},\lambda)$ denote the right-hand-side of (9). When $\lambda$ is chosen such that $f_{\infty,j}(\mathbf{a},\mathbf{b},\lambda)<\infty$ for all $(\mathbf{a},\mathbf{b})\in\mathcal{P}([K])^{2}$ , we have the following corollary concerning the optimizers of $f_{\infty,j}(\mathbf{a},\mathbf{b},\lambda)$ when there are only two hypotheses, i.e., $m=2$ .

Corollary 10.

For the binary distributed detection problem with rejection, given each $j\in[2]$ , we have

[TABLE]

and thus the optimizers $(\mathbf{a}^{*},\mathbf{b}^{*})$ are deterministic and satisfy $\mathbf{a}^{*}=\mathbf{b}^{*}$ .

The proof of Corollary 10 is similar to that of Corollary 3 and hence is omitted.

Corollary 10 implies that when the training sequences are much longer than the test sequence, it is optimal to use identical local decision rules at each sensor to pre-process the training sequences. It is natural to wonder whether there is a generalization of Corollary 10 for larger $m$ . Numerical optimization of the rejection exponent in (9) over $(\mathbf{a},\mathbf{b})\in\mathcal{P}([K])^{2}$ shows that when $m=3$ and $K=3$ , in general, it is optimal to use all $3$ local decision rules or channels to randomize the test and training sequences. When $K=4$ , however, it is optimal to use all $4$ channels in general. This differs from the main finding in Tsitsiklis’ paper [2], in which $\frac{1}{2}m(m-1)$ local decision rules suffice to achieve optimality. If the result were analogous, one would expect that for any $K$ , only $\frac{1}{2}\cdot 3\cdot(3-1)=3$ local decision rules suffice. This difference can be intuitively explained as follows. With the rejection option, the fusion center needs to partition the sample space into more regions compared to the case without rejection. Roughly speaking, this means that the fusion center needs more information or diversity from the training and test samples to attain optimality. Hence, more channels or local decision rules (compared to [2]) are needed.

IV Conclusion and Future Work

This work has taken a first step at considering the distributed detection problem à la Tsitsiklis [2] when the underlying distributions are unknown but in place of them, we have noisy training samples. We adopted the Gutman formulation [5] in (4) and derived asymptotically optimal exponents for the binary and $m$ -ary cases with and without rejection. While results as conclusive as those in Tsitsiklis’ paper [2] were not obtained, we have several important contributions, including the identification of optimal tests and the conclusion that in the binary case (with and without rejection) and when the number of training samples far exceeds test samples, one decision rule suffices for achieving the optimal error exponent.

One can consider the following avenues for future work. First, a resolution of Conjecture 5 would be desirable as it would allow us to parallel the main results in [2] for arbitrary and finite $\alpha\in\mathbb{R}_{+}$ . Second, we can consider deriving second-order asymptotic results in the spirit of Zhou, Tan, and Motani [16]. This would shed further light on the finite-length behavior of the proposed tests. Finally, it would be fruitful to study the statistical learning versions of other distributed detection formulations, e.g., the anonymous heterogeneous version proposed by Chen and Wang [10].

-A Proof of Theorem 1

Recall the definitions of $\mathbf{T}_{\mathbf{z}^{n\mathbf{a}}}$ , $\mathbf{T}_{\tilde{\mathbf{y}}_{1}^{N\mathbf{b}}}$ and $\mathbf{T}_{\tilde{\mathbf{y}}_{2}^{N\mathbf{b}}}$ in Section II-C. Define the following set of types

[TABLE]

and the following set of sequences

[TABLE]

Note that $\mathcal{T}_{{\lambda}}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ is the set $\mathcal{Q}_{\lambda}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ (defined in (7)) but restricted to types.

-A1 Achievability

In the achievability part, given any pair $(\mathbf{a},\mathbf{b})\in\mathcal{P}_{n}([K])\times\mathcal{P}_{\alpha n}([K])$ , we assume that the test $\gamma$ at the fusion center is given by (II-C), but $\lambda$ is replaced by $\tilde{\lambda}=\lambda+\frac{c_{n}}{n}$ , where $c_{n}:=\sum_{k=1}^{K}(L\log(na_{k}+1)+2L\log(\alpha nb_{k}+1))=O(\log n)$ . Then for all pairs $(\tilde{P}_{1},\tilde{P}_{2})\in\mathcal{P}(\mathcal{X})^{2}$ , the type-I error probability can be upper bounded as follows:

[TABLE]

where (70) follows from the definition of $\mathcal{T}_{{\lambda}}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ in (64).

For any $(P_{1},P_{2})\in\mathcal{P}(\mathcal{X})^{2}$ , the type-II error probability can be upper bounded as follows:

[TABLE]

Thus, using the definition of $\mathrm{LD}$ in (6) and the result in (-A1), we have the following lower bound on the type-II error exponent:

[TABLE]

-A2 Converse

Our converse proof proceeds by showing that (i) type-based tests (i.e., tests $\gamma$ that depend only on the types or partial types of the data $(Z^{n},\tilde{Y}_{1}^{N},\tilde{Y}_{2}^{N})$ ) are almost optimal, and (ii) the test in (II-C) is an asymptotically optimal type-based test.

The following lemma extends [16, Lemma 7].

Lemma 11.

For any deterministic test $\gamma$ , $\eta\in[0,1]$ , $(P_{1},P_{2})\in\mathcal{P}(\mathcal{X})^{2}$ and $(\mathbf{a},\mathbf{b})\in\mathcal{P}_{n}([K])\times\mathcal{P}_{\alpha n}([K])$ , we can construct a type-based test $\gamma^{\mathrm{T}}$ such that

[TABLE]

Proof of Lemma 11.

For $h$ and $g$ with proportions $\mathbf{a}$ and $\mathbf{b}$ respectively and any $(x^{n},y_{1}^{N},y_{2}^{N})$ , let $(Z^{n},\tilde{Y}_{1}^{N},\tilde{Y}_{2}^{N})\sim\big{(}\{W_{h(i)}(\cdot|x_{i})\}_{i\in[n]},\{W_{g(i)}(\cdot|y_{1,i})\}_{i\in[N]},\{W_{g(i)}(\cdot|y_{2,i})\}_{i\in[N]}\big{)}$ . Let $\mathcal{P}_{n\mathbf{a}+2N\mathbf{b}}([L])$ denote the set $\big{(}\prod_{k\in[K]}\mathcal{P}_{na_{k}}([L])\times\mathcal{P}_{Nb_{k}}([L])^{2}\big{)}$ and let $\mathbf{Q}=(Q_{1,1},\ldots,Q_{1,K},Q_{2,1},\ldots,Q_{2,K},Q_{3,1},\ldots,Q_{3,K})\in\mathcal{P}_{n\mathbf{a}+2N\mathbf{b}}([L])$ . For any $\mathbf{Q}$ , we use $\mathcal{\tilde{T}}_{\mathbf{Q}}^{n+2N}$ to denote the set of sequence triples $(Z^{n},\tilde{Y}_{1}^{N},\tilde{Y}_{2}^{N})$ such that for all $k\in[K]$ , $Z^{na_{k}}\in\mathcal{\tilde{T}}_{Q_{1,k}}^{na_{k}}$ , $\tilde{Y}_{1}^{Nb_{k}}\in\mathcal{\tilde{T}}_{Q_{2,k}}^{Nb_{k}}$ and $\tilde{Y}_{2}^{Nb_{k}}\in\mathcal{\tilde{T}}_{Q_{3,k}}^{Nb_{k}}$ .

Given any test $\gamma$ , define the following acceptance region:

[TABLE]

Fix any $\eta\in[0,1]$ . Given any $\mathbf{Q}\in\mathcal{P}_{n\mathbf{a}+2N\mathbf{b}}([L])$ , we can then construct the following type-based test $\gamma^{\mathrm{T}}$ :

•

If more than an $\eta$ fraction of the sequence triples in $\mathcal{\tilde{T}}_{\mathbf{Q}}^{n+2N}$ favor hypothesis $\mathrm{H}_{2}$ , i.e., $|\mathcal{A}^{\mathsf{c}}(\gamma,\mathbf{a},\mathbf{b})\cap\mathcal{\tilde{T}}_{\mathbf{Q}}^{n+2N}|>\eta|\mathcal{\tilde{T}}_{\mathbf{Q}}^{n+2N}|$ , then $\gamma^{\mathrm{T}}(\mathbf{Q})=\mathrm{H}_{2}$ ;

•

Otherwise, $\gamma^{\mathrm{T}}(\mathbf{Q})=\mathrm{H}_{1}$ .
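The construction of $\gamma^{\mathrm{T}}$ from $\gamma$ is a thresholded vote within each type class. The sketch below illustrates it in a deliberately simplified setting: a single sequence rather than a triple, and a single channel ( $K=1$ ), so that a type class is just the set of all orderings of a fixed multiset of symbols. The names and the example test `gamma_first_symbol` are hypothetical.

```python
from itertools import permutations


def type_class(counts):
    """All sequences over [L] in which symbol s appears counts[s] times."""
    base = [s for s, c in enumerate(counts) for _ in range(c)]
    return sorted(set(permutations(base)))


def type_based_test(gamma, counts, eta):
    """Declare H2 for the whole type class if more than an eta fraction of
    its members are mapped to H2 by gamma; otherwise declare H1."""
    members = type_class(counts)
    votes_h2 = sum(1 for seq in members if gamma(seq) == "H2")
    return "H2" if votes_h2 > eta * len(members) else "H1"


def gamma_first_symbol(seq):
    # Hypothetical test that is NOT type-based: it looks only at the first symbol.
    return "H2" if seq[0] == 1 else "H1"


# Type with three 0s and one 1: a quarter of the class starts with symbol 1.
print(type_based_test(gamma_first_symbol, (3, 1), eta=0.5))
print(type_based_test(gamma_first_symbol, (3, 1), eta=0.2))
```

By construction the resulting test gives the same answer for every sequence in a type class, which is the property exploited in the remainder of the proof.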

For any $(P_{1},P_{2})\in\mathcal{P}(\mathcal{X})^{2}$ and $(\mathbf{a},\mathbf{b})$ , we can relate the error probabilities of the two tests as in (80)–(85) and (86)–(91), where (83) follows since the elements in $\mathcal{\tilde{T}}_{\mathbf{Q}}^{n+2N}$ are equally likely (under any product distribution) for any $\mathbf{Q}$ .

∎

Let $\delta_{n}:=\frac{1}{n}\big{(}L\sum_{k\in[K]}(\log(na_{k}+1)+2\log(Nb_{k}+1))\big{)}=o(1)$ and fix an arbitrary sequence $\{\delta^{\prime}_{n}\}\subset(0,1)$ to be such that $\lim_{n\to\infty}\delta^{\prime}_{n}=0$ .
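The claim $\delta_{n}=o(1)$ holds because the numerator grows only logarithmically in $n$ . The short sketch below evaluates $\delta_{n}$ for increasing $n$ using $N=\lceil\alpha n\rceil$ ; the function name and the parameter values in the example are assumptions chosen for illustration.

```python
import math


def delta_n(n, alpha, a, b, L):
    """delta_n = (L / n) * sum_k [log(n a_k + 1) + 2 log(N b_k + 1)],
    with N = ceil(alpha * n) and natural logarithms."""
    N = math.ceil(alpha * n)
    total = sum(
        math.log(n * ak + 1) + 2 * math.log(N * bk + 1)
        for ak, bk in zip(a, b)
    )
    return L * total / n


# Hypothetical parameters: K = 2 channels used equally, L = 2, alpha = 2.
for n in (10, 100, 1000, 10000):
    print(n, delta_n(n, alpha=2.0, a=(0.5, 0.5), b=(0.5, 0.5), L=2))
```

The printed values decrease toward zero as $n$ grows, in line with the $O(\log n)/n$ scaling.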

Lemma 12.

Given any $(\lambda,\alpha)\in\mathbb{R}_{+}^{2}$ and any $(\mathbf{a},\mathbf{b})$ , for any type-based test $\gamma^{\mathrm{T}}$ satisfying that for all pairs of distributions $(\tilde{P}_{1},\tilde{P}_{2})\in\mathcal{P}(\mathcal{X})^{2}$ ,

[TABLE]

we have that for any pair of distributions $(P_{1},P_{2})\in\mathcal{P}(\mathcal{X})^{2}$ ,

[TABLE]

Proof of Lemma 12.

Let $\mathrm{LD}(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2},\tilde{P},\tilde{P},P)=\mathrm{LD}(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2},\tilde{P},\tilde{P},P|\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ . In other words, Lemma 12 claims that for any type-based test $\gamma^{\mathrm{T}}$ satisfying (92), if any $(\mathbf{T}_{\mathbf{z}^{n\mathbf{a}}},\mathbf{T}_{\tilde{\mathbf{y}}_{1}^{N\mathbf{b}}},\mathbf{T}_{\tilde{\mathbf{y}}_{2}^{N\mathbf{b}}})\in\mathcal{P}_{n\mathbf{a}+2N\mathbf{b}}([L])$ satisfies

[TABLE]

then we have $\gamma^{\mathrm{T}}(\mathbf{T}_{\mathbf{z}^{n\mathbf{a}}},\mathbf{T}_{\tilde{\mathbf{y}}_{1}^{N\mathbf{b}}},\mathbf{T}_{\tilde{\mathbf{y}}_{2}^{N\mathbf{b}}})=\mathrm{H}_{1}$ .

This claim can be proved by contradiction. Suppose there exists $(\mathbf{Q}_{\mathbf{z}^{n\mathbf{a}}},\mathbf{Q}_{\tilde{\mathbf{y}}_{1}^{N\mathbf{b}}},\mathbf{Q}_{\tilde{\mathbf{y}}_{2}^{N\mathbf{b}}})\in\mathcal{P}_{n\mathbf{a}+2N\mathbf{b}}([L])$ such that

[TABLE]

and $\gamma^{\mathrm{T}}(\mathbf{Q}_{\mathbf{z}^{n\mathbf{a}}},\mathbf{Q}_{\tilde{\mathbf{y}}_{1}^{N\mathbf{b}}},\mathbf{Q}_{\tilde{\mathbf{y}}_{2}^{N\mathbf{b}}})=\mathrm{H}_{2}$ . For all $(\tilde{P}_{1},\tilde{P}_{2})$ , we have

[TABLE]

where (97) follows since $|\mathcal{T}_{z^{na_{k}}}|\geq(na_{k}+1)^{-L}\exp(na_{k}H(T_{z^{na_{k}}}))$ (cf. [15, Ch. 2]) and similar lower bounds hold for $|\mathcal{T}_{\tilde{y}_{1}^{Nb_{k}}}|$ and $|\mathcal{T}_{\tilde{y}_{2}^{Nb_{k}}}|$ . However, if we choose $(\tilde{P}_{1},\tilde{P}_{2})$ such that

[TABLE]

then we have

[TABLE]

where (100) follows from the definition of $\mathrm{LD}(\cdot)$ in (6); (101) follows from (94) and the strict inequality in (102) follows because $\delta^{\prime}_{n}>0$ for all $n$ . The result in (102) contradicts the assumption that (92) is satisfied for all $(\tilde{P}_{1},\tilde{P}_{2})\in\mathcal{P}(\mathcal{X})^{2}$ . ∎
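The type-class size bound invoked for (97), $|\mathcal{T}|\geq(n+1)^{-L}\exp(nH(T))$ , together with the matching upper bound $|\mathcal{T}|\leq\exp(nH(T))$ , can be checked numerically for small examples. In the sketch below the size of a type class is the multinomial coefficient determined by the symbol counts; the function names and the example counts are hypothetical.

```python
import math


def type_class_size(counts):
    """Number of sequences with the given symbol counts (multinomial coefficient)."""
    size = math.factorial(sum(counts))
    for c in counts:
        size //= math.factorial(c)
    return size


def entropy_nats(counts):
    """Entropy (in nats) of the type with the given symbol counts."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def bounds(counts):
    """Return (lower bound, exact size, upper bound) for the type class."""
    n, L = sum(counts), len(counts)
    upper = math.exp(n * entropy_nats(counts))
    lower = upper / (n + 1) ** L
    return lower, type_class_size(counts), upper


# Hypothetical example: n = 6 symbols over an alphabet of size L = 3.
lower, size, upper = bounds((3, 2, 1))
print(lower, size, upper)
```

For these counts the exact size lies between the two bounds, as the method of types guarantees.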

Corollary 13.

Given any $(\lambda,\alpha)\in\mathbb{R}_{+}^{2}$ and any $(\mathbf{a},\mathbf{b})$ , for any test $\gamma$ satisfying that for all $(\tilde{P}_{1},\tilde{P}_{2})\in\mathcal{P}(\mathcal{X})^{2}$ ,

[TABLE]

we have that for any pair of distributions $(P_{1},P_{2})\in\mathcal{P}(\mathcal{X})^{2}$ ,

[TABLE]

Corollary 13 can be directly obtained from Lemma 11 and Lemma 12 by letting $\eta=\frac{1}{2}$ . Using Corollary 13, we have

[TABLE]

Note that the union of the set of types $\cup_{n\in\mathbb{N}}\mathcal{T}_{\lambda}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ (where $\mathcal{T}_{\lambda}$ is defined in (64)) is dense in the set of distributions $\mathcal{Q}_{\lambda}(\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ (defined in (7)); this follows from the continuity of $(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2})\mapsto\min\limits_{(\tilde{P},P)\in\mathcal{P}(\mathcal{X})^{2}}\mathrm{LD}(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2},\tilde{P},\tilde{P},P|\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ . Also, as $n\rightarrow\infty$ , we have $\delta_{n},\delta^{\prime}_{n},\frac{c_{n}}{n},\frac{\log 2}{n}\to 0$ . As a result, for any test $\gamma$ satisfying (103), the type-II error exponent can be upper bounded as follows

[TABLE]

Combining the lower and upper bounds in (76) and (-A2) respectively, we conclude that the optimal type-II error exponent is given by (9) in Theorem 1, completing the proof.

-B Proof of Lemma 2

Given any vector $\mathbf{b}\in\mathcal{P}([K])$ and any distributions $(P_{1},P_{2})\in\mathcal{P}(\mathcal{X})^{2}$ , define the following set of distributions

[TABLE]

Recall $\mathrm{LD}(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2},\tilde{P},\tilde{P},P|\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ in (6) and $f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)$ in (17). As $\alpha\to\infty$ , the objective function of $f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)$ tends to infinity unless $(\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2})\in\tilde{\mathcal{Q}}(P_{1},P_{2}|\mathbf{b},\mathcal{W})$ .

Lemma 14.

For any $\mathbf{Q}\in\mathcal{P}([L])^{K}$ and any $(\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2})\in\tilde{\mathcal{Q}}(P_{1},P_{2}|\mathbf{b},\mathcal{W})$ , we have

[TABLE]

Proof.

Given any $\tilde{\mathbf{Q}}=(\tilde{Q}_{1},\ldots,\tilde{Q}_{K})\in\mathcal{P}([L])^{K}$ , let

[TABLE]

For any $(\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2})\in\tilde{\mathcal{Q}}(P_{1},P_{2}|\mathbf{b},\mathcal{W})$ and any $i\in[2]$ ,

[TABLE]

so we have $\mathcal{S}(\{P_{i}W_{k}\}_{k\in[K]},\mathbf{b},\mathcal{W})\subset\mathcal{S}(\tilde{\mathbf{Q}}_{i},\mathbf{b},\mathcal{W})$ . Since $\mathcal{S}(\{P_{i}W_{k}\}_{k\in[K]},\mathbf{b},\mathcal{W})\neq\emptyset$ for $i\in[2]$ , we have $\mathcal{S}(\tilde{\mathbf{Q}}_{i},\mathbf{b},\mathcal{W})\neq\emptyset$ .

Note first that $(\alpha,\tilde{P},P)\in[0,\infty)\times\mathcal{P}(\mathcal{X})^{2}\mapsto\mathrm{LD}(\mathbf{Q},\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2},\tilde{P},\tilde{P},P|\alpha,\mathbf{a},\mathbf{b},\mathcal{W})$ is jointly continuous, and $\mathcal{P}(\mathcal{X})$ is compact. As such, the function

[TABLE]

with domain $[0,\infty)$ is continuous in $\alpha$ (cf. [24, Lemma 14]). Let

[TABLE]

Thus,

[TABLE]

Since $\mathcal{S}(\tilde{\mathbf{Q}}_{i},\mathbf{b},\mathcal{W})\neq\emptyset$ where $i\in[2]$ for any $(\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2})\in\tilde{\mathcal{Q}}(P_{1},P_{2}|\mathbf{b},\mathcal{W})$ , we have for any $\alpha\in[0,\infty)$

[TABLE]

If $\tilde{P}^{*}(\infty),P^{*}(\infty)\notin\mathcal{S}(\tilde{\mathbf{Q}}_{1},\mathbf{b},\mathcal{W})$ , then $g(\infty)=\infty$ , which violates (-B). So $\tilde{P}^{*}(\infty),P^{*}(\infty)\in\mathcal{S}(\tilde{\mathbf{Q}},\mathbf{b},\mathcal{W})$ and

[TABLE]

This concludes the proof. ∎

Thus, for any $\mathbf{Q}\in\mathcal{P}([L])^{K}$ and any $(\tilde{\mathbf{Q}}_{1},\tilde{\mathbf{Q}}_{2})\in\tilde{\mathcal{Q}}(P_{1},P_{2}|\mathbf{b},\mathcal{W})$ , we have

[TABLE]

where (119) follows from Lemma 14, (120) follows since $0\leq b_{k}\|P_{1}W_{k}-\tilde{P}W_{k}\|_{\infty}\leq b_{k}\|\tilde{Q}_{1,k}-\tilde{P}W_{k}\|_{\infty}+b_{k}\|\tilde{Q}_{1,k}-P_{1}W_{k}\|_{\infty}=0$ for all $k\in[K]$ , (121) follows from the definition of $\mathcal{P}(P_{1}|\mathbf{b},\mathcal{W})$ in (19) and (122) is due to the definition of $\kappa(\cdot)$ in (20).

Combining the above results, we obtain the desired result in Lemma 2.

-C Proof of Corollary 3

From Lemma 2, we know that as $\alpha\rightarrow\infty$ , given any $\lambda$ and any $(\mathbf{a},\mathbf{b})\in\mathcal{P}([K])^{2}$ , $f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)$ converges to

[TABLE]

For any $(\mathbf{a},\mathbf{b})\in\mathcal{P}([K])^{2}$ , we define

[TABLE]

Recalling the definition of $\kappa(\cdot)$ in (20) and noting that $P_{1}\in\mathcal{P}(P_{1}|\mathbf{b},\mathcal{W})$ , given any $(\mathbf{a},\mathbf{b})$ such that $\mathcal{J}(\mathbf{a},\mathbf{b})\neq\emptyset$ , we have that

[TABLE]

On the other hand, for any $(\mathbf{a},\mathbf{b})$ such that $\mathcal{J}(\mathbf{a},\mathbf{b})=\emptyset$ , we have that

[TABLE]

Note that for any $\mathbf{a}\in\mathcal{P}([K])$ , there exists $\mathbf{b}\in\mathcal{P}([K])$ such that $\mathcal{J}(\mathbf{a},\mathbf{b})=\emptyset$ (e.g., $\mathrm{supp}(\mathbf{a})=\mathrm{supp}(\mathbf{b})$ ); on the other hand, there also exists $\mathbf{b}$ such that $\mathcal{J}(\mathbf{a},\mathbf{b})\neq\emptyset$ , e.g., $\mathrm{supp}(\mathbf{b})\subset\mathrm{supp}(\mathbf{a})$ when $|\mathrm{supp}(\mathbf{a})|\geq 1$ and $\mathrm{supp}(\mathbf{b})\cap\mathrm{supp}(\mathbf{a})=\emptyset$ when $|\mathrm{supp}(\mathbf{a})|=1$ . Thus, combining (125) and (126), we have that for any $\lambda\in\mathbb{R}_{+}$ ,

[TABLE]

For each $k\in[K]$ , given $\lambda\in\mathbb{R}_{+}$ , let $Q_{k}^{*}$ achieve $f_{\infty}(\mathbf{e}_{k},\mathbf{e}_{k},\lambda)$ , i.e.,

[TABLE]

Given any $(\mathbf{a},\mathbf{b})$ such that $\mathcal{J}(\mathbf{a},\mathbf{b})=\emptyset$ , from (130), we know that

[TABLE]

and thus

[TABLE]

Combining (128) and (135), we have that

[TABLE]

On the other hand, it is easy to verify that

[TABLE]

The proof of Corollary 3 is completed by combining (136) and (137).

-D Numerical Evaluations of $f_{\alpha}(\mathbf{a},\mathbf{b},\lambda)$ when $K=3$

In this subsection, we present further numerical evidence for Conjecture 5 in Figs. 10 and 11. See the captions for descriptions of the figures.

-E Proof of Theorem 8

Recall the definitions of $i_{1}(\mathbf{Z}^{n},\tilde{\mathbf{Y}}^{N})$ in (49) and $i_{2}(\mathbf{Z}^{n},\tilde{\mathbf{Y}}^{N})$ in (50). Given any tuple of distributions $(\mathbf{Q},\{\tilde{\mathbf{Q}}_{i}\}_{i\in[m]})\in\mathcal{P}([L])^{(m+1)K}$ , we use $\widetilde{\mathrm{LD}}_{j}$ to denote $\widetilde{\mathrm{LD}}_{j}(\mathbf{Q},\{\tilde{\mathbf{Q}}_{i}\}_{i\in[m]})$ if there is no risk of confusion.

-E1 Achievability

We use the test in (III-B) with $\lambda$ replaced by $\tilde{\lambda}=\lambda+\frac{c_{n}}{n}$ , where $c_{n}:=\sum_{k\in[K]}L(\log(na_{k}+1)+m\log(\alpha nb_{k}+1))=O(\log n)$ . Given any $\mathbf{P}\in\mathcal{P}(\mathcal{X})^{m}$ , for any tuple of types $(\mathbf{T}_{\mathbf{Z}^{n\mathbf{a}}},\{\mathbf{T}_{\tilde{\mathbf{Y}}_{i}^{N\mathbf{b}}}\}_{i\in[m]})\in\mathcal{P}_{n\mathbf{a}+mN\mathbf{b}}([L])$ , for each $j\in[m]$ , the type- $j$ error and rejection probabilities for the test in (III-B) are respectively

[TABLE]

For each $j\in[m]$ and for all $\tilde{\mathbf{P}}\in\mathcal{P}(\mathcal{X})^{m}$ , we upper bound the type- $j$ error probability as follows:

[TABLE]

where (141) follows since $\bigcap_{i\neq i_{1},i_{1}\neq j}\big\{\widetilde{\mathrm{LD}}_{i}>\tilde{\lambda}\big\}\subset\big\{\widetilde{\mathrm{LD}}_{j}>\tilde{\lambda}\big\}$ and (145) follows from the definition of $\widetilde{\mathrm{LD}}_{j}$ in (44).

Similarly, for each $j\in[m]$ , we upper bound the type- $j$ rejection probability as follows:

[TABLE]

Using (-E1) and the definition of $\mathrm{LD}_{j}^{[m]}(\mathbf{Q},\{\tilde{\mathbf{Q}}_{i}\}_{i\in[m]},\mathbf{P})$ in (43), we arrive at the following lower bound on the type- $j$ rejection exponent:

[TABLE]

-E2 Converse

Similar to the binary case, the converse proof proceeds by showing that (i) type-based tests, i.e., tests $\gamma$ that depend only on the types or partial types $\mathbf{T}_{\mathbf{Z}^{n\mathbf{a}}}$ and $\{\mathbf{T}_{\tilde{\mathbf{Y}}_{i}^{N\mathbf{b}}}\}_{i\in[m]}$ of the sequences $(Z^{n},\tilde{\mathbf{Y}}^{N})$ , are almost optimal and (ii) the test in (III-B) is an asymptotically optimal type-based test.
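To make the notion of partial types concrete, the following minimal sketch computes, for a sequence whose $i$-th symbol is assigned to channel $h(i)$, the channel proportions $\mathbf{a}$ and the empirical distribution of the symbols received on each channel. The function name `partial_types`, the zero-based indexing, and the toy sequences are assumptions for illustration; the paper's precise definition of $\mathbf{T}_{\mathbf{Z}^{n\mathbf{a}}}$ is given in its main text.

```python
from collections import Counter


def partial_types(seq, assignment, num_channels, alphabet_size):
    """Return (proportions, types): channel proportions and per-channel empirical types."""
    n = len(seq)
    proportions, types = [], []
    for k in range(num_channels):
        # Symbols whose index is assigned to channel k.
        sub = [s for s, h in zip(seq, assignment) if h == k]
        proportions.append(len(sub) / n)
        counts = Counter(sub)
        # An unused channel has no symbols; report an all-zero vector for it.
        types.append([counts[l] / len(sub) if sub else 0.0 for l in range(alphabet_size)])
    return proportions, types


# Hypothetical toy data: n = 6 symbols over {0, 1, 2}, K = 2 channels.
h = [0, 0, 0, 1, 1, 1]
z = [0, 1, 1, 0, 2, 2]
z_permuted = [1, 0, 1, 2, 0, 2]  # same symbols, reordered within each channel

a, types_z = partial_types(z, h, num_channels=2, alphabet_size=3)
_, types_perm = partial_types(z_permuted, h, num_channels=2, alphabet_size=3)
print(a, types_z, types_z == types_perm)
```

The two sequences differ only by a reordering within each channel and therefore share the same partial types, so any type-based test must return the same decision on both.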

Lemma 15.

For any test $\gamma$ , $(\eta_{1},\ldots,\eta_{m})\in[0,1]^{m}$ , any $(\mathbf{a},\mathbf{b})\in\mathcal{P}_{n}([K])\times\mathcal{P}_{\alpha n}([K])$ and any tuple of distributions $\mathbf{P}\in\mathcal{P}(\mathcal{X})^{m}$ , we can construct a type-based test $\gamma^{\mathrm{T}}$ such that for each $j\in[m]$ ,

[TABLE]

where $\eta_{\min}:=\min_{i\in[m]}\eta_{i}$ and $\eta_{\mathrm{sum}}:=\sum_{i\in[m]}\eta_{i}$ .

The proof of Lemma 15 is similar to the proof of Lemma 11 and thus omitted.

Before stating the next result, let $\delta_{n}=\frac{c_{n}}{n}$ , and fix an arbitrary sequence $\{\delta_{n}^{\prime}\}\subset(0,1)$ such that $\lim_{n\to\infty}\delta_{n}^{\prime}=0$ .
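As a quick numerical check that $\delta_{n}=\frac{c_{n}}{n}$ vanishes, the sketch below evaluates $c_{n}=\sum_{k\in[K]}L(\log(na_{k}+1)+m\log(\alpha nb_{k}+1))$ from the achievability part for growing $n$. The parameter values are hypothetical and the natural logarithm is assumed; a different base only rescales $c_{n}$ by a constant.

```python
import math


def c_n(n, a, b, alpha, L, m):
    """c_n = sum_k L * (log(n a_k + 1) + m log(alpha n b_k + 1))."""
    return sum(
        L * (math.log(n * ak + 1) + m * math.log(alpha * n * bk + 1))
        for ak, bk in zip(a, b)
    )


# Hypothetical parameters: K = 2 channels, output alphabet size L = 3, m = 2 hypotheses.
a = [0.5, 0.5]
b = [0.25, 0.75]
ns = (10, 100, 1000, 10000)
deltas = [c_n(n, a, b, alpha=2.0, L=3, m=2) / n for n in ns]
for n, d in zip(ns, deltas):
    print(n, d)
```

Since $c_{n}$ grows only logarithmically in $n$, the printed ratios decrease toward zero, which is what allows the threshold $\tilde{\lambda}=\lambda+\frac{c_{n}}{n}$ to converge to $\lambda$.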

Lemma 16.

Given any $(\lambda,\alpha)\in\mathbb{R}_{+}^{2}$ , any $(\mathbf{a},\mathbf{b})$ , for any type-based test $\gamma^{\mathrm{T}}$ satisfying that for all tuples of distributions $\tilde{\mathbf{P}}\in\mathcal{P}(\mathcal{X})^{m}$ ,

[TABLE]

we have that for any $\mathbf{P}\in\mathcal{P}(\mathcal{X})^{m}$ ,

[TABLE]

Proof.

The lemma is proved by showing that for any type-based test $\gamma^{\mathrm{T}}$ satisfying (155), if any $(\mathbf{T}_{\mathbf{Z}^{n\mathbf{a}}},\{\mathbf{T}_{\tilde{\mathbf{Y}}_{i}^{N\mathbf{b}}}\}_{i\in[m]})\in\mathcal{P}_{n\mathbf{a}+mN\mathbf{b}}([L])$ satisfies

[TABLE]

then we have $\gamma^{\mathrm{T}}(\mathbf{T}_{\mathbf{z}^{n\mathbf{a}}},\{\mathbf{T}_{\tilde{\mathbf{y}}_{i}^{N\mathbf{b}}}\}_{i\in[m]})=\mathrm{H}_{\mathrm{r}}$ .

To prove this claim, suppose, for the sake of contradiction, that there exists $(\mathbf{Q}_{\mathbf{Z}^{n\mathbf{a}}},\{\mathbf{Q}_{\tilde{\mathbf{Y}}_{i}^{N\mathbf{b}}}\}_{i\in[m]})\in\mathcal{P}_{n\mathbf{a}+mN\mathbf{b}}([L])$ such that i) $\gamma^{\mathrm{T}}(\mathbf{Q}_{\mathbf{z}^{n\mathbf{a}}},\{\mathbf{Q}_{\tilde{\mathbf{y}}_{i}^{N\mathbf{b}}}\}_{i\in[m]})=\mathrm{H}_{k}$ for some $k\in[m]$ , and ii) there exists $(l,j)\in[m]^{2}$ such that $l\neq j$ and

[TABLE]

In the following analysis, fix $j\in[m]$ such that $j\neq k$ . We can then lower bound the type- $j$ error probability as follows:

[TABLE]

If we set

[TABLE]

then we have

[TABLE]

where (166) follows from the definition of $\widetilde{\mathrm{LD}}_{j}$ in (44). Thus, the inequality in (168) contradicts the conditions in (155) and the proof of Lemma 16 is completed. ∎

Using Lemmas 15 and 16, we obtain the following corollary, which provides a lower bound on the rejection probability for any test whose error probabilities decay exponentially fast under all hypotheses for all tuples of distributions.

Corollary 17.

Given any $(\lambda,\alpha)\in\mathbb{R}_{+}^{2}$ , any $(\mathbf{a},\mathbf{b})$ , for any test $\gamma$ satisfying that for all tuples of distributions $\tilde{\mathbf{P}}\in\mathcal{P}(\mathcal{X})^{m}$ ,

[TABLE]

we have that for any $\mathbf{P}\in\mathcal{P}(\mathcal{X})^{m}$ ,

[TABLE]

Since

[TABLE]

using Corollary 17, we have

[TABLE]

Using (174), for each $j\in[m]$ , given any tuple of distributions $\mathbf{P}$ , the type- $j$ rejection exponent can be upper bounded as follows

[TABLE]

Acknowledgments

The authors would like to thank Nicolas Gillis and I-Hsiang Wang for helpful discussions.

Bibliography

[1] R. R. Tenney and N. R. Sandell, “Detection with distributed sensors,” IEEE Transactions on Aerospace and Electronic Systems, no. 4, pp. 501–510, 1981.
[2] J. N. Tsitsiklis, “Decentralized detection by a large number of sensors,” Mathematics of Control, Signals and Systems, vol. 1, no. 2, pp. 167–182, 1988.
[3] J.-F. Chamberland and V. V. Veeravalli, “Wireless sensors in distributed detection applications,” IEEE Signal Processing Magazine, vol. 24, no. 3, pp. 16–25, 2007.
[4] W. P. Tay, J. N. Tsitsiklis, and M. Z. Win, “Bayesian detection in bounded height tree networks,” IEEE Transactions on Signal Processing, vol. 57, no. 10, pp. 4042–4051, 2009.
[5] M. Gutman, “Asymptotically optimal classification for multiple tests with empirically observed statistics,” IEEE Transactions on Information Theory, vol. 35, no. 2, pp. 401–408, 1989.
[6] J. Unnikrishnan, “Asymptotically optimal matching of multiple sequences to source distributions and training sequences,” IEEE Transactions on Information Theory, vol. 61, no. 1, pp. 452–468, 2015.
[7] J. Ziv, “On classification with empirically observed statistics and universal data compression,” IEEE Transactions on Information Theory, vol. 34, no. 2, pp. 278–286, 1988.
[8] J.-F. Chamberland and V. V. Veeravalli, “Decentralized detection in sensor networks,” IEEE Transactions on Signal Processing, vol. 51, no. 2, pp. 407–416, 2003.
