Robust control of varying weak hyperspectral target detection with sparse non-negative representation

Raphael Bacher; Celine Meillier; Florent Chatelain; Olivier Michel

arXiv:1702.00609·stat.ME·May 24, 2017·IEEE Trans. Signal Process.


TL;DR

This paper introduces a robust hyperspectral source detection method using sparse non-negative representations and false discovery rate control, effectively handling spatially varying faint signals in high-dimensional data.

Contribution

It develops a novel multiple-comparison detection approach with robust error control tailored for hyperspectral data using sparse, non-negative representations on coherent dictionaries.

Findings

01

Successfully applied to real Multi-Unit Spectrograph Explorer data

02

Achieves reliable detection of faint hyperspectral sources

03

Provides controlled false discovery rate in high-dimensional testing

Abstract

In this study, a multiple-comparison approach is developed for detecting faint hyperspectral sources. The detection method relies on a sparse and non-negative representation on a highly coherent dictionary to track a spatially varying source. A robust control of the detection errors is ensured by learning the test statistic distributions on the data. The resulting control is based on the false discovery rate, to take into account the large number of pixels to be tested. This method is applied to data recently recorded by the three-dimensional spectrograph Multi-Unit Spectrograph Explorer.

Tables

Table 1. Comparison between FDR and PFA control on data with and without a target. Data were built from noise-only regions in real MUSE data, and a synthetic source was added. The number of tests is 2500 (50 by 50 pixels) and the source size is 185 pixels. Results were averaged over 5 different regions.

	PFA 0.05	PFA 0.001	FDR 0.2
Noise-only
False Detections (pixels)	144	2	0
True Detections (pixels)	0	0	0
Source + noise area
False Detections (pixels)	117	2	30
True Detections (pixels)	153	106	133
False discovery proportion (%)	43.3	1.8	18.4

Equations

$$\left\{\begin{array}{ll}\mathcal{H}_{0}:\boldsymbol{y}=\boldsymbol{\epsilon},\\ \mathcal{H}_{1}:\boldsymbol{y}=\boldsymbol{x}+\boldsymbol{\epsilon},\end{array}\right.$$

$$\boldsymbol{x}\approx a_{i_{1}}\boldsymbol{d}_{i_{1}}+\dots+a_{i_{k}}\boldsymbol{d}_{i_{k}},\quad\text{s.t. }a_{i_{j}}>0,\text{ for }1\leq j\leq k,$$

$$\left\{\begin{array}{ll}\mathcal{H}_{0}:a_{1}=a_{2}=\ldots=a_{m}=0,\\ \mathcal{H}_{1}:\textrm{at least one }a_{i}>0,\end{array}\right.$$

$$S(\boldsymbol{y},\boldsymbol{d})\equiv\left\langle\frac{\boldsymbol{d}}{||\boldsymbol{d}||},\boldsymbol{y}\right\rangle=\boldsymbol{d}^{T}\boldsymbol{y},$$

$$S(\boldsymbol{y},\boldsymbol{d})\equiv\frac{\langle\boldsymbol{d},\boldsymbol{y}\rangle}{||\boldsymbol{d}||\,||\boldsymbol{y}||}=\frac{\boldsymbol{d}^{T}\boldsymbol{y}}{||\boldsymbol{y}||},$$

$$T_{\textrm{max}}(\boldsymbol{y})\equiv\max_{1\leq j\leq m}S(\boldsymbol{y},\boldsymbol{d}_{j})\underset{\mathcal{H}_{0}}{\overset{\mathcal{H}_{1}}{\gtrless}}\eta,$$

$$\begin{aligned}T_{\textrm{max}}(-\boldsymbol{\epsilon})&=\max_{j}S(-\boldsymbol{\epsilon},\boldsymbol{d}_{j})\\&=-\min_{j}S(\boldsymbol{\epsilon},\boldsymbol{d}_{j})=-T_{\textrm{min}}(\boldsymbol{\epsilon}),\end{aligned}$$

$$F(t)=\pi_{0}F_{0}(t)+\pi_{1}F_{1}(t),$$

$$F(t)=\pi_{0}F_{0}(t),\quad\text{for }t\leq\mu_{0}.$$

$$G(t)=\pi_{0}G_{0}(t)+\pi_{1}G_{1}(t),$$

$$G(t)=\pi_{0}G_{0}(t),\quad\text{for }t\geq\mu_{0}.$$

$$\pi_{0}F_{0}(t)=\left\{\begin{array}{ll}F(t)&\text{if }t\leq\mu_{0},\\ \pi_{0}-G(t)&\text{otherwise.}\end{array}\right.$$

$$\overline{F}(t)=\frac{\#\{T_{\textrm{max}}(\boldsymbol{y}_{i})\leq t\}}{n},\qquad\overline{G}(t)=\frac{\#\{-T_{\textrm{min}}(\boldsymbol{y}_{i})>t\}}{n},$$

$$\sup_{t}\left|\overline{F}(t)-F(t)\right|\longrightarrow 0,\qquad\sup_{t}\left|\overline{G}(t)-G(t)\right|\longrightarrow 0,$$

$$\overline{F}(\mu)=\overline{G}(\mu).$$

$$\widehat{\mu}_{0}=\frac{t_{(n)}+t_{(n+1)}}{2}.$$

$$\boldsymbol{s}_{0}=\left\{T_{\textrm{max}}(\boldsymbol{y}_{i})\leq\widehat{\mu}_{0}\right\},$$

$$\boldsymbol{g}_{0}=\left\{-T_{\textrm{min}}(\boldsymbol{y}_{i})>\widehat{\mu}_{0}\right\},$$

$$\widehat{\pi}_{0}=\frac{2n_{0}}{n}$$

$$\widehat{F}_{0}(t)=\frac{\#\{s_{0,i}\leq t\}+\#\{g_{0,i}\leq t\}}{2n_{0}}$$

$$\textrm{FDR}=E\left[\frac{U}{\max(R,1)}\right],$$

$$\hat{k}=\max\left\{0\leq k\leq n:p_{(k)}\leq q\frac{k}{n}\right\},$$

$$p_{i}=1-\widehat{F}_{0}\left(T_{\textrm{max}}(\boldsymbol{y}_{i})\right),\quad\text{for }1\leq i\leq n.$$

$$\hat{\pi}_{0}^{*}(\zeta)=\min\left\{\frac{1+\#\{p_{i}>\zeta\}}{(1-\zeta)n},1\right\},\quad\text{for }\zeta\in[0,1),$$

$$\left\{\begin{array}{ll}\mathcal{H}_{0}:\boldsymbol{y}=\boldsymbol{\epsilon},\\ \mathcal{H}_{1}:\boldsymbol{y}=\boldsymbol{D}\boldsymbol{a}+\boldsymbol{\epsilon},\text{ with }||\boldsymbol{a}||_{0}=1,\boldsymbol{a}\geq\boldsymbol{0}\end{array}\right.$$

$$T_{\textrm{GLR}}(\boldsymbol{y})=\frac{\max_{\boldsymbol{a}}p(\boldsymbol{y}\mid\boldsymbol{D}\boldsymbol{a},\mathcal{H}_{1})}{p(\boldsymbol{y}\mid\mathcal{H}_{0})}\quad\text{s.t. }||\boldsymbol{a}||_{0}=1,\ \boldsymbol{a}\geq\boldsymbol{0},$$

$$T_{\textrm{GLR}}(\boldsymbol{y})=\frac{\boldsymbol{d}_{\hat{j}}^{T}\widehat{\boldsymbol{\Sigma}}^{-1}\boldsymbol{y}}{\boldsymbol{d}_{\hat{j}}^{T}\widehat{\boldsymbol{\Sigma}}^{-1}\boldsymbol{d}_{\hat{j}}}$$

$$\alpha_{m}=\Pr\left(\max z^{m}>\eta\right),\quad\text{under }\mathcal{H}_{0}.$$

$$\alpha_{m}=1-\Pr\left(\max z^{m}\leq\eta\right)=1-\Pr\left(z_{1}^{m}\leq\eta\right)^{m}=1-\Phi(\eta)^{m},$$

$$M_{m+1}(t)=\Pr\left(z_{1}^{m+1}\leq t\mid z_{2}^{m+1}\leq t,z_{3}^{m+1}\leq t\right)\times M_{m}(t),$$


Full text

Robust control of varying weak hyperspectral target detection with sparse non-negative representation

Raphael Bacher, Celine Meillier, Florent Chatelain and Olivier Michel

The authors are with the Image and Signal Department, GIPSA-Lab, Grenoble Institute of Technology, Saint Martin d’Heres 38400, France (e-mail: [email protected]).


I Introduction

With the constant development of new imaging devices, the exploration of massive multimodal datasets has become an important field of study, with many challenges to face. In particular, the present study is motivated by the need to detect faint emission line features in massive hyperspectral data produced by the recent three-dimensional (3D) spectrograph Multi-Unit Spectrograph Explorer (MUSE) instrument [1]. The targeted emission lines are markers of spatially extended structures (or ’halos’, surrounding galaxies) that can exhibit spectral variability (spectral shifts). Furthermore, the presence of a large number of nuisance sources with high dynamics in the background makes the estimation of the background statistics particularly challenging. Searching for faint signals in massive datasets requires the removal of possibly strong contributions from unwanted objects. Typically, for tracking faint signatures in hyperspectral data with high background dynamics, the spectral baseline must be estimated and removed. Finally, the size of the data to be explored calls for control of detection errors with a global significance, such as the false discovery rate (FDR) [2], to allow unsupervised detection of the targets.

To summarize, the present study addresses a quite general detection problem whose main features are:

•

highly variable, partially known, weak target signatures;

•

background difficult to model;

•

possible presence of strong contributions from unwanted sources that have to be removed;

•

necessity for robust global mis-detections control.

Many methods have been developed in recent years to detect targets in hyperspectral data [3], all requiring knowledge of the background and/or the target signature. Among these studies, spectral anomaly detectors can be used when the searched signal is unknown, but rely on a parametric statistical modeling of the background signature [4]. Spectral matching detectors, such as adaptive matched filters [5] or adaptive cosine estimators [6], exploit prior knowledge of both the target signature and the background characterization. A common feature of these approaches is the lack of global control of the false alarm rate. Furthermore, these detectors were developed in a remote-sensing framework [7] at rather high signal-to-noise ratio (SNR) and can barely be adapted to the new challenges (low SNR, absence of ground truth…) raised by new massive hyperspectral data such as those provided by new instruments in astronomy. An alternative class of methods relies on sparse representation [8] based on a trained dictionary. These are in fact mostly reconstruction methods, exploited for detection purposes. To our knowledge, these methods do not allow calibration of the type I error (false alarm) control of the test. Another pitfall is that, in general, no training set is available in the astronomical context. Finally, recent approaches [9], [10] try to tackle a problem having the same features as ours. However, the generalized likelihood ratio based solutions proposed in these studies do not allow reliable control of detection errors. This control is crucial for massive datasets in general, and for the present MUSE dataflow in particular, and is the core of this study.

Rare events or sparse signals detection problems have received much attention in the recent statistical literature. Multiple-comparison procedures, such as higher criticism or Bonferroni-type methods, have been proposed and shown to have asymptotically optimal detection properties under sparsity regimes [11, 12, 13, 14]. Such methods do not require specification of the signals/events to be detected. Higher criticism procedures can be viewed as adaptive to the unknown sparsity level and power of the signals to be detected. These multiple-comparison procedures can be applied to the dictionary-based representations of the signals. Overcomplete and/or coherent dictionaries are prone to provide sparse representations matched to the application at hand, and were shown to greatly improve the detection power [11]. Again, no global error control is performed by these methods.

The purpose of the present paper is thus threefold:

•

derive a detection algorithm that benefits from the detection power of the sparse representation based approaches;

•

propose a method that requires very weak assumptions on the background or target statistical properties;

•

operate a test procedure that allows control of the global false alarm rate (FDR).

The proposed method is built upon a spectral matching approach over a highly coherent dictionary of target spectra, to take the variability into account. It elaborates on the max-test study presented in [13] for the detection of rare events. As such, it is built on a sparse representation whose purpose is to allow the formulation of a detection test that does not need any signal reconstruction. To ensure a calibration of the test procedure (FDR-wise) that is robust to background misspecification, a new simple procedure is proposed, which mainly requires symmetry of the noise distribution. Note that exploiting this symmetry of the noise versus the positivity of sources in an astronomical context was also developed in [15], but without providing a global control of the errors or the formalization of a varying target matched over a dictionary of spectral shapes.

The paper is organized as follows. Section II defines the core of the proposed detection approach. The application-oriented design of the dictionary, the data preprocessing step, and the results on the real MUSE data are described in section III. Some conclusions and perspectives are drawn up in section IV.

Notations

In the following, a ’pixel’ refers indifferently to a position in the MUSE spatial grid and to the associated spectrum. A pixel or its associated spectrum vector is represented by bold letters e.g. $\boldsymbol{x}$ , and bold capital letters refer to matrices.

II Detection method

II-A Testing problem

We now address the detection of a signal $\boldsymbol{x}$ from noisy data $\boldsymbol{y}\in\mathbb{R}^{l}$. Let $\mathcal{H}_{0}$ and $\mathcal{H}_{1}$ be the hypotheses denoting, respectively, the absence or presence of the source contribution $\boldsymbol{x}$. The testing problem is:

$$\left\{\begin{array}{ll}\mathcal{H}_{0}:\boldsymbol{y}=\boldsymbol{\epsilon},\\ \mathcal{H}_{1}:\boldsymbol{y}=\boldsymbol{x}+\boldsymbol{\epsilon},\end{array}\right.\tag{3}$$

where $\boldsymbol{\epsilon}\in\mathbb{R}^{l}$ is a noise vector, centered and independent of $\boldsymbol{x}$, for which the distribution is not known.

When $\boldsymbol{x}$ is not fully specified, a classical approach for (3) consists of modeling $\boldsymbol{x}$ as a sparse superposition of reference signals taken from a massively overcomplete dictionary $\boldsymbol{D}$; see for instance [16], or [17] for classification tasks. The reference signals, or atoms, correspond to the column vectors $\boldsymbol{d}_{j}\in\mathbb{R}^{l}$, for $1\leq j\leq m$, of $\boldsymbol{D}\in\mathbb{R}^{l\times m}$, where $m$ is the total number of references. These atoms are usually scaled to be $\ell_{2}$-normalized: $||\boldsymbol{d}_{j}||_{2}=1$ for $1\leq j\leq m$.

Moreover, in the present context, the signal of interest $\boldsymbol{x}$ is generally assumed to be non-negative. Thus, to enforce a non-negative decomposition, the atoms $\boldsymbol{d}_{j}$, for $1\leq j\leq m$, are assumed to be non-negative. It should be noted that, unlike most of the sparse representation techniques in the literature, we do not seek to build an optimal dictionary for reconstruction/estimation but for the design of a good detection test. The dictionary construction, based on physical priors and specific to the application at hand, will be addressed in section III-C1 in the framework of the MUSE application. (Footnote 1: In this application the non-negativity constraints can be relaxed. The target $\boldsymbol{x}$ and the atoms $\boldsymbol{d}_{j}$ can have negative or positive contributions, as long as their dot products $\boldsymbol{x}^{T}\boldsymbol{d}_{j}$ are non-negative for $1\leq j\leq m$.) Under the non-negativity and sparsity assumptions, the target signal can be expressed as

$$\boldsymbol{x}\approx a_{i_{1}}\boldsymbol{d}_{i_{1}}+\dots+a_{i_{k}}\boldsymbol{d}_{i_{k}},\quad\text{s.t. }a_{i_{j}}>0,\text{ for }1\leq j\leq k,\tag{4}$$

where $k\ll m$. Based on this representation, the detection problem reduces to a multiple comparison procedure with one-sided tests:

$$\left\{\begin{array}{ll}\mathcal{H}_{0}:a_{1}=a_{2}=\ldots=a_{m}=0,\\ \mathcal{H}_{1}:\textrm{at least one }a_{i}>0.\end{array}\right.\tag{5}$$

II-B Test statistic

We now seek a test statistic adapted to the detection problem (5) obtained by sparse representation. Let us first consider the statistic for a single atom. Let $S(\boldsymbol{y},\boldsymbol{d})$ be a measure of similarity between the observed data $\boldsymbol{y}\in\mathbb{R}^{l}$ and a normalized reference vector $\boldsymbol{d}\in\mathbb{R}^{l}$. Popular examples of similarity measures include the matched filter statistic

$$S(\boldsymbol{y},\boldsymbol{d})\equiv\left\langle\frac{\boldsymbol{d}}{||\boldsymbol{d}||},\boldsymbol{y}\right\rangle=\boldsymbol{d}^{T}\boldsymbol{y},\tag{6}$$

or the spectral angular distance (SAD)

$$S(\boldsymbol{y},\boldsymbol{d})\equiv\frac{\langle\boldsymbol{d},\boldsymbol{y}\rangle}{||\boldsymbol{d}||\,||\boldsymbol{y}||}=\frac{\boldsymbol{d}^{T}\boldsymbol{y}}{||\boldsymbol{y}||},\tag{7}$$

which is a classical distance in hyperspectral analysis [18]. For a given signal amplitude $a=||\boldsymbol{y}||>0$, such similarity measures are maximized when $\boldsymbol{y}=a\boldsymbol{d}$.

Based now on the pairwise similarity measures $S(\boldsymbol{y},\boldsymbol{d}_{j})$ between the observation $\boldsymbol{y}$ and the dictionary atoms $\boldsymbol{d}_{j}$, for $1\leq j\leq m$, a global test for the multiple (with regard to the $m$ atoms) testing problem introduced in (5) can be derived from a Bonferroni-like correction. Accounting for (4), this leads us to consider the following one-sided max-test approach:

$$T_{\textrm{max}}(\boldsymbol{y})\equiv\max_{1\leq j\leq m}S(\boldsymbol{y},\boldsymbol{d}_{j})\underset{\mathcal{H}_{0}}{\overset{\mathcal{H}_{1}}{\gtrless}}\eta,\tag{8}$$

where $\eta$ is a given threshold. The motivations for using this max-test approach are two-fold. First, from a theoretical point of view, with highly sparse signals, the max-test method is asymptotically as efficient as the asymptotically optimal higher criticism method, as demonstrated in [11, 13]. Secondly, in finite sample settings such as the MUSE hyperspectral datasets, max-test has been shown to be relatively efficient [9, 19], and empirically more powerful than higher criticism methods [20].
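As an illustration, the max-test over a dictionary can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`normalize_atoms`, `similarity`, `max_test`), the random toy dictionary, and the threshold value are all hypothetical, and the observations are stored as the rows of `Y`.

```python
import numpy as np

def normalize_atoms(D):
    """Scale each column (atom) of D to unit l2 norm."""
    return D / np.linalg.norm(D, axis=0, keepdims=True)

def similarity(Y, D, kind="mf"):
    """Pairwise similarities S(y_i, d_j).

    Y has shape (n, l), one observation per row.
    D has shape (l, m), one unit-norm atom per column.
    """
    scores = Y @ D  # matched filter: d_j^T y_i
    if kind == "sad":
        # spectral-angle variant: additionally divide by ||y_i||
        scores = scores / np.linalg.norm(Y, axis=1, keepdims=True)
    return scores

def max_test(Y, D, kind="mf"):
    """T_max(y_i) = max_j S(y_i, d_j) for every observation."""
    return similarity(Y, D, kind).max(axis=1)

# Toy data (hypothetical): 4 spectra of length 20, 5 non-negative atoms.
rng = np.random.default_rng(0)
l, m, n = 20, 5, 4
D = normalize_atoms(np.abs(rng.normal(size=(l, m))))
Y = rng.normal(size=(n, l))
Y[0] += 3.0 * D[:, 2]        # add a weak source to the first spectrum

t_max = max_test(Y, D)
eta = 2.5                    # arbitrary threshold, for illustration only
detected = t_max > eta
```

In this sketch the threshold `eta` is fixed by hand; choosing it so that detection errors are controlled is the subject of the remainder of the section.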

We now tackle the problem of applying the max-test defined in (8) to a large number $n$ of data realisations $\{\boldsymbol{y}_{i}\}_{1\leq i\leq n}$. We are again in a multiple-testing context, now with regard to the number of samples $n$. It is with regard to this context that we seek to control the detection errors. To fix the decision threshold $\eta$ while controlling the type I errors, i.e., the samples under $\mathcal{H}_{0}$ that will be falsely detected as $\mathcal{H}_{1}$, the distribution under the null hypothesis $\mathcal{H}_{0}$ of the max-test statistic $T_{\textrm{max}}(\boldsymbol{y})$ must be known. In real applications such as the MUSE data, due to the physical process and preprocessing steps (e.g., interpolation, background subtraction), noise is spatially and spectrally correlated with an unknown complex dependence structure. Thus the distribution of $T_{\textrm{max}}(\boldsymbol{\epsilon})$, where $\boldsymbol{\epsilon}$ is the noise vector introduced in (3), cannot be easily modeled as a standard parametric distribution. However, in a large-scale testing framework, it becomes possible to estimate this distribution, as explained in the next section.

II-C Learning the null distribution

Consider now the following assumptions:

A 1.

The noise vector $\boldsymbol{\epsilon}$ is centered and symmetrically distributed: $\boldsymbol{\epsilon}$ and $-\boldsymbol{\epsilon}$ have the same distribution,

A 2.

The similarity measure $S(\boldsymbol{y},\boldsymbol{d})$ used to construct the max-test statistics is an odd function of the observations $\boldsymbol{y}$ , i.e. $S(\boldsymbol{y},\boldsymbol{d})=-S(-\boldsymbol{y},\boldsymbol{d})$ for any sample $\boldsymbol{y}$ and for any reference vector $\boldsymbol{d}$ .

Assumption A1 on the noise is relatively weak and fairly reasonable. It is, for instance, satisfied for any centered elliptical distribution, such as multivariate Gaussian or Student distributions. Note also that assumption A2 is clearly satisfied for the matched filter or the SAD statistics described respectively in (6) and (7). A direct consequence of these assumptions is the following key property:

Proposition II.1.

Based on assumptions A1 and A2, the max-test statistic $T_{\textrm{max}}(\boldsymbol{y})$ and the opposite of the min statistic $-T_{\textrm{min}}(\boldsymbol{y})$ , where $T_{\textrm{min}}(\boldsymbol{y})\equiv\min\limits_{j}S(\boldsymbol{y},\boldsymbol{d}_{j})$ , are identically distributed under the null hypothesis $\mathcal{H}_{0}$ .

Proof.

Under the null hypothesis $\boldsymbol{y}=\boldsymbol{\epsilon}$ . According to A1, $T_{\textrm{max}}(\boldsymbol{\epsilon})$ and $T_{\textrm{max}}(-\boldsymbol{\epsilon})$ are identically distributed. Moreover,

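$$\begin{aligned}T_{\textrm{max}}(-\boldsymbol{\epsilon})&=\max_{j}S(-\boldsymbol{\epsilon},\boldsymbol{d}_{j})\\&=-\min_{j}S(\boldsymbol{\epsilon},\boldsymbol{d}_{j})=-T_{\textrm{min}}(\boldsymbol{\epsilon}),\end{aligned}$$

where the first equality on the second line is due to A2. Thus $T_{\textrm{max}}(\boldsymbol{\epsilon})$ and $-T_{\textrm{min}}(\boldsymbol{\epsilon})$ have the same distribution. ∎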
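The sample-wise identity used in this proof, $T_{\textrm{max}}(-\boldsymbol{\epsilon})=-T_{\textrm{min}}(\boldsymbol{\epsilon})$, can be checked numerically. The snippet below is only an illustrative sketch: it assumes the matched filter as similarity measure, a random non-negative toy dictionary, and Student-$t$ noise as one example of a symmetric distribution; the helper names `t_max` and `t_min` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
l, m = 30, 8

# Hypothetical non-negative dictionary with unit-norm atoms (columns).
D = np.abs(rng.normal(size=(l, m)))
D /= np.linalg.norm(D, axis=0, keepdims=True)

# Symmetric, heavy-tailed noise: 1000 realisations of length l.
eps = rng.standard_t(df=3, size=(1000, l))

def t_max(Y, D):
    """Max over atoms of the matched filter scores, per row of Y."""
    return (Y @ D).max(axis=1)

def t_min(Y, D):
    """Min over atoms of the matched filter scores, per row of Y."""
    return (Y @ D).min(axis=1)

# Oddness of S in y (assumption A2) gives T_max(-eps) = -T_min(eps).
identity_holds = np.allclose(t_max(-eps, D), -t_min(eps, D))
```

The identity holds for every realisation; the noise symmetry (assumption A1) is what then turns it into an equality in distribution between $T_{\textrm{max}}(\boldsymbol{\epsilon})$ and $-T_{\textrm{min}}(\boldsymbol{\epsilon})$.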

In a large-scale testing framework (with regard to the number of samples $n$), the max and min statistics $T_{\textrm{max}}(\boldsymbol{y})$ and $T_{\textrm{min}}(\boldsymbol{y})$ are computed for a large number of observations $\boldsymbol{y}_{i}$, for $1\leq i\leq n$. Let $\pi_{0}\in(0,1]$ be the true proportion of observations $\boldsymbol{y}_{i}$ distributed according to the null hypothesis, while $\pi_{1}=1-\pi_{0}$ is the proportion of observations distributed according to the alternative hypothesis for the testing problem (3). Let $F(t)=\Pr{\left(T_{\textrm{max}}(\boldsymbol{y})\leq t\right)}$ be the cumulative distribution function of the max statistic $T_{\textrm{max}}(\boldsymbol{y})$. This distribution function can be expressed as a two-groups model:

$$F(t)=\pi_{0}F_{0}(t)+\pi_{1}F_{1}(t),$$

where $F_{0}$ and $F_{1}$ denote the distribution functions under the null and the alternative hypotheses, respectively. Under the non-negativity assumption introduced in section II-B, $T_{\textrm{max}}(\boldsymbol{y})$ should be stochastically larger under $\mathcal{H}_{1}$ than under $\mathcal{H}_{0}$, i.e., $F_{0}(t)>F_{1}(t)$ for any $t\in\mathbb{R}$. Let $\mu_{0}$ be the median of the max test statistics under the null hypothesis (referred to as the null median hereafter), i.e., $F_{0}(\mu_{0})=\frac{1}{2}$. (Footnote 2: For the sake of simplicity, the observations are assumed to obey absolutely continuous distributions. Thus the test statistics $T$ are also continuous, and their median $\mu$ is defined as $\Pr(T\leq\mu)=\Pr(T\geq\mu)=\frac{1}{2}$. The extension to discrete statistics is left to the reader.) We now introduce the classical zero assumption, as termed by Efron in a different context [21, Chap. 6]. This assumes the existence of a noise-only domain that allows us to build a procedure for estimating the null distribution (see Remark 2 hereinafter for a discussion about this assumption).

A 3 (Zero assumption for $F_{1}$ ).

$F_{1}(t)=0$ for the region $t\leq\mu_{0}$, where the max statistics are most likely under $\mathcal{H}_{0}$.

From this assumption, we can now derive the following expression:

$$F(t)=\pi_{0}F_{0}(t),\quad\text{for }t\leq\mu_{0}.$$

In a similar manner, the survival function $G(t)=\Pr{\left(-T_{\textrm{min}}(\boldsymbol{y})>t\right)}$ of the opposite min statistic $-T_{\textrm{min}}(\boldsymbol{y})$ reads as

$$G(t)=\pi_{0}G_{0}(t)+\pi_{1}G_{1}(t),$$

where $G_{0}$ and $G_{1}$ are the survival functions of $-T_{\textrm{min}}(\boldsymbol{y})$ under the null and alternative hypotheses, respectively. Under the non-negativity assumption, $-T_{\textrm{min}}(\boldsymbol{y})$ should be stochastically smaller under $\mathcal{H}_{1}$ than under $\mathcal{H}_{0}$, i.e., $G_{0}(t)>G_{1}(t)$. Note that $\mu_{0}$ is also the median of the null distribution of $-T_{\textrm{min}}(\boldsymbol{y})$, as $G_{0}(\mu_{0})=1-F_{0}(\mu_{0})=\frac{1}{2}$ according to proposition II.1. This allows us to introduce the following zero assumption.

A 4 (Zero assumption for $G_{1}$ ).

$G_{1}(t)=0$ for the region $t\geq\mu_{0}$, where the opposite min statistics are most likely under $\mathcal{H}_{0}$.

Thus

$$G(t)=\pi_{0}G_{0}(t),\quad\text{for }t\geq\mu_{0}.$$

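Since $G_{0}(t)=1-F_{0}(t)$ according to proposition II.1, we have $G(t)=\pi_{0}\left(1-F_{0}(t)\right)$ for $t\geq\mu_{0}$, and we can finally derive the following expression:

$$\pi_{0}F_{0}(t)=\left\{\begin{array}{ll}F(t)&\text{if }t\leq\mu_{0},\\ \pi_{0}-G(t)&\text{otherwise.}\end{array}\right.\tag{9}$$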

The main interest of this expression is that it does not depend on each group distribution function but only on the distribution function of the two-groups model. In particular, assumptions A3 and A4 do not require $F_{1}$ to be fully specified, which is unlikely to be known in practice. Expression (9) is therefore robust to misspecification of the alternative. This expression still depends on the theoretical null median $\mu_{0}$, the proportion $\pi_{0}$ of samples under $\mathcal{H}_{0}$, and the distribution functions $F(t)$ and $G(t)$, which are not known. However, when a large number of observations $\boldsymbol{y}_{1},\ldots,\boldsymbol{y}_{n}$ are available, these quantities can be estimated from the empirical distributions of the test statistics. Let

$$\overline{F}(t)=\frac{\#\{T_{\textrm{max}}(\boldsymbol{y}_{i})\leq t\}}{n},\qquad\overline{G}(t)=\frac{\#\{-T_{\textrm{min}}(\boldsymbol{y}_{i})>t\}}{n}$$

be the empirical distribution function of $T_{\textrm{max}}$ and the empirical survival function of $-T_{\textrm{min}}$ , respectively.

A 5 (Weak dependence assumption).

The empirical functions $\overline{F}(t)$ and $\overline{G}(t)$ converge uniformly toward the theoretical distribution functions $F(t)$ and $G(t)$ , respectively:

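$$\sup_{t}\left|\overline{F}(t)-F(t)\right|\longrightarrow 0,\qquad\sup_{t}\left|\overline{G}(t)-G(t)\right|\longrightarrow 0,$$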

almost surely as the number $n$ of observations grows to infinity.

Statement A5 is verified under weak dependence conditions between the observed samples $\boldsymbol{y}_{1},\ldots,\boldsymbol{y}_{n}$ . In particular, this holds for independent or short-range dependent samples according to the Glivenko-Cantelli theorem. A direct consequence is the pointwise convergence in probability of $\overline{F}(t)$ and $\overline{G}(t)$ toward $F(t)$ and $G(t)$ , respectively, for any $t\in\mathbb{R}$ .

Due to zero assumptions A3 and A4 and proposition II.1, $\mu_{0}$ satisfies $F(\mu_{0})=G(\mu_{0})=\frac{\pi_{0}}{2}$ . Therefore, based on assumption A5, an estimator of the null median $\mu_{0}$ can be searched as a solution of the following equation for $\mu$ :

$$\overline{F}(\mu)=\overline{G}(\mu).\tag{10}$$

Lemma II.2 (Empirical null median estimator).

Let $t_{(1)}<t_{(2)}<\ldots<t_{(2n)}$ be the ordered values of the statistics belonging to the sample $\boldsymbol{t}=\left(T_{\textrm{max}}(\boldsymbol{y}_{1}),\ldots,T_{\textrm{max}}(\boldsymbol{y}_{n}),-T_{\textrm{min}}(\boldsymbol{y}_{1}),\ldots,-T_{\textrm{min}}(\boldsymbol{y}_{n})\right)$ . Let $\widehat{\mu}_{0}$ be the sample median of $\boldsymbol{t}$ , which is defined as

$$\widehat{\mu}_{0}=\frac{t_{(n)}+t_{(n+1)}}{2}.$$

Then $\widehat{\mu}_{0}$ satisfies (10) and is a consistent estimator of the null median $\mu_{0}$ , under zero assumptions A3 and A4.

Proof.

See appendix -A1. ∎
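A small numerical sketch may help to see why the pooled sample median balances the two empirical functions. It assumes the estimator is the midpoint of the two central order statistics of the $2n$ pooled values; the function name `null_median` and the toy values are hypothetical.

```python
import numpy as np

def null_median(t_max, t_min):
    """Sample median of the 2n pooled values (T_max(y_i), -T_min(y_i))."""
    n = t_max.size
    pooled = np.sort(np.concatenate([t_max, -t_min]))
    # 1-indexed order statistics t_(n) and t_(n+1) are pooled[n-1] and pooled[n]
    return 0.5 * (pooled[n - 1] + pooled[n])

# Toy statistics for n = 4 observations (hypothetical values).
t_max = np.array([0.9, 1.4, 2.1, 5.0])
t_min = np.array([-1.2, -0.7, -1.9, -1.0])

mu0_hat = null_median(t_max, t_min)

# Empirical distribution of T_max and empirical survival of -T_min at mu0_hat.
F_bar = np.mean(t_max <= mu0_hat)
G_bar = np.mean(-t_min > mu0_hat)
```

Here the pooled sorted values are 0.7, 0.9, 1.0, 1.2, 1.4, 1.9, 2.1, 5.0, so `mu0_hat` is 1.3 and both `F_bar` and `G_bar` equal 1/4. The balance is structural: exactly $n$ of the $2n$ pooled values lie below the median, so if $a$ of them are max statistics, then $n-a$ are opposite min statistics, leaving $a$ opposite min statistics above it.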

Based on the null median estimator given in lemma II.2, we can now obtain the empirical null estimates of $\pi_{0}$ and $F_{0}(t)$ . Let

$$\boldsymbol{s}_{0}=\left\{T_{\textrm{max}}(\boldsymbol{y}_{i})\leq\widehat{\mu}_{0}\right\}$$

be the sample set of the max-test statistics truncated on $(-\infty,\widehat{\mu}_{0}]$ , the elements of which are denoted as $s_{0,i}$ , for $1\leq i\leq n_{0}$ , and where $n_{0}=|\boldsymbol{s}_{0}|$ . Similarly,

$$\boldsymbol{g}_{0}=\left\{-T_{\textrm{min}}(\boldsymbol{y}_{i})>\widehat{\mu}_{0}\right\}$$

denotes the set of the opposite min statistics truncated on $(\widehat{\mu}_{0},+\infty)$ , the elements of which are denoted as $g_{0,i}$ for $1\leq i\leq n_{0}$ (according to lemma II.2, these two sets are of equal size).

Proposition II.3 (Empirical estimators under $\mathcal{H}_{0}$ ).

Under assumptions A3 and A4,

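$$\widehat{\pi}_{0}=\frac{2n_{0}}{n}\tag{12}$$

is a consistent estimator of the null proportion $\pi_{0}$, and

$$\widehat{F}_{0}(t)=\frac{\#\{s_{0,i}\leq t\}+\#\{g_{0,i}\leq t\}}{2n_{0}}$$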

is a pointwise consistent estimator of the null distribution $F_{0}(t)$ , for $t\in\mathbb{R}$ .

Proof.

See appendix -A2. ∎
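The estimators can be sketched as plug-in versions of expression (9). The code below is a hedged illustration rather than the authors' implementation: it assumes $\widehat{\pi}_{0}=2n_{0}/n$ (from $F(\mu_{0})=\pi_{0}/2$) and takes $\widehat{F}_{0}$ to be the empirical distribution function of the pooled sample $\boldsymbol{s}_{0}\cup\boldsymbol{g}_{0}$. The function name `empirical_null`, the clipping of $\widehat{\pi}_{0}$ at one, and the toy values are assumptions made for the example.

```python
import numpy as np

def empirical_null(t_max, t_min):
    """Plug-in estimates (pi0_hat, F0_hat, mu0_hat) from max and min statistics."""
    n = t_max.size
    pooled = np.sort(np.concatenate([t_max, -t_min]))
    mu0_hat = 0.5 * (pooled[n - 1] + pooled[n])      # sample median of the 2n values
    s0 = t_max[t_max <= mu0_hat]                     # max statistics at or below the median
    g0 = (-t_min)[-t_min > mu0_hat]                  # opposite min statistics above it
    n0 = s0.size
    pi0_hat = min(2.0 * n0 / n, 1.0)                 # clipping at 1 is an added safeguard
    null_sample = np.sort(np.concatenate([s0, g0]))  # 2 * n0 values standing in for H0

    def F0_hat(t):
        # empirical CDF of the pooled null sample, for scalar or array t
        return np.searchsorted(null_sample, t, side="right") / null_sample.size

    return pi0_hat, F0_hat, mu0_hat

# Toy statistics for n = 4 observations (hypothetical values).
t_max = np.array([0.9, 1.4, 2.1, 5.0])
t_min = np.array([-1.2, -0.7, -1.9, -1.0])

pi0_hat, F0_hat, mu0_hat = empirical_null(t_max, t_min)
p_values = 1.0 - F0_hat(t_max)   # right-sided empirical p-values
```

With these toy values, `s0` contains 0.9 and `g0` contains 1.9, so `pi0_hat` is 0.5 and `F0_hat(mu0_hat)` is 0.5, as expected for a median. Only the lower half of the max statistics and the upper half of the opposite min statistics enter the null sample, which is where the zero assumptions A3 and A4 place the noise-only observations.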

Remark 1: The dependence structure across a set of observations $\boldsymbol{y}_{1},\ldots,\boldsymbol{y}_{n}$ with $\boldsymbol{y}_{i}\in\mathbb{R}^{l}$ need not be specified to compute the empirical estimators $\widehat{\pi}_{0}$ and $\widehat{F}_{0}$ given in proposition II.3. These non-parametric estimates rely essentially on the noise symmetry assumption A1, which is very weak. As a consequence, these estimators are robust to the misspecifications that are prone to occur with parametric assumptions.

Remark 2: Zero assumptions A3 and A4 provide an idealized mathematical framework for which the empirical estimators are shown to be consistent. However, these assumptions are unlikely to be satisfied in practice. As a consequence, Eq. (9) is an approximation. Note, however, that the closer $\pi_{0}$ is to one, the more accurate the approximation is. This is the case in many large-scale testing problems, where the null proportion $\pi_{0}$ is usually close to one. The approximation also gains in accuracy the more $F_{0}(t)$ (resp. $G_{0}(t)$) dominates $F_{1}(t)$ (resp. $G_{1}(t)$) for $t\leq\mu_{0}$ (resp. for $t\geq\mu_{0}$). Moreover, if a few observations distributed according to the alternative distribution belong to the regions where they are assumed to be absent, then $\widehat{\pi}_{0}$ tends to be biased upward, and $\widehat{F}_{0}(t)$ tends to be slightly biased toward the alternative distribution $F_{1}(t)$. It is of note that, from a statistical testing perspective, this slight bias goes in the right direction. Indeed, a detection procedure based on $\widehat{F}_{0}(t)$ (and possibly $\widehat{\pi}_{0}$) then becomes more conservative, as $p$-values are slightly biased upward. This results in a small loss of power, but the control of type I errors is still (asymptotically) guaranteed.

Figure 1 shows the empirical density functions associated with the max-test statistics $T_{\textrm{max}}(\boldsymbol{y}_{i})$ and the opposite min statistics $-T_{\textrm{min}}(\boldsymbol{y}_{i})$ for synthetic data with a testing framework that mimics the MUSE one. We can see that the max density has a heavier right tail than the opposite min one. This is due to the contribution of the $\mathcal{H}_{1}$ samples, while the opposite min density right tail is (approximately) distributed according to the theoretical null density due to (approximation) A4. By symmetry, the opposite min density has a heavier left tail than the max density one.

To appreciate the accuracy of the empirical estimators given in proposition II.3, the empirical null-density function that is obtained from the right-truncated sample $\boldsymbol{s}_{0}$ and the left-truncated sample $\boldsymbol{g}_{0}$ that are used to construct $\widehat{F}_{0}(t)$ is depicted in Figure 2a. Here, the null proportion estimate is larger than the theoretical value: $\widehat{\pi}_{0}=0.89$ and $\pi_{0}=0.81$. Nevertheless, the empirical null-density function is very close to the theoretical one. This is confirmed by the quantile-quantile plot between $\widehat{F}_{0}$ and the theoretical distribution $F_{0}$ shown in Figure 2b. In particular, this remains true in the distribution tails, where the accuracy of the quantile estimates is crucial to the robustness of the test at low control levels.

II-D Error control

In multiple testing (around $n=2500$ tested pixels for a $50\times 50$ patch in the MUSE context), the classical Type I error control of each individual test might not be appropriate; see e.g., [22, 23]. Indeed the number of wrongly rejected null hypotheses can become relatively important (i.e., even larger than the number of true detections) due to the high number of tests. To address this kind of issue, a global error control approach, namely the FDR, was introduced in [2]. The FDR controls the expected proportion of true null hypotheses wrongly rejected, which are referred to as the false discoveries, among all of the rejected tests:

[TABLE]

where $R$ is the total number of tests where the null hypothesis is rejected, while $U$ is the number of false discoveries among the $R$ discoveries. A simple and widely used approach to control this FDR is the Benjamini and Hochberg (BH) procedure that was also developed in [2]. Let $p_{i}$ be the $p$ -value associated with the $i$ th test statistics. Let $p_{(1)}\leq p_{(2)}\leq\ldots\leq p_{(n)}$ now be the ordered $p$ -values, and $\mathcal{H}_{0}^{(1)},\ldots,\mathcal{H}_{0}^{(n)}$ denote the null hypotheses for this ordering. For a preselected control level $0\leq q\leq 1$ , the BHq procedure rejects $\mathcal{H}_{0}^{(1)},\ldots,\mathcal{H}_{0}^{(\hat{k})}$ where

[TABLE]

with $p_{(0)}=0$ by convention. In our right-sided detection framework, the $i$ th empirical $p$ -value, can be derived from the empirical null distribution given in proposition II.3 as

$$p_{i}=1-\widehat{F}_{0}\left(T_{\textrm{max}}(\boldsymbol{y}_{i})\right).\qquad(14)$$

Then in the case of $n$ independent tests, or under specific positive dependences [24], the BHq procedure controls the FDR at a level $\pi_{0}q\leq q$ . Thus, if $\pi_{0}$ is known, the BH procedure can be applied at the nominal level $\frac{q}{\pi_{0}}$ to improve its power while controlling FDR at level $q$ . Building on this idea, Storey [25] proposed the following modified estimator of the null proportion $\pi_{0}$ :

$$\hat{\pi}_{0}^{*}(\zeta)=\frac{1+\#\{p_{i}>\zeta\}}{(1-\zeta)\,n},$$

where $\zeta$ has to be fixed arbitrarily (usually at $\frac{1}{2}$). Storey showed that under the weak dependence assumption A5, the BH$_{q^{\prime}}$ procedure at nominal level $q^{\prime}=q/\hat{\pi}_{0}^{*}(\zeta)$ asymptotically controls the FDR at level $q$. We show in the following that the same strategy can be applied with the empirical null estimator $\widehat{\pi}_{0}$.

Proposition II.4 (Storey $\pi_{0}$ estimator).

The empirical null estimator $\widehat{\pi}_{0}$ defined in (12) and the Storey estimator $\hat{\pi}_{0}^{*}(\zeta)$ derived from the empirical $p$ -values defined in (14) are equal for any $\zeta=\frac{k}{2n_{0}}$ with $k\in\{n_{0},\ldots,2n_{0}-1\}$ , and are asymptotically equivalent for any $\zeta\in[\frac{1}{2},1)$ .

Proof.

See appendix -B. ∎

This equivalence is not surprising since, like the proposed empirical null estimators, the Storey estimator is based on a zero assumption (i.e., the $p$-value density function under $\mathcal{H}_{1}$ is zero on $(\zeta,1]$).

This leads us to consider the multiple testing procedure described in Alg. 1.
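As a rough illustration of the BH step-up rule applied at the corrected level $q/\hat{\pi}_{0}$, the following Python sketch may help. It is not Alg. 1 itself: the function names are hypothetical, the $p$-values are taken as given rather than computed from $\widehat{F}_{0}$, and capping the null-proportion estimate at 1 is an added assumption.

```python
def storey_pi0(pvalues, zeta=0.5):
    """Storey-type estimate (1 + #{p_i > zeta}) / ((1 - zeta) * n).

    The cap at 1 is an assumption added for this sketch.
    """
    n = len(pvalues)
    above = sum(1 for p in pvalues if p > zeta)
    return min(1.0, (1 + above) / ((1.0 - zeta) * n))


def benjamini_hochberg(pvalues, q):
    """Return the sorted indices rejected by the BH procedure at level q."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    k_hat = 0  # convention p_(0) = 0: reject nothing if no rank qualifies
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= q * rank / n:
            k_hat = rank  # keep the largest rank satisfying the step-up rule
    return sorted(order[:k_hat])


def adaptive_bh(pvalues, q, zeta=0.5):
    """BH applied at the corrected nominal level q / pi0_hat."""
    pi0_hat = storey_pi0(pvalues, zeta)
    return benjamini_hochberg(pvalues, min(1.0, q / pi0_hat))
```

When the estimated null proportion is well below one, the corrected level is larger than $q$, so the adaptive variant can only reject more hypotheses than plain BH at level $q$.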

Note that it has been shown in [22] that for matched filter statistics, with a non-negative template and under Gaussian assumptions, the test statistics obey a positive regression dependence on a subset (PRDS) condition. In that setting, the BH procedure therefore ensures exact FDR control in finite samples. Here the problem is more complex: the test statistics are derived from extreme values that can be correlated, so PRDS is difficult to ensure theoretically, even under Gaussian assumptions on the noise. However, under the weak dependence assumption A5, an Oracle procedure similar to Alg. 1, but where the $p$-values are computed from the theoretical null distribution $F_{0}$, can be proven to asymptotically control the FDR at level $q$ [25]. As discussed in the previous subsection, the empirical $p$-value estimates tend to be slightly biased in a conservative way. Moreover, if the null distribution can be estimated on a larger sample than the sample to be tested, the variance of these estimates can be reduced. This supports the asymptotic control of the proposed procedure.

II-E Validation

Figure 3 shows the FDR obtained with Alg. 1 on 3D (spatial + spectral) simulated data that are subjected to weak dependence (spatial kernel convolution), for different levels of nominal control $q$ and different signal-to-noise ratios (SNRs). The SNR is defined here as $10\log\frac{A}{nl\sigma^{2}}$, where $n$ is the number of pixels (i.e., the number of tests to perform), $l$ is the number of spectral bands (i.e., the dimension of the observations $\boldsymbol{y}_{i}$), $\sigma^{2}$ is the marginal variance of the noise, and $A=||\boldsymbol{x}||^{2}$ is the energy of the 3D contribution of the signals to be detected. The experimental set-up is similar to that of Figure 1, except that a spatial convolution kernel of size 3 by 3 was applied to create local spatial correlations. The empirical null estimators defined in section II-C were computed from extended cubes of 200 by 200 pixels by 30 wavelengths.
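The SNR definition above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the signal cube is assumed to be passed as a flat sequence of values, and the logarithm is assumed to be base 10 (decibels), which the text does not state explicitly.

```python
import math


def snr_db(x, n, l, sigma2):
    """SNR = 10 * log10(A / (n * l * sigma2)), with A the energy of x.

    x      : flattened noiseless signal cube (n * l values)
    n      : number of pixels (tests)
    l      : number of spectral bands
    sigma2 : marginal noise variance
    Base-10 logarithm is an assumption of this sketch.
    """
    energy = sum(v * v for v in x)  # A = ||x||^2
    return 10.0 * math.log10(energy / (n * l * sigma2))
```

With this normalization, the SNR compares the signal energy with the total noise energy expected over the whole cube, so a spatially compact source in a large field yields a strongly negative value.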

This figure emphasizes that control of the FDR is correctly achieved for the different SNR levels. As expected, due to the zero assumption approximations, Alg. 1 is a little more conservative than the Oracle one (based on the true $F_{0}$ and $\pi_{0}$ ) at low SNR, where the alternative distribution is closer to the null one.

Table I shows the advantage of ensuring a global control with a detection threshold that adapts to the data. It compares this global control with a pixel-wise control based on the probability of false alarm (PFA). Controlling the PFA at a level $\eta$ (e.g., 5%) amounts to detecting all pixels with $p$-values smaller than $\eta$. Such a PFA control ensures that, on average, a fraction $\eta$ of all the tested pixels will be wrongly detected (but says nothing of the proportion of these wrong detections among the detected set).

When confronted with noise-only data, a control procedure with PFA at level 5% detects 144 spurious pixels, which is comparable to the size of a possible source. To ensure that no source is falsely detected, we may turn to a more conservative level, e.g., 0.1%; this results in poor source detection power (around 55%), whereas a 5% level led to very good power (82%) at the price of a large number of false alarms (false discovery proportion $\simeq 43\%$). On the same dataset, an FDR control at 20% does not lead here to any false detection in the absence of a source while maintaining a high detection power (72%) in the presence of a source, thus adapting to the data. Table I shows that the false discovery proportion is around 18% for a nominal FDR control of 20%. Note that by applying our procedure we can estimate the FDR level that matches a given detection threshold. For instance, a threshold on the $p$-values associated with a 5% PFA level yields an FDR estimate around 44% on the data tested here. Table I shows that the false discovery proportion is indeed around 44% for a threshold corresponding to a 5% PFA.
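As a back-of-the-envelope check of these orders of magnitude, one can assume that the nominal 5% PFA is exactly achieved on the noise-only pixels, with $n=2500$ tests and a 185-pixel source as in Table I. This is an idealized sketch, not a computation taken from the table:

```latex
\begin{align*}
\text{noise only:}\quad & \eta\,n = 0.05\times 2500 = 125 \ \text{expected false alarms},\\
\text{with source:}\quad & \eta\,(n-185) = 0.05\times 2315 \approx 116 \ \text{expected false alarms},\\
& 0.82\times 185 \approx 152 \ \text{expected true detections},\\
& \frac{116}{116+152}\approx 0.43 .
\end{align*}
```

The 144 spurious pixels reported on noise-only data are of the same order as the 125 expected under this idealization, and the resulting ratio agrees with the reported false discovery proportion of about 43%.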

In the next paragraph, the ability of the proposed method to control the error rate is compared with a generalized likelihood ratio (GLR) approach (inspired by [9] and [10]). The noise is assumed to be centered Gaussian, $\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma})$, where the covariance matrix $\boldsymbol{\Sigma}\in\mathbb{R}^{l\times l}$ is assumed to be diagonal. We have the following detection test:

[TABLE]

where $||.||_{0}$ is the $\ell_{0}$-pseudo-norm (number of non-zero components) and $\boldsymbol{a}\geq\boldsymbol{0}$ is the non-negativity constraint on the coefficients. The GLR test with 1-sparsity constraint yields the following test statistic [9]:

[TABLE]

where $p(\boldsymbol{y}|\boldsymbol{D}\boldsymbol{a},\mathcal{H}_{1})$ denotes the probability density function of $\boldsymbol{y}$ under $\mathcal{H}_{1}$ and $p(\boldsymbol{y}|\mathcal{H}_{0})$ denotes the probability density function of $\boldsymbol{y}$ under $\mathcal{H}_{0}$. Using the Gaussian assumption, it follows that

[TABLE]

where $\hat{j}$ is the index of the non-zero component of the optimal $\hat{\boldsymbol{a}}$ for the GLR statistic, and $\hat{\boldsymbol{\Sigma}}$ is estimated from the residuals. There is no closed-form expression for the distribution of this statistic, since it consists of taking the maximum of a correlated Gaussian vector. Thus we calibrated this statistic under $\mathcal{H}_{0}$ (centered normal noise) by Monte-Carlo simulation.
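The idea of a Monte-Carlo calibration under $\mathcal{H}_{0}$ can be sketched as follows. This is a generic illustration rather than the calibration actually used here: the statistic is a stand-in (the maximum of matched-filter outputs, not the exact GLR expression), the noise is assumed i.i.d. standard normal, the function names are hypothetical, and the $(1+\cdot)/(1+N)$ form of the Monte-Carlo $p$-value is a common convention chosen for the sketch.

```python
import bisect
import random


def max_matched_filter(y, atoms):
    """Stand-in statistic: maximum inner product between y and the atoms."""
    return max(sum(d * v for d, v in zip(atom, y)) for atom in atoms)


def simulate_null_statistics(atoms, n_draws, seed=0):
    """Sorted statistics computed on i.i.d. N(0, 1) noise-only spectra."""
    rng = random.Random(seed)
    length = len(atoms[0])
    return sorted(
        max_matched_filter([rng.gauss(0.0, 1.0) for _ in range(length)], atoms)
        for _ in range(n_draws)
    )


def monte_carlo_pvalue(observed, null_stats):
    """Right-sided p-value (1 + #{null >= observed}) / (1 + N).

    null_stats must be sorted in increasing order.
    """
    exceed = len(null_stats) - bisect.bisect_left(null_stats, observed)
    return (1 + exceed) / (1 + len(null_stats))
```

Because the simulated null sample is drawn from the assumed Gaussian model, any mismatch between that model and the actual noise is carried directly into the resulting $p$-values.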

Figures 4 and 5 illustrate the main advantage of the proposed method: the control is ensured whenever the noise distribution is symmetrical, without further assumptions. The GLR approach has to be calibrated under the $\mathcal{H}_{0}$ distribution, so any deviation from the theoretical $\mathcal{H}_{0}$ results in a loss of control, as illustrated by Figure 5, where the noise is drawn from a Student distribution. It can be seen that the BH procedure based on the theoretical $\mathcal{H}_{0}$ GLR statistics does not correctly control the FDR. At first, the effective FDR strongly exceeds the given control level (allowing GLR to be “more powerful” at a given nominal control level). Subsequently, it becomes too conservative. This behavior can be explained by the Gaussian fit of the Student distribution of the noise: the tails are underestimated (hence the excess in FDR) whereas the bulk is overestimated (hence the loss in power in the second part). It should be noted that a classical ROC curve of the two methods would show very similar performance (same power for an effective error budget) but would hide the inaccuracy of the error control of the GLR approach. Moreover, Figure 4 shows that when GLR is at its best (adequate model), the proposed method stays very close in terms of power despite its versatility.

III Application to the MUSE data

It is believed that young galaxies are often surrounded by halos of hydrogen gas, known as the circumgalactic medium. The emissions from these halos can be several orders of magnitude fainter than those from the galaxies. Furthermore, the emission spectrum of the halos is composed of narrow lines, notably the Lyman-$\alpha$ line. The MUSE [1] was developed to detect such emissions. It is a 3D spectrograph that can image and analyze a field of 1 arcmin$^{2}$ by producing a hyperspectral data cube of 300 by 300 pixels by 3600 wavelengths. Its spectral range covers the visible and near-infrared domain, from 450 nm to 930 nm.

Since the MUSE first light in January 2014, several studies [9, 26] have already been conducted on the MUSE data. The aim has been to detect faint young galaxies, which are also characterized by the presence of a powerful Lyman-$\alpha$ emission line in their emission spectrum. Both methods mostly assume spatially and spectrally point-like sources. As a consequence, they efficiently find the cores of galaxies, but are not (yet) adapted to detect the faint extended halos.

The purpose here is to explore the vicinity of these already detected galaxies, and track the Lyman- $\alpha$ emission line as far away from the galaxy as possible.

III-A The MUSE data

The proposed detection method is applied to the MUSE observations of the sky region known as Hubble Deep Field South (HDFS), as it was previously observed by the Hubble space telescope. The MUSE produces huge amounts of data that have to be processed by a data reduction system before they can be used for scientific analysis. In particular, a resampling process creates local correlations in all directions (spatial and spectral) between voxels that cannot be easily modelled due to the data dimensions. The data reduction system applied to the data used here is detailed in [27]. The output is a 300 by 300 pixels by 3600 wavelengths data cube that is associated with a variance cube of the same dimensions. The latter is estimated by propagating the error estimated at the detector level through each stage of the processing.

A catalog of astronomical objects in HDFS was built in [27]. About 90 of these objects are remote galaxies known as Lyman $\alpha$ emitters that are likely to have a halo. For each of these sources, a spatial-spectral neighborhood is defined, which is centered spatially on the galaxy center and spectrally on the emission line peak, and a subcube of 50 by 50 pixels by 30 wavelengths is extracted. This cube extraction is performed for the two following reasons. Spectrally: the signal of interest (hydrogen Lyman $\alpha$ emission line) is concentrated in a few wavelengths around the emission peak. Outside of this domain, galaxy spectra (used to build the reference spectrum) can contain other features that are not present in the targeted surrounding hydrogen halo. Spatially: as the targeted halo is expected to stay close to the galaxy, exploring empty (noise-only) remote regions would only result in a loss of power (in our global control procedure).

III-B Pre-processing workflow

To deal with the MUSE data, several pre-processing steps are needed. First, the spectral continuum is robustly estimated and removed in each pixel using [28]. Then coarse reduction of the data is carried out using the variance cube provided with the data, followed by robust centering and finer reduction, slice by slice [20]. After the reduction by the variance cube, we can make the following assumption of stationarity of the noise.

A 6 (Stationary noise).

The noise is stationary on each wavelength-slice of the cube.

Spatial matched filter preprocessing

For ground-based astronomical instruments such as MUSE, the spatial system impulse response (FSF) is mainly due to atmospheric turbulence. This FSF is measured for each observation (see [29]), and is independent of the instrumental noise. As such, we can make the following assumption:

A 7.

The noise in the observation model (3) is not filtered by the FSF.

Based on A7, the following strategy was chosen to improve the SNR: a spatial convolution with the FSF (which is modeled as a symmetrical function) is applied to each image of the wavelength axis of the data cube. It is not strictly speaking a spatial matched filter to the searched halo. Indeed, the halo extension and its intensity profile are not known. However, this greatly improves the SNR. The price to be paid is that the theoretical spatial extension of the halo is enlarged by this operation. In practice, the halo has a larger extension than the FSF, with an intensity profile that, as does the FSF, decreases quickly toward zero on its support. Therefore, this effect can be neglected in the detection results.

III-C Detection

In the application at hand, we have the following assumptions:

1. The galaxy spatial center is already known, as well as the spectral position of the emission line in the galaxy spectrum (with e.g., the method developed in [26]);

2. The emission line in the halo spectra has a shape similar to the emission line in the galaxy spectrum, the continuum of which has been subtracted, but can present a shift along the spectral wavelengths;

3. Samples (pixels) are weakly dependent.

The second hypothesis is only partially true: in reality the redshift is not a simple spectral shift, but the composition of a shift and a dilation. However, at the spectral resolution of the instrument, the deformation can be neglected. The third assumption is justified because dependences between pixels are mainly due to interpolations during the resampling step, and these dependences have short-range effects. Moreover, in the pre-processing workflow, only the convolution by the FSF creates significant spatial dependences, once again with a short-range effect due to the finite support of the FSF. Thus, the weak dependence assumption A5 is fulfilled. The other assumptions required to apply the proposed method are also fulfilled. MUSE data result from the summing of a high number of exposures; thus, the noise tends to be Gaussian (and as such symmetrical) by application of the central limit theorem. The subsequent preprocessing steps (background subtraction, centering, variance reduction, …) all preserve the symmetry of the centered noise, thus ensuring that assumption A1 holds. Assumption A2 (odd similarity measure) is fulfilled by construction, as we choose the SAD measure. As pointed out in Remark 2 of section II-C, assumptions A3 and A4 cannot be strictly guaranteed. However, outside of this ideal framework, the key point is that equation (9) is a good enough approximation. Indeed, the targeted signal is assumed to be distinct enough from the background noise and well approximated by the dictionary (see section III-C1); thus $T_{\textrm{max}}(\boldsymbol{y})$ will be significantly stochastically larger under $\mathcal{H}_{1}$ than under $\mathcal{H}_{0}$ for detectable signals. Moreover, the galaxy and halo pixels are supposed to be in strong minority in the spatially explored region, that is, $\pi_{0}$ is close to one, which reinforces these approximations.
Finally, as stressed in Remark 2, the approximation errors in equation (9) can only lead to a small loss in power and the control is still guaranteed (the bias is conservative as shown in figure 3 for low SNR).

For a given galaxy, a spatial-spectral neighborhood is defined, centered spatially on the galaxy center and spectrally on the emission line peak. Based on these hypotheses, the approach developed here consists of applying the following steps.

• Estimate a reference emission line spectrum by averaging the spectra of a few pixels at the center of the galaxy.

• Create a (highly) coherent family of shifted versions of this reference target signature to build a dictionary.

• Test each pixel of the defined neighborhood using the method developed in section II.

III-C1 Dictionary

One of the main assumptions here is that the target signature variability can mostly be modeled as a spectral shift. Thus, the dictionary is built here by creating shifted variants of one target signature, $\boldsymbol{d}_{*}$ . Assuming that $\boldsymbol{d}_{*}$ comes from sampling of a continuous model $f(\cdot)$ , we can define as $\boldsymbol{d}_{*}^{\delta}$ , the shifted vector that is obtained by sampling $f(\cdot-\delta)$ . The linearly spaced shifts (LSS) dictionary model on an interval $[-\tau,\tau]$ is then defined for a given size $m$ , as the dictionary $\boldsymbol{D}^{m}$ composed of the atoms $\boldsymbol{d}_{k}=\boldsymbol{d}_{*}^{\tau_{k}}$ , where $\tau_{k}=-\tau+\frac{2\tau}{m-1}k$ , for $k=0,\ldots,m-1$ .
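The LSS construction can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the continuous model $f$ is replaced by an arbitrary Gaussian line profile (`example_profile`, with made-up center and width), and each atom is renormalized to unit norm.

```python
import math


def lss_shifts(tau, m):
    """Linearly spaced shifts tau_k = -tau + 2 * tau * k / (m - 1), k = 0..m-1."""
    if m < 2:
        raise ValueError("m must be at least 2")
    return [-tau + 2.0 * tau * k / (m - 1) for k in range(m)]


def lss_dictionary(profile, length, tau, m):
    """Atoms d_k obtained by sampling profile(. - tau_k) on 0..length-1."""
    atoms = []
    for shift in lss_shifts(tau, m):
        atom = [profile(i - shift) for i in range(length)]
        norm = math.sqrt(sum(v * v for v in atom))
        atoms.append([v / norm for v in atom])  # unit-norm atoms (assumption)
    return atoms


def example_profile(x, center=15.0, width=2.0):
    """Hypothetical continuous line model f: a Gaussian bump."""
    return math.exp(-0.5 * ((x - center) / width) ** 2)


# Example: 15 atoms on [-7, 7], i.e. one atom per integer shift.
atoms = lss_dictionary(example_profile, length=30, tau=7, m=15)
```

With $\tau=7$ and $m=15$, the shifts are exactly the integers from $-7$ to $7$, so no interpolation of the sampled reference would be needed in that particular configuration.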

The key question is then the choice of the number $m$ of shifted versions, or in other words, the redundancy of the dictionary. To allow a study of this parameter, we place ourselves in a simplified context:

• the noise is supposed to be i.i.d. $\mathcal{N}(0,1)$;

• the similarity measure is a spectral matched filter between a dictionary atom and the tested spectrum, as in (6);

• the reference spectrum $\boldsymbol{d}_{*}$ is a non-negative vector with unit length, where its autocorrelation function $\Gamma(u)=\langle\boldsymbol{d}_{*},\boldsymbol{d}_{*}^{u}\rangle$ is non-increasing in $|u|$, and has compact support such that $||\boldsymbol{d}_{*}^{u}||=||\boldsymbol{d}_{*}||=1$, for $u\in[-\tau,\tau]$;

• the target signature $\boldsymbol{x}$ is built from a translation $\boldsymbol{d}^{u}_{0}$ of the reference spectrum $\boldsymbol{x}=a\boldsymbol{d}^{u}_{0}$, where $a>0$, and $u$ is a random shift that is uniformly distributed on $[-\tau,\tau]$.

A measure of the redundancy of a given normalized dictionary $\boldsymbol{D}$ can be given by its coherence, which is defined as $\mu=\max\limits_{i\neq j}{|\langle\boldsymbol{d}_{i},\boldsymbol{d}_{j}\rangle|}$. For an LSS dictionary $\boldsymbol{D}^{m}$, and under the aforementioned assumptions, this coherence reduces to the correlation between two consecutive atoms: $\mu=\langle\boldsymbol{d}_{j},\boldsymbol{d}_{j+1}\rangle$, for $1\leq j<m$. As illustrated by Figure 6, by design of the dictionary, the larger the dictionary size $m$, the more correlated the atoms are, and the more coherent the dictionary is.
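The growth of the coherence with the dictionary size can be checked numerically with a short Python sketch. The Gaussian line shape, its width, and the function names below are illustrative assumptions, not the reference spectrum used in the application.

```python
import math


def shifted_gaussian_atoms(m, tau=7.0, length=30, center=15.0, width=2.0):
    """Unit-norm Gaussian bumps with m linearly spaced shifts on [-tau, tau].

    The Gaussian shape and its parameters are placeholders for this sketch.
    """
    atoms = []
    for k in range(m):
        shift = -tau + 2.0 * tau * k / (m - 1)
        atom = [math.exp(-0.5 * ((i - center - shift) / width) ** 2)
                for i in range(length)]
        norm = math.sqrt(sum(v * v for v in atom))
        atoms.append([v / norm for v in atom])
    return atoms


def coherence(atoms):
    """mu = max over i != j of |<d_i, d_j>| for unit-norm atoms."""
    best = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            inner = sum(a * b for a, b in zip(atoms[i], atoms[j]))
            best = max(best, abs(inner))
    return best


mu_small = coherence(shifted_gaussian_atoms(5))   # widely spaced shifts
mu_large = coherence(shifted_gaussian_atoms(15))  # closely spaced shifts
```

For a non-negative profile whose autocorrelation decreases with the shift, the maximum is reached by neighboring atoms, so tightening the shift grid pushes the coherence toward one.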

Let $\boldsymbol{z}^{m}=(\boldsymbol{D}^{m})^{T}\boldsymbol{y}\in\mathbb{R}^{m}$ be the vector of the matched filter statistics, the elements of which are defined as $z_{j}^{m}$ for $1\leq j\leq m$ . For a given decision threshold $\eta$ , the PFA for the max-test approach is expressed as

$$\alpha_{m}=\Pr\left(\max_{1\leq j\leq m}z_{j}^{m}>\eta\ \Big|\ \mathcal{H}_{0}\right).$$

Here the noise vector $\boldsymbol{\epsilon}$ is $\mathcal{N}(0,\boldsymbol{I}_{m})$ distributed under $\mathcal{H}_{0}$ . If the atoms are orthogonal (e.g., if they have disjoint support), the vector $\boldsymbol{z}^{m}$ is then normally distributed with zero mean and covariance matrix $\boldsymbol{D}^{m}(\boldsymbol{D}^{m})^{T}=\boldsymbol{I}_{m}$ . In this case, we can compute exactly the PFA as

$$\alpha_{m}=1-\Phi(\eta)^{m},\qquad(16)$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution. In practice, the dictionary is chosen to be highly coherent (as we want to track closely translated versions of a reference spectrum). Another way therefore has to be found to estimate or upper-bound this probability.
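For the orthogonal case, the PFA of the max test and the corresponding threshold can be evaluated directly, since the maximum of $m$ independent standard normal variables exceeds $\eta$ with probability $1-\Phi(\eta)^{m}$. The Python sketch below uses only the standard library; the function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # zero mean, unit variance


def pfa_orthogonal(eta, m):
    """PFA of the max test for m independent N(0, 1) statistics."""
    return 1.0 - _STD_NORMAL.cdf(eta) ** m


def threshold_orthogonal(alpha, m):
    """Threshold eta such that pfa_orthogonal(eta, m) equals alpha."""
    return _STD_NORMAL.inv_cdf((1.0 - alpha) ** (1.0 / m))


# Thresholds needed to keep a 5% PFA as the number of atoms grows.
thresholds = {m: threshold_orthogonal(0.05, m) for m in (1, 5, 15)}
```

In this uncorrelated setting, the threshold keeps growing with $m$, which is the cost that a highly coherent dictionary largely avoids.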

Proposition III.1.

For any $t\in\mathbb{R}$ and $m\geq 2$ , let $M_{m+1}(t)$ be defined recursively, under $\mathcal{H}_{0}$ , as

[TABLE]

with $M_{2}(t)=\Pr{(z^{2}_{1}\leq t,z^{2}_{2}\leq t)}$ . Under the aforementioned assumptions, an upper boundary of the PFA $\alpha_{m}$ is given by $1-M_{m}(\eta)$ .

Proof.

See Appendix -C. ∎

The interest of expression (17) is that the first factor of the right hand side and the initial value $M_{2}(t)$ can be evaluated numerically based on quadrature rules for trivariate and bivariate normal distribution functions [30], without the requirement for any Monte-Carlo approximation. Thus, this upper boundary can be easily and accurately computed. When the atoms are uncorrelated, this boundary is sharp and reduces to (16). Moreover this allows appreciation of, for a given threshold $\eta$ , the increase in the PFA $\alpha_{m}$ as a function of the dictionary size $m$ for highly correlated atoms. In a reciprocal way, this allows the evaluation of the threshold $\eta_{m}$ , which ensures a false alarm rate that is lower than a given $\alpha$ for any $m\geq 1$ .

We can now estimate roughly the potential detection gain under $\mathcal{H}_{1}$ as a function of $m$ , as follows: Under $\mathcal{H}_{1}$ , we have assumed that

[TABLE]

with a shift $u\sim\mathcal{U}([-\tau,\tau])$. Then, if we assume that the maximum is obtained for the closest atom, which is by assumption the most correlated with $\boldsymbol{d}^{u}_{*}$, the expected max-test statistic can be approximated by

[TABLE]

where $\Gamma(\cdot)$ is the autocorrelation function of $\boldsymbol{d}_{*}$ and $e_{m}\sim\mathcal{U}([0,\tau/(m-1)])$ is the shift between $\boldsymbol{d}^{u}_{*}$ and the closest atom.

Using this expected max-test value under $\mathcal{H}_{1}$ and the upper boundary on the false alarm given in proposition III.1, we can see in Figure 8 that when the dictionary size $m$ increases, the max-test statistic can still increase under $\mathcal{H}_{1}$. However, for a fixed control level $\alpha$, the test threshold (upper boundary) $\eta_{m}$ does not increase significantly above a certain size, e.g., $m\geq 10$ in Figure 8. This is clearly explained by the stronger correlations when adding new atoms. Conversely, if the atoms are uncorrelated, we see that the threshold deduced from (16) increases faster than the potential gain of the max-test statistic under $\mathcal{H}_{1}$.

This is confirmed in Figure 7, which shows empirical ROC curves for different sizes $m$ of the LSS dictionaries. It can be seen empirically that the more coherent the dictionary, the more powerful the max test becomes.

As a consequence, in the application, the dictionary will be built to be as coherent as possible for the spectral resolution of the MUSE instrument. The reference atom $\boldsymbol{d}_{*}$ is estimated by averaging the spectra of the 5 pixels at the spatial intensity peak of the galaxy. The spectrum is limited to a window of $l=30$ spectral bands centered on the spectral emission peak, which ensures the presence of the whole emission line feature. Based on astronomical priors, the spectral shift is limited to the interval $[-\tau,\tau]$ with $\tau=7$ MUSE spectral bands (i.e., $\tau\approx 9$ Å). Shifting is done at the spectral resolution of the instrument, to avoid any interpolation. The dictionary $\boldsymbol{D}^{m}$ is finally built with the atoms corresponding to these $m=15$ shifted versions of $\boldsymbol{d}_{*}$.

III-C2 Similarity measure

To test whether a given spectrum belongs to the extended source, the similarity measure used in this application is the SAD, as defined in (7). Of note, other metrics were explored to build the test statistic, such as the matched filter one (6) and the spectral information divergence defined in [31]. Spectral information divergence is built upon the symmetrical Kullback-Leibler divergence, and it compares the spectra as distribution densities. As it demands positive signals for its computation, it cannot be used directly for our problem, as the MUSE data can be negative due to high symmetrical noise levels. The matched filter approach can be used on the MUSE data, and it gives good results. However SAD appears to be more robust to some systematics of the MUSE data cubes, such as the edges where there is higher variability, and it is preferred here.

III-D Results on real data

The results on several subcubes of $n=50\times 50$ pixels by $l=30$ wavelengths centered around interesting objects of the HDFS catalog are shown in Figure 9 for the detection procedure described in Alg. 1. For each of the $n=50\times 50=2500$ spectra, the max-test statistics (8) are obtained from the SAD similarity measure, with a highly coherent dictionary constructed as described in the last paragraph of section III-C1. The empirical null estimators are computed on larger subcubes (centered around the subcube to be tested) that are composed of 200 by 200 spectra. For each object, the first column of the first row shows the narrow-band image around the emission line (the data subcube is summed along the $l=30$ spectral bands centered on the emission line peak). The second column of the first row shows the same narrow-band image, but after the different preprocessing steps (which include continuum subtraction and FSF convolution). The last column shows the reference spectrum $\boldsymbol{d}_{*}$, built from the pixels at the center of the studied object. On the second row, the first column shows the maps of the empirical $p$-values (14) obtained for the $n=2500$ spectra. The maps of the $q$-values are depicted in the second column. The $q$-values were introduced in [32] and can be seen as the FDR counterpart of the $p$-values. For each test statistic, the $q$-value is defined as the minimal FDR that allows this test statistic to be a discovery. In our detection framework, this is the minimal FDR control level $q\in[0,1]$ in Alg. 1 such that a given spectrum is detected as part of the halo. The interest of this global measure of significance is clear here, as it allows us to present more contrasted significance maps than the classical $p$-value maps. The third column shows the binary detection map provided by Alg. 1 for a nominal FDR control level $q=0.2$. The contours of the detection region for different values of $q$ are also superimposed.
From these maps, it can be seen that several of these objects show clear asymmetry, and that they extend beyond the simple support of a point source (the black circles show the support of an estimated FSF). Studies are currently being conducted by astronomers at the Centre de Recherche Astrophysique de Lyon to analyze these results and to apply the method to other sky fields.
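The $q$-value maps mentioned above can be obtained from the $p$-values with a short routine. The Python sketch below follows the usual step-up definition (the smallest BH level at which a given test is rejected); the function name and the optional `pi0` factor, which accounts for running BH at the corrected level $q/\hat{\pi}_{0}$, are assumptions of this illustration.

```python
def bh_qvalues(pvalues, pi0=1.0):
    """Smallest FDR level at which each test is rejected by step-up BH.

    q_i = min(1, pi0 * min over ranks k >= rank(i) of n * p_(k) / k).
    pi0 = 1 corresponds to plain BH; a smaller value mimics the corrected
    level q / pi0 (assumption of this sketch).
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Sweep from the largest p-value down, keeping the running minimum.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * pvalues[idx] * n / rank)
        qvalues[idx] = running_min
    return qvalues
```

Thresholding the resulting map at a level $q$ gives the same detections as running the BH step-up rule at that level, which is why contours for several values of $q$ can be drawn from a single $q$-value map.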

IV Conclusions

In this paper, a new method is proposed to address the problem of detecting a weak target signature that is partially known, but with a possibly large variability, within an unknown background that is difficult to model. To address this problem, an unsupervised detector was proposed, based on a maximum test approach, as studied in [13]. This detector explicitly takes the possible target variability into account by using a highly coherent dictionary. It does not need any knowledge of the background, only a simple noise symmetry assumption and the non-negativity of the sparse representation of the targeted signal. This makes it possible to estimate the test statistic distribution and to implement a simple detection procedure that is robust to model/background misspecification. Moreover, the error control was developed based on a false discovery rate approach, and a global measure of the significance was obtained. Such a control, with a detection threshold that adapts to the data, is not yet widely used in the signal processing community, whereas it is highly pertinent for processing massive datasets. The whole procedure was tested on real MUSE data. The promising results obtained are presently being analyzed by astronomers. Future extensions of this method will account for the existence of spatial structure of the target while controlling the FDR. The MUSE data used in this paper are now publicly available at http://muse-vlt.eu/science/hdfs-v1-0/, and the Python code of the proposed method is available on request.

Acknowledgments

The authors would like to thank the ERC 339659-MUSICOS for its funding, Roland Bacon and Floriane Leclercq for their expertise on the MUSE data, and Jean-Baptiste Courbot for numerous interesting exchanges on weak target detection.

-A Proof for the empirical null estimators

-A1 Proof of lemma II.2

Let $\boldsymbol{g}=\left(T_{\textrm{max}}(\boldsymbol{y}_{1}),\ldots,T_{\textrm{max}}(\boldsymbol{y}_{n})\right)$ be the set composed of the $n$ max statistics, the elements of which are denoted as $g_{i}$ for $1\leq i\leq n$. Similarly, $\boldsymbol{s}=\left(-T_{\textrm{min}}(\boldsymbol{y}_{1}),\ldots,-T_{\textrm{min}}(\boldsymbol{y}_{n})\right)$ is the set composed of the $n$ opposite min statistics, the elements of which are denoted as $s_{i}$ for $1\leq i\leq n$.

We first show that $\widehat{\mu}_{0}$ verifies (10), i.e., that $\#\{g_{i}\leq\widehat{\mu}_{0}\}=\#\{s_{i}>\widehat{\mu}_{0}\}$ . For absolutely continuous distributions, $\Pr(t_{(n)}=t_{(n+1)})=0$ . Thus from (11), we get that $\#\{t_{i}\leq\widehat{\mu}_{0}\}=n$ with probability one. The sample set $\boldsymbol{t}$ is the union of $\boldsymbol{g}$ and $\boldsymbol{s}$ : if $m_{0}=\#\{g_{i}\leq\widehat{\mu}_{0}\}$ , then $\#\{s_{i}\leq\widehat{\mu}_{0}\}=n-m_{0}$ . As a consequence, $\#\{s_{i}>\widehat{\mu}_{0}\}=n-(n-m_{0})=m_{0}=\#\{g_{i}\leq\widehat{\mu}_{0}\}$ , which shows that $\widehat{\mu}_{0}$ verifies (10).
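The counting argument above can be checked numerically. In the Python sketch below, $\widehat{\mu}_{0}$ is taken as the midpoint between the $n$th and $(n+1)$th order statistics of the pooled sample; this specific choice is an assumption (any value separating these two order statistics gives the same counts), and the function names and simulated samples are purely illustrative.

```python
import random


def empirical_null_location(g, s):
    """Midpoint between t_(n) and t_(n+1) of the pooled sample t = g U s.

    Assumed form of the estimator for this sketch; g and s have equal size n.
    """
    n = len(g)
    t = sorted(list(g) + list(s))
    return 0.5 * (t[n - 1] + t[n])


def balance_counts(g, s, mu):
    """Return (#{g_i <= mu}, #{s_i > mu}), which should coincide."""
    left = sum(1 for v in g if v <= mu)
    right = sum(1 for v in s if v > mu)
    return left, right


rng = random.Random(1)
g = [rng.gauss(1.0, 1.0) for _ in range(200)]  # stand-in max statistics
s = [rng.gauss(0.5, 1.0) for _ in range(200)]  # stand-in opposite min statistics
mu_hat = empirical_null_location(g, s)
left, right = balance_counts(g, s, mu_hat)
```

The equality of the two counts holds for any pair of equally sized samples with distinct values, regardless of their distributions, exactly as in the argument above.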

We show now that $\widehat{\mu}_{0}$ converges in probability toward $\mu_{0}$ . As $\widehat{\mu}_{0}$ satisfies (10), and $\overline{F}(t)$ (resp. $\overline{G}(t)$ ) converges in probability to $F(t)$ (resp. $G(t)$ ) for any $t\in\mathbb{R}$ , $\widehat{\mu}_{0}$ converges in probability to the solution of

[TABLE]

if this equation admits a unique solution. Assuming that the median of $F_{0}$ is uniquely defined, it follows that for $t>\mu_{0}$

[TABLE]

Moreover, for $t>\mu_{0}$

[TABLE]

where the first equality is due to zero assumption A4. As a consequence, there is no solution in $(\mu_{0},+\infty)$. Similarly, according to zero assumption A3, it follows that for $t<\mu_{0}$, $F(t)<G(t)$. Therefore the unique solution is $t=\mu_{0}$, where $F(\mu_{0})=\pi_{0}F_{0}(\mu_{0})=\pi_{0}/2=\pi_{0}G_{0}(\mu_{0})=G(\mu_{0})$, which concludes the proof. $\square$

-A2 Proof of proposition II.3

We first show that the $\pi_{0}$ estimator given in (12) is consistent. From lemma II.2, $\widehat{\mu}_{0}$ converges in probability toward $\mu_{0}$: $\widehat{\mu}_{0}\overset{P}{\longrightarrow}\mu_{0}$. The triangle inequality ensures that $|\overline{F}(\widehat{\mu}_{0})-F(\mu_{0})|\leq|\overline{F}(\widehat{\mu}_{0})-F(\widehat{\mu}_{0})|+|F(\widehat{\mu}_{0})-F(\mu_{0})|$. The first term of the right-hand side is dominated by $\sup_{t}|\overline{F}(t)-F(t)|$, which converges in probability toward $0$, according to assumption A5. The second term also converges in probability toward $0$, according to the continuous mapping theorem. Thus $\overline{F}(\widehat{\mu}_{0})\overset{P}{\longrightarrow}F(\mu_{0})$.

According to (9), $F(\mu_{0})=\pi_{0}F_{0}(\mu_{0})=\frac{\pi_{0}}{2}$ . As $2\overline{F}\left(\widehat{\mu}_{0}\right)=2\frac{n_{0}}{n}$ , this shows that

[TABLE]

Thus $\widehat{\pi}_{0}=\min\left\{\widetilde{\pi}_{0},1\right\}$ also converges in probability to $\pi_{0}$ .

We show now the consistency of (13) for $t\in\mathbb{R}$ . From (9) and assumption A5, it follows now that $\overline{F}(t)\overset{P}{\longrightarrow}\pi_{0}F_{0}(t)$ for all $t\leq\mu_{0}$ . Then, according to the Slutsky theorem, $\overline{F}(t)/\widetilde{\pi}_{0}\overset{P}{\longrightarrow}F_{0}(t)$ . As $\widehat{F}_{0}(t)=\overline{F}(t)/\widetilde{\pi}_{0}$ for all $t\leq\mu_{0}$ , this shows the consistency for $t\leq\mu_{0}$ . The demonstration for $t>\mu_{0}$ can be done in a similar manner, by noting that $\widehat{F}_{0}(t)=1-\overline{G}(t)/\widetilde{\pi}_{0}$ for $t>\mu_{0}$ . $\square$

-B Proof of proposition II.4

Let $T_{\textrm{max}}(\boldsymbol{y}_{(1)})\leq T_{\textrm{max}}(\boldsymbol{y}_{(2)})\leq\cdots\leq T_{\textrm{max}}(\boldsymbol{y}_{(n)})$ be the ordered max-test statistics, while $p_{(1)}\leq p_{(2)}\leq\cdots\leq p_{(n)}$ are the ordered $p$-values (while $p_{i}$ denotes the $p$-value associated with pixel $i$). From (14), it follows that

$$p_{(i)}=1-\widehat{F}_{0}\left(T_{\textrm{max}}(\boldsymbol{y}_{(j)})\right),$$

where $j=n-i+1$ . For $j\leq n_{0}$ , $T_{\textrm{max}}(\boldsymbol{y}_{(j)})\leq\mu_{0}$ thus $\#\left\{s_{0,i}\leq T_{\textrm{max}}(\boldsymbol{y}_{(j)})\right\}=j$ and $\#\left\{g_{0,i}\leq T_{\textrm{max}}(\boldsymbol{y}_{(j)})\right\}=0$ . So, $2n_{0}\widehat{F}_{0}\left(T_{\textrm{max}}(\boldsymbol{y}_{(j)})\right)=j$ for $j\leq n_{0}$ , that is $2n_{0}p_{(i)}=2n_{0}-n+i-1$ for $i\geq n-n_{0}+1$ . For $k\geq n_{0}$ , let $i_{k}=n-2n_{0}+1+k$ . Then $i_{k}\geq n-n_{0}+1$ so $2n_{0}p_{(i_{k})}=2n_{0}-n+i_{k}-1=k$ . Thus $\#\{2n_{0}p_{i}>k\}=n-i_{k}=2n_{0}-k-1$ . If $\zeta={k}/{2n_{0}}$ , then $\#\{p_{i}>\zeta\}=\#\{2n_{0}p_{i}>k\}$ . Thus $\frac{1+\#\{p_{i}>\zeta\}}{(1-\zeta)n}=\frac{2n_{0}-k}{(1-k/2n_{0})n}=\frac{2n_{0}}{n}$ , which shows that $\hat{\pi}_{0}^{*}(\zeta)=\widehat{\pi}_{0}$ .

In the general case where $\zeta\in[\frac{1}{2},1)$, $\zeta$ and $k_{\zeta}/2n_{0}$, where $k_{\zeta}=\left\lfloor{2n_{0}\zeta}\right\rfloor\in\{n_{0},\ldots,2n_{0}-1\}$, are asymptotically equivalent when $n_{0}$ grows to infinity. Thus, $\hat{\pi}_{0}^{*}(\zeta)$ and $\hat{\pi}_{0}^{*}(k_{\zeta}/2n_{0})=\widehat{\pi}_{0}$ are asymptotically equivalent. This concludes the proof. $\square$
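The counting argument for $\zeta=k/2n_{0}$ can be checked with exact rational arithmetic. The sketch below builds a toy set of $p$-values in which the $n_{0}$ largest satisfy $2n_{0}p_{(i)}=2n_{0}-n+i-1$, as in the proof; the remaining $p$-values are arbitrary values below $1/2$ chosen only for illustration, and the sizes `n` and `n0` are assumptions. The helper `storey_pi0` is a hypothetical name for the estimator $\frac{1+\#\{p_{i}>\zeta\}}{(1-\zeta)n}$.

```python
from fractions import Fraction


def storey_pi0(p_values, zeta):
    """Storey-type estimator (1 + #{p_i > zeta}) / ((1 - zeta) * n)."""
    n = len(p_values)
    exceed = sum(1 for p in p_values if p > zeta)
    return Fraction(1 + exceed) / ((1 - zeta) * n)


n, n0 = 50, 20  # illustrative sizes, with 2 * n0 <= n

# p-values of the n0 smallest statistics: (2*n0 - j) / (2*n0), j = 1..n0.
upper = [Fraction(2 * n0 - j, 2 * n0) for j in range(1, n0 + 1)]
# Remaining p-values: arbitrary values below 1/2 (assumption of this toy case).
lower = [Fraction(r, 4 * (n - n0)) for r in range(n - n0)]
p_values = upper + lower

# Evaluate the estimator on the grid zeta = k / (2*n0), k = n0..2*n0 - 1.
results = {k: storey_pi0(p_values, Fraction(k, 2 * n0)) for k in range(n0, 2 * n0)}
print(set(results.values()), Fraction(2 * n0, n))
```

On this grid every value of $\zeta$ gives the same estimate $2n_{0}/n$, which is the identity established above.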

-C Proof of proposition III.1

Under $\mathcal{H}_{0}$ , for a threshold $t$ we have:

[TABLE]

As $\boldsymbol{D}^{m}\geq 0$ and $\boldsymbol{y}\sim\mathcal{N}(0,\boldsymbol{I}_{m})$ under $\mathcal{H}_{0}$ , $\boldsymbol{z}^{m}$ is positively associated in the sense of [33]. Thus,

[TABLE]

Using the numerical procedures given in [30], we can accurately compute the right-hand side of (18). Note that this term gives a relatively sharp lower bound, because $z^{m+1}_{2}$ and $z^{m+1}_{3}$ are the variables most correlated with $z^{m+1}_{1}$ among the $z^{m+1}_{j}$ for $j\geq 2$.

Moreover, by construction, the shifts between the atoms in $\boldsymbol{D}^{m+1}$ are smaller than those for the atoms in $\boldsymbol{D}^{m}$ . With the autocorrelation function assumed to be non-increasing with the absolute shifts, it follows that the size $m$ Gaussian random vector $\left(z^{m+1}_{2},...,z^{m+1}_{m+1}\right)$ has larger correlations than the size $m$ Gaussian random vector $\left(z^{m}_{1},...,z^{m}_{m}\right)$ . By assumption, these two vectors are centered with unit marginal variances under $\mathcal{H}_{0}$ . Thus the Slepian lemma [34] yields that:

[TABLE]

By combining (18) and (19), we can then lower-bound $\Pr{(\max\boldsymbol{z}^{m}\leq t)}$ by a function $M_{m}(t)$ that is defined recursively as:

[TABLE]

where $M_{2}(t)=\Pr{(z^{2}_{1}\leq t,z^{2}_{2}\leq t)}$. This gives the upper bound on the PFA stated in proposition III.1. Numerical computations indicate that $M_{m}(t)$ increases with $t$. It is therefore possible to invert $M_{m}$ numerically to obtain $\eta_{m}$ for a control level $\alpha$: $\eta_{m}=M_{m}^{-1}(1-\alpha)$ satisfies $\Pr(\max\boldsymbol{z}^{m}>\eta_{m})\leq\alpha$. $\square$
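The numerical inversion step can be sketched with a simple bisection. Because the recursion defining $M_{m}$ is not reproduced here, the code uses a stand-in: the product bound $\Phi(t)^{m}$, which is also a valid (but looser) lower bound of $\Pr(\max\boldsymbol{z}^{m}\leq t)$ for positively associated standard Gaussian variables. It is an assumption of this example and not the bound of proposition III.1; the names `product_bound` and `invert_increasing` are hypothetical.

```python
import math


def std_normal_cdf(t):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def product_bound(t, m):
    """Stand-in for M_m(t): Phi(t)**m, a lower bound of Pr(max z <= t)
    under positive association (looser than the recursive bound)."""
    return std_normal_cdf(t) ** m


def invert_increasing(func, target, lo=-10.0, hi=10.0, tol=1e-10):
    """Bisection for an increasing function: returns a point hi such that
    func(hi) >= target, within tol of the crossing point."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi


alpha = 0.05  # control level
m = 10        # illustrative dictionary size
eta_m = invert_increasing(lambda t: product_bound(t, m), 1.0 - alpha)
eta_1 = invert_increasing(lambda t: product_bound(t, 1), 1.0 - alpha)
print(f"eta_1 = {eta_1:.4f}, eta_{m} = {eta_m:.4f}")
```

Returning the upper end of the bracket keeps the bound at or above $1-\alpha$, so the resulting threshold stays on the conservative side. Substituting the actual $M_{m}$ for `product_bound` would give the sharper threshold described in the proof.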

Bibliography


[1] R. Bacon, M. Accardo, L. Adjali, H. Anwand, S. Bauer, I. Biswas, J. Blaizot, D. Boudon, S. Brau-Nogue, J. Brinchmann et al., “The MUSE second-generation VLT instrument,” in SPIE Astronomical Telescopes+Instrumentation. International Society for Optics and Photonics, 2010, pp. 773508–773508.
[2] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 289–300, 1995.
[3] D. Manolakis, R. Lockwood, T. Cooley, and J. Jacobson, “Is there a best hyperspectral detection algorithm?” in SPIE Defense, Security, and Sensing. International Society for Optics and Photonics, 2009, pp. 733402–733402.
[4] I. S. Reed and X. Yu, “Adaptive multiple-band CFAR detection of an optical pattern with unknown spectral distribution,” Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 38, no. 10, pp. 1760–1770, 1990.
[5] D. G. Manolakis, G. A. Shaw, and N. Keshava, “Comparative analysis of hyperspectral adaptive matched filter detectors,” in AeroSense 2000. International Society for Optics and Photonics, 2000, pp. 2–17.
[6] L. L. Scharf and L. T. McWhorter, “Adaptive matched subspace detectors and adaptive coherence estimators,” in Signals, Systems and Computers, 1996. Conference Record of the Thirtieth Asilomar Conference on. IEEE, 1996, pp. 1114–1117.
[7] J. Solomon and B. Rock, “Imaging spectrometry for earth remote sensing,” Science, vol. 228, no. 4704, pp. 1147–1152, 1985.
[8] Y. Chen, N. M. Nasrabadi, and T. D. Tran, “Sparse representation for target detection in hyperspectral imagery,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 3, pp. 629–640, 2011.
