Detecting new signals under background mismodelling

Sara Algeri

arXiv:1906.06615·physics.data-an·January 15, 2020

TL;DR

This paper introduces a unified statistical approach to detect new signals in astrophysics experiments, effectively handling background mismodelling uncertainties to improve sensitivity and reduce false discoveries.

Contribution

It presents a nonparametric method that incorporates partial scientific knowledge and updates background models without relying on prior distributions.

Findings

01. Method improves detection sensitivity under background uncertainties

02. Application to dark matter searches demonstrates robustness

03. Handles violations of classical distributional assumptions

Abstract

Searches for new astrophysical phenomena often involve several sources of non-random uncertainties which can lead to highly misleading results. Among these, model uncertainty arising from background mismodelling can dramatically compromise the sensitivity of the experiment under study. Specifically, overestimating the background distribution in the signal region increases the chances of missing new physics. Conversely, underestimating the background outside the signal region leads to an artificially enhanced sensitivity and a higher likelihood of claiming false discoveries. The aim of this work is to provide a unified statistical strategy to perform modelling, estimation, inference, and signal characterization under background mismodelling. The method proposed makes it possible to incorporate the (partial) scientific knowledge available on the background distribution and provides a data-updated…

Tables (2)

Table 1. Comparison of the deviance test and classical inferential tools. The first two columns report the p-values of the Anderson-Darling [anderson] and Cramér-von Mises [darling] goodness-of-fit tests, obtained by assuming as theoretical distribution the same $G$ indicated in Sections V.1 and V.2 for the calibration phase and for Cases I, II and III, respectively. The third column reports the raw deviance p-values, with their post-selection adjusted counterparts in parentheses. Finally, the fourth and fifth columns report, respectively, the p-values of the Kolmogorov-Smirnov [darling] and Wilcoxon rank sum [wilcoxon] tests used to compare the physics samples in Cases I, II and III directly with the source-free sample used in Section V.1.

Sample	Anderson-Darling (goodness-of-fit)	Cramér-von Mises (goodness-of-fit)	Deviance, raw (adjusted) (goodness-of-fit)	Kolmogorov-Smirnov (two-sample)	Wilcoxon rank sum (two-sample)
Calibration	$1.2 \cdot 10^{-7}$	$4.2 \cdot 10^{-7}$	$3.2 \cdot 10^{-12}$ ($6.4 \cdot 10^{-11}$)	-	-
Case I	$0.7776$	$0.7711$	$0.2657$ ($> 1$)	$0.9248$	$0.5487$
Case II	$4.6 \cdot 10^{-7}$	$8.2 \cdot 10^{-11}$	$9.0 \cdot 10^{-33}$ ($1.8 \cdot 10^{-31}$)	$1.9 \cdot 10^{-13}$	$4.5 \cdot 10^{-12}$
Case III	$4.6 \cdot 10^{-7}$	$2.6 \cdot 10^{-10}$	$2.6 \cdot 10^{-28}$ ($5.2 \cdot 10^{-27}$)	$2.1 \cdot 10^{-15}$	$2.2 \cdot 10^{-16}$

Table 2. Model selection and inference for the toy example in Section V. The second column reports the $M$ and $k^{*}_{M}$ values selected as in (32) and (48), respectively. The "Deviance p-value" column collects the unadjusted deviance p-values for the full and denoised solutions. The Bonferroni-adjusted p-values, computed as in (33) and (49), are reported in the last column. The correction terms applied correspond to $M_{\max}=20$ for the full solution and $M_{tot}=M_{\max}+\frac{M(M-1)}{2}$ for the denoised solution.

Sample	$M$, $k_{M}^{*}$ selected	Method	Deviance p-value	Adjusted p-value
Toy example, calibration	$M = 2$	Full	$p(2) = 3.199 \cdot 10^{-12}$	$20 \cdot p(2) = 6.397 \cdot 10^{-11}$
Toy example, calibration	$k_{2}^{*} = 2$	Denoised	$p(2) = 3.199 \cdot 10^{-12}$	$21 \cdot p(2) = 6.717 \cdot 10^{-11}$
Toy example, Case I	$M = 18$	Full	$p(18) = 0.2657$	$20 \cdot p(18) > 1$
Toy example, Case I	$k_{18}^{*} = 2$	Denoised	$p(2) = 5.096 \cdot 10^{-4}$	$173 \cdot p(2) = 0.0882$
Toy example, Case II	$M = 4$	Full	$p(4) = 8.994 \cdot 10^{-33}$	$20 \cdot p(4) = 1.799 \cdot 10^{-31}$
Toy example, Case II	$k_{4}^{*} = 4$	Denoised	$p(4) = 8.994 \cdot 10^{-33}$	$26 \cdot p(4) = 2.338 \cdot 10^{-31}$
Toy example, Case III	$M = 9$	Full	$p(9) = 2.590 \cdot 10^{-28}$	$20 \cdot p(9) = 5.181 \cdot 10^{-27}$
Toy example, Case III	$k_{9}^{*} = 6$	Denoised	$p(6) = 4.457 \cdot 10^{-30}$	$56 \cdot p(6) = 2.496 \cdot 10^{-28}$
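The correction terms quoted in the caption of Table 2 can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function names `total_tests` and `bonferroni` are hypothetical, and the input p-value is the rounded calibration entry of the table, so the last digit of the result may differ slightly from the tabulated one.

```python
def total_tests(m_max: int, m: int) -> int:
    """Correction term M_tot = M_max + M(M-1)/2 quoted in the caption of Table 2."""
    return m_max + m * (m - 1) // 2


def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni product n_tests * p; the table reports products above 1 as '> 1'."""
    return n_tests * p_value


M_MAX = 20
p_calibration = 3.199e-12  # rounded unadjusted deviance p-value, calibration row (M = 2)

adj_full = bonferroni(p_calibration, M_MAX)                      # full solution
adj_denoised = bonferroni(p_calibration, total_tests(M_MAX, 2))  # denoised solution
print(f"full: {adj_full:.3e}, denoised: {adj_denoised:.3e}")
```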

Equations (149)

$$f(x)=g(x)\,d(G(x);G,F) \tag{1}$$

$$d(u;G,F)=\frac{f(G^{-1}(u))}{g(G^{-1}(u))}\quad\text{with } 0\leq u\leq 1 \tag{2}$$

$$D(u)=F\bigl(G^{-1}(u)\bigr)=\int_{0}^{u}d(v;G,F)\,\partial v \tag{3}$$

$$d(u;G,F)=1+\sum_{j>0}LP_{j}\,Leg_{j}(u) \tag{4}$$

$$\widehat{LP}_{j}=\frac{1}{n}\sum_{i=1}^{n}Leg_{j}(u_{i}) \tag{5}$$

$$\widehat{LP}_{j}=\int_{0}^{1}Leg_{j}(u)\,\partial\tilde{D}(u)=\int_{0}^{1}Leg_{j}(u)\,\partial\tilde{F}(G^{-1}(u)) \tag{6}$$

$$E[\widehat{LP}_{j}]=LP_{j},\quad V(\widehat{LP}_{j})=\frac{\sigma_{j}^{2}}{n}\quad\text{and}\quad Cov(\widehat{LP}_{j},\widehat{LP}_{k})=\frac{\sigma_{jk}}{n} \tag{7}$$

$$E[\widehat{LP}_{j}]=0,\quad V(\widehat{LP}_{j})=\frac{1}{n}\quad\text{and}\quad Cov(\widehat{LP}_{j},\widehat{LP}_{k})=0 \tag{8}$$

$$\widehat{d}(u;G,F)=1+\sum_{j=1}^{M}\widehat{LP}_{j}\,Leg_{j}(u) \tag{9}$$

$$V\bigl[\widehat{d}(u;G,F)\bigr]=\sum_{j=1}^{M}\frac{\sigma^{2}_{j}}{n}Leg^{2}_{j}(u)+2\sum_{j<k}\frac{\sigma_{jk}}{n}Leg_{j}(u)Leg_{k}(u) \tag{10}$$

$$\widehat{\sigma}_{j}^{2}=\frac{1}{n}\sum_{i=1}^{n}\bigl(Leg_{j}(u_{i})-\widehat{LP}_{j}\bigr)^{2},\qquad \widehat{\sigma}_{jk}=\frac{1}{n}\sum_{i=1}^{n}Leg_{j}(u_{i})Leg_{k}(u_{i})-\widehat{LP}_{j}\widehat{LP}_{k}$$

$$\widehat{f}(x)=g(x)\,\widehat{d}(G(x);G,F) \tag{11}$$

$$\widehat{LP}_{2}=\frac{1}{n}\sum_{i=1}^{n}Leg_{2}(u_{i})=\sqrt{5}\bigl(6\widehat{\mu}_{2}-6\widehat{\mu}_{1}+1\bigr) \tag{12}$$

$$MISE=\sum_{j=1}^{M}\frac{\sigma_{j}^{2}}{n}+\sum_{j>M}LP_{j}^{2} \tag{13}$$

$$IBS=\int_{0}^{1}\Bigl(E\bigl[\widehat{d}(u;G,F)\bigr]-d(u;G,F)\Bigr)^{2}\partial u \tag{14}$$

$$IBS=\int_{0}^{1}\biggl(\frac{f(G^{-1}(u))-g(G^{-1}(u))}{g(G^{-1}(u))}\biggr)^{2}\partial u-\sum_{j=1}^{M}LP^{2}_{j} \tag{15}$$

$$f_{b}(x)=\frac{e^{-\frac{1}{2}\bigl(\frac{x-55}{15}\bigr)^{2}}}{k_{fb}} \tag{16}$$

$$g_{b}(x)=\frac{9.52-2.22x+0.15x^{2}}{k_{gb}} \tag{17}$$

$$f_{s}(x)=\frac{e^{-\frac{1}{2}\bigl(\frac{x-25}{4.5}\bigr)^{2}}}{k_{fs}} \tag{18}$$

$$\widehat{f}_{b}(x)=g_{b}(x)\,\widehat{d}(G_{b}(x);G_{b},F_{b}) \tag{19}$$

$$\widehat{d}(G_{b}(x);G_{b},F_{b})=1+0.063\,Leg_{1}[G_{b}(x)]-0.082\,Leg_{2}[G_{b}(x)] \tag{20}$$

$$H_{0}:d(u;G,F)=1\ \text{for all } u\in[0,1]\quad\text{vs}\quad H_{1}:d(u;G,F)\neq 1\ \text{for some } u\in[0,1] \tag{21}$$

$$H_{0}:\sum_{j=1}^{M}LP_{j}^{2}=0\quad\text{vs}\quad H_{1}:\sum_{j=1}^{M}LP_{j}^{2}>0 \tag{22}$$

$$D_{M}=n\sum_{j=1}^{M}\widehat{LP}_{j}^{2} \tag{23}$$

$$\sqrt{n}\,\widehat{LP}_{j}\xrightarrow{d}N(0,1) \tag{24}$$

$$P(D_{M}>d_{M})\xrightarrow[n\to\infty]{}P(\chi_{M}^{2}>d_{M}) \tag{25}$$

$$1-\alpha=P\bigl(-c_{\alpha}\leq\widehat{d}(u;G,F)-1\leq c_{\alpha},\ \text{for all } u\in[0,1]\mid H_{0}\bigr)=P\Bigl(\max_{u}\bigl|\widehat{d}(u;G,F)-1\bigr|\leq c_{\alpha}\mid H_{0}\Bigr) \tag{26}$$

$$SE\bigl[\widehat{d}(u;G,F)\mid H_{0}\bigr]=\sqrt{\sum_{j=1}^{M}\frac{1}{n}Leg^{2}_{j}(u)} \tag{27}$$

$$\frac{\widehat{d}(u;G,F)-1}{\sqrt{\sum_{j=1}^{M}\frac{1}{n}Leg_{j}^{2}(u)}}\xrightarrow{d}N(0,1) \tag{28}$$

$$\Biggl[1-c_{\alpha}\sqrt{\sum_{j=1}^{M}\frac{1}{n}Leg_{j}^{2}(u)},\ 1+c_{\alpha}\sqrt{\sum_{j=1}^{M}\frac{1}{n}Leg_{j}^{2}(u)}\Biggr] \tag{29}$$

Code & Models

Repositories

Yorkee2018/LPBkg

Full text

The author declares no conflict of interest.

Detecting new signals under background mismodelling

Sara Algeri

School of Statistics, University of Minnesota, Minneapolis (MN), 55455, USA

Abstract

Searches for new astrophysical phenomena often involve several sources of non-random uncertainties which can lead to highly misleading results. Among these, model uncertainty arising from background mismodelling can dramatically compromise the sensitivity of the experiment under study. Specifically, overestimating the background distribution in the signal region increases the chances of missing new physics. Conversely, underestimating the background outside the signal region leads to an artificially enhanced sensitivity and a higher likelihood of claiming false discoveries. The aim of this work is to provide a unified statistical strategy to perform modelling, estimation, inference, and signal characterization under background mismodelling. The method proposed makes it possible to incorporate the (partial) scientific knowledge available on the background distribution and provides a data-updated version of it in a purely nonparametric fashion, without requiring the specification of prior distributions on the unknown parameters. Applications in the context of dark matter searches and radio surveys show how the tools presented in this article can be used to incorporate non-stochastic uncertainty due to instrumental noise and to overcome violations of classical distributional assumptions in stacking experiments.

PACS: 02.30.Nw, 02.70.Rr, 03.65.Db, 06.20.Dk, 07.05.Kf, 12.40.Ee, 12.60.-i, 14.80.Cp, 98.70.Vc.

I Introduction

When searching for new physics, a discovery claim is made if the data collected by the experiment provides sufficient statistical evidence in favor of the new phenomenon. If the background and signal distributions are specified correctly, this can be done by means of statistical tests of hypothesis, upper limits and confidence intervals.

The problem. In practice, even if a reliable description of the signal distribution is available, providing accurate background models may be challenging, as the behavior of the sources which contribute to the background is often poorly understood. Some examples include searches for nuclear recoils of weakly interacting massive particles over electron recoil backgrounds [aprile18, agnese18], searches for gravitational-wave signals over non-Gaussian backgrounds from stellar-mass binary black holes [smith18], and searches for a standard model-like Higgs boson over prompt diphoton production [CMS18].

Unfortunately, model uncertainty due to background mismodelling can significantly compromise the sensitivity of the experiment under study. Specifically, overestimating the background distribution in the signal region increases the chances of missing new physics. Conversely, underestimating the background outside the signal region leads to an artificially enhanced sensitivity, which can easily result in false discovery claims. Several methods have been proposed in the literature to address this problem [e.g., yellin, Priel, dauncey]. However, to the best of the author’s knowledge, none of the methods available provides a unified strategy to (i) assess the validity of existing models for the background, (ii) fully characterize the background distribution, (iii) perform signal detection even if the signal distribution is not available, (iv) characterize the signal distribution, and (v) detect additional signals of new unexpected sources.

Goal. The aim of this work is to integrate modelling, estimation, and inference under background mismodelling and provide a general statistical methodology to perform (i)-(v). As a brief overview, given a source-free sample and the (partial) scientific knowledge available on the background distribution, a data-updated version of it is obtained in a purely nonparametric fashion without requiring the specification of prior distributions on the unknown parameters. At this stage, a graphical tool is provided in order to assess if and where significant deviations between the true and the postulated background distributions occur. The “updated” background distribution is then used to assess whether the distribution of the data collected by the experiment deviates significantly from the background model. In this case too, it is possible to assess graphically how the data distribution deviates from the expected background model. If a source-free sample is available, or if control regions can be identified, the solution proposed does not require the specification of a model for the signal; however, if the signal distribution is known (up to some free parameters), the latter can be used to further improve the accuracy of the analysis and to detect the signals of unexpected new sources. Finally, the method can be easily adjusted to cover situations in which a source-free sample or control regions are not available, the background is unknown or incorrectly specified, but a functional form of the signal distribution is known.

The key to the solution. The statistical methodologies involved rely on the novel LP approach to statistical modelling first introduced by Mukhopadhyay and Parzen in 2014 [LPapproach]. As will become clearer later on in the paper, the letter L typically denotes robust nonparametric methods based on quantiles, whereas P stands for polynomials [ksamples, Supp S1]. This approach allows the unification of many of the standard results of classical statistics by expressing them in terms of quantiles and comparison distributions, and provides a simple and powerful framework for statistical learning and data analysis. The interested reader is directed to [LPmode, LPBayes, LPtime, LPFdr, LPdec] and references therein for recent advancements in mode detection, nonparametric time series, goodness-of-fit on prior distributions, and large-scale inference using an LP approach.

Organization. Section II is dedicated to a review of the main constructs of LP modelling. Section III highlights the practical advantages offered by modelling background distributions using an LP approach. Section IV introduces a novel LP-based framework for statistical inference. Section V outlines the main steps of a data-scientific approach for signal detection and characterization. In Section VI, the methods proposed are applied in the context of dark matter searches where the goal is to distinguish $\gamma$-ray emissions due to dark matter from those due to pulsars. In Section VII, the tools discussed are applied to a simulation of the Fermi Large Area Telescope $\gamma$-ray telescope and it is shown how upper limits and Brazil plots can be constructed by means of comparison distributions. Section VIII is dedicated to model-denoising. Section IX presents an application to data from the NVSS astronomical survey and discusses a simple approach to assess the validity of distributional assumptions on the polarized intensity in stacking experiments. A discussion of the main results and extensions is given in Section X.

II LP Approach to Statistical Modelling

The LP Approach to Statistical Modelling [LPapproach] is a novel statistical approach which provides an ideal framework to simultaneously assess the validity of the scientific knowledge available and fill the gap between the initial scientific belief and the evidence provided by the data. Sections II.1, II.2 and II.3 below introduce the LP modelling framework, whereas Section III discusses how the problem of background mismodelling can be formulated under this paradigm.

II.1 The skew-G density model

Let $X$ be a continuous random variable with cumulative distribution function (cdf) $F(x)$ and probability density function (pdf) $f(x)$. Since $F$ is the true distribution of the data, it is typically unknown. However, suppose a suitable cdf $G(x)$ is available, and let $g(x)$ be the respective pdf. In order to understand whether $G$ is a good candidate for $F$, it is convenient to express the relationship between the two in a concise manner.

The skew-G density model [LPapproach, LPmode] is a universal representation scheme which allows any pdf $f(x)$ to be expressed as

$$f(x)=g(x)\,d(G(x);G,F), \tag{1}$$

where $d(u;G,F)$ is called the comparison density [manny2] and is such that

$$d(u;G,F)=\frac{f(G^{-1}(u))}{g(G^{-1}(u))}\quad\text{with } 0\leq u\leq 1, \tag{2}$$

with $u=G(x)$ and $G^{-1}(u)=\inf\{x:G(x)\geq u\}$ denoting the “postulated” quantile function of $X$. The comparison density is the pdf of the random variable $U=G(X)$, whereas its cdf is given by

$$D(u)=F\bigl(G^{-1}(u)\bigr)=\int_{0}^{u}d(v;G,F)\,\partial v, \tag{3}$$

and is called the comparison distribution.

Practical remarks. Equations (2) and (3) are fundamental to understanding the power of a statistical modelling approach based on the comparison density. Specifically, $d(u;G,F)$ “connects” any given pdf $g$ to the true pdf $f$ through the quantile transformation $G^{-1}$ of $u$. Furthermore, $g\equiv f$ if and only if $d(u;G,F)=1$ for all $u\in[0,1]$, i.e., $U$ is uniformly distributed over the interval $[0,1]$. If instead $g\not\equiv f$, $d(u;G,F)$ models the departure of the true density $f(x)$ from the postulated model $g(x)$. Consequently, an adequate estimate of $d(u;G,F)$ not only leads to an estimate of the true $f(x)$ based on (1), but also makes it possible to identify the regions where $f(x)$ deviates substantially from $g(x)$.
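To make the role of the comparison density concrete, the following minimal sketch evaluates $d(u;G,F)$ from its definition in (2) for two known distributions. The choice of two Gaussians and the helper name `comparison_density` are assumptions made purely for illustration; in practice $F$ is unknown and $d$ has to be estimated.

```python
import numpy as np
from scipy import stats


def comparison_density(u, f_dist, g_dist):
    """d(u; G, F) = f(G^{-1}(u)) / g(G^{-1}(u)) for frozen scipy.stats distributions."""
    x = g_dist.ppf(u)  # postulated quantile function G^{-1}(u)
    return f_dist.pdf(x) / g_dist.pdf(x)


u = np.linspace(0.01, 0.99, 99)
g = stats.norm(loc=0.0, scale=1.0)  # postulated model (illustrative choice)
f = stats.norm(loc=0.3, scale=1.0)  # "true" model (illustrative choice)

d_same = comparison_density(u, g, g)  # identically one when f and g coincide
d_diff = comparison_density(u, f, g)  # departs from one where f differs from g

# Skew-G identity (1): f(x) = g(x) d(G(x); G, F), checked at a few points.
x = np.array([-1.0, 0.0, 1.5])
f_rebuilt = g.pdf(x) * comparison_density(g.cdf(x), f, g)
```

Here `d_diff` is below one for small `u` and above one for large `u`, reflecting that the illustrative "true" density is shifted to the right of the postulated one.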

II.2 LP skew-G series representation

Denote with $L_{2}[0,1]$ the Hilbert space of square integrable functions on the unit interval with respect to the measure $G$. A complete, orthonormal basis of functions in $L_{2}[0,1]$ can be constructed by considering powers of $G(x)$, i.e., $G(x),G^{2}(x),G^{3}(x),\dots$, and orthonormalizing them via the Gram-Schmidt procedure [LPmode]. The resulting basis functions can equivalently be expressed as normalized shifted Legendre polynomials, namely $Leg_{j}(u)$, with $u=G(x)$. (Footnote 1: Classical Legendre polynomials are defined over $[-1,1]$; here, their “shifted” counterpart over the range $[0,1]$ is considered. The first three normalized shifted Legendre polynomials are $Leg_{0}(u)=1$, $Leg_{1}(u)=\sqrt{12}(u-0.5)$ and $Leg_{2}(u)=\sqrt{5}(6u^{2}-6u+1)$.)

Under the assumption that (2) is a square integrable function on $[0,1]$, i.e., $d\in L_{2}[0,1]$, we can then represent $d(u;G,F)$ via a series of $\{Leg_{j}(u)\}_{j\geq 0}$ polynomials, i.e.,

$$d(u;G,F)=1+\sum_{j>0}LP_{j}\,Leg_{j}(u), \tag{4}$$

with coefficients $LP_{j}=\int_{0}^{1}Leg_{j}(u)d(u;G,F)\partial u$. The representation in (4) is called the LP skew-G series representation [LPmode].
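The basis functions can be generated from the classical Legendre polynomials $P_{j}$ through the standard relation $Leg_{j}(u)=\sqrt{2j+1}\,P_{j}(2u-1)$. The sketch below, in which the helper name `leg` is an assumption introduced for illustration, builds them with NumPy and checks orthonormality on $[0,1]$ by Gauss-Legendre quadrature.

```python
import numpy as np
from numpy.polynomial import Legendre


def leg(j, u):
    """Normalized shifted Legendre polynomial Leg_j(u) on [0, 1]."""
    u = np.asarray(u, dtype=float)
    return np.sqrt(2 * j + 1) * Legendre.basis(j)(2.0 * u - 1.0)


# Gauss-Legendre nodes mapped from [-1, 1] to [0, 1]; exact for these polynomial products.
nodes, weights = np.polynomial.legendre.leggauss(20)
u_nodes, w = 0.5 * (nodes + 1.0), 0.5 * weights

M = 4
# Gram matrix of inner products <Leg_j, Leg_k>; it should be the identity.
gram = np.array([[np.sum(w * leg(j, u_nodes) * leg(k, u_nodes))
                  for k in range(M + 1)] for j in range(M + 1)])
```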

II.3 LP density estimate

Let $x_{1},\dots,x_{n}$ be a sample of independent and identically distributed (i.i.d.) observations from $X$. Observations from $U$ are given by $u_{1}=G(x_{1}),\dots,u_{n}=G(x_{n})$. The $LP_{j}$ coefficients in (4) can then be estimated via

$$\widehat{LP}_{j}=\frac{1}{n}\sum_{i=1}^{n}Leg_{j}(u_{i}). \tag{5}$$

Alternatively, in virtue of (3), the estimates $\widehat{LP}_{j}$ can also be specified as

$$\widehat{LP}_{j}=\int_{0}^{1}Leg_{j}(u)\,\partial\tilde{D}(u)=\int_{0}^{1}Leg_{j}(u)\,\partial\tilde{F}(G^{-1}(u)), \tag{6}$$

where $\tilde{F}$ and $\tilde{D}$ denote the empirical distribution of the samples $x_{1},\dots,x_{n}$ and $u_{1},\dots,u_{n}$, respectively.
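The estimator of the $LP_{j}$ coefficients is simply a sample mean of the basis functions evaluated at $u_{i}=G(x_{i})$. A minimal sketch follows, assuming a toy setting chosen for illustration only: the true pdf is $f(x)=2x$ on $[0,1]$ and the postulated model is uniform, so that $d(u)=2u$, $LP_{1}=1/\sqrt{3}$ and $LP_{2}=0$. The helper names `leg` and `lp_estimates` are hypothetical.

```python
import numpy as np
from numpy.polynomial import Legendre


def leg(j, u):
    """Normalized shifted Legendre polynomial Leg_j(u) on [0, 1]."""
    return np.sqrt(2 * j + 1) * Legendre.basis(j)(2.0 * np.asarray(u, dtype=float) - 1.0)


def lp_estimates(x, G, M):
    """Sample means of Leg_j(G(x_i)) for j = 1, ..., M."""
    u = G(np.asarray(x, dtype=float))
    return np.array([leg(j, u).mean() for j in range(1, M + 1)])


rng = np.random.default_rng(0)
x = rng.beta(2.0, 1.0, size=20_000)  # toy "true" pdf f(x) = 2x on [0, 1]
G = lambda t: t                      # postulated cdf: Uniform(0, 1)

lp_hat = lp_estimates(x, G, M=2)     # should be close to (1/sqrt(3), 0)
```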

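The moments of the $\widehat{LP}_{j}$ are

$$E[\widehat{LP}_{j}]=LP_{j},\quad V(\widehat{LP}_{j})=\frac{\sigma_{j}^{2}}{n}\quad\text{and}\quad Cov(\widehat{LP}_{j},\widehat{LP}_{k})=\frac{\sigma_{jk}}{n}, \tag{7}$$

where $\sigma^{2}_{j}=\int_{0}^{1}(Leg_{j}(u)-LP_{j})^{2}d(u;G,F)\partial u$ and $\sigma_{jk}=\int_{0}^{1}(Leg_{j}(u)-LP_{j})(Leg_{k}(u)-LP_{k})d(u;G,F)\partial u$. When $f\equiv g$, the equalities in (7) reduce to

$$E[\widehat{LP}_{j}]=0,\quad V(\widehat{LP}_{j})=\frac{1}{n}\quad\text{and}\quad Cov(\widehat{LP}_{j},\widehat{LP}_{k})=0, \tag{8}$$

for all $j\neq k$. Derivations of (7) and (8) are discussed in Appendix A.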
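As an informal sketch of why the moments simplify when $f\equiv g$ (this is not a reproduction of Appendix A): in that case $U$ is uniform on $[0,1]$, and the orthonormality of the basis, together with $Leg_{0}(u)=1$, gives for $j,k\geq 1$ and $j\neq k$

$$
\begin{aligned}
E[Leg_{j}(U)] &= \int_{0}^{1}Leg_{j}(u)\,Leg_{0}(u)\,\partial u = 0,\\
V[Leg_{j}(U)] &= \int_{0}^{1}Leg_{j}^{2}(u)\,\partial u - 0 = 1,\\
Cov[Leg_{j}(U),Leg_{k}(U)] &= \int_{0}^{1}Leg_{j}(u)\,Leg_{k}(u)\,\partial u = 0.
\end{aligned}
$$

Since each $\widehat{LP}_{j}$ is the average of $n$ i.i.d. terms $Leg_{j}(u_{i})$, the mean is unchanged while the variance and covariance are divided by $n$.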

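If (4) is approximated by the first $M+1$ terms (footnote 2: recall that the first normalized shifted Legendre polynomial is $Leg_{0}(u)=1$), an estimate of the comparison density is given by

$$\widehat{d}(u;G,F)=1+\sum_{j=1}^{M}\widehat{LP}_{j}\,Leg_{j}(u), \tag{9}$$

with variance

$$V\bigl[\widehat{d}(u;G,F)\bigr]=\sum_{j=1}^{M}\frac{\sigma^{2}_{j}}{n}Leg^{2}_{j}(u)+2\sum_{j<k}\frac{\sigma_{jk}}{n}Leg_{j}(u)Leg_{k}(u). \tag{10}$$

See Appendix B for more details on the derivation of (10). The standard error of $\widehat{d}(u;G,F)$ corresponds to the square root of (10), with $\sigma^{2}_{j}$ and $\sigma_{jk}$ estimated by their sample counterparts, i.e.,

$$\widehat{\sigma}_{j}^{2}=\frac{1}{n}\sum_{i=1}^{n}\bigl(Leg_{j}(u_{i})-\widehat{LP}_{j}\bigr)^{2},\qquad \widehat{\sigma}_{jk}=\frac{1}{n}\sum_{i=1}^{n}Leg_{j}(u_{i})Leg_{k}(u_{i})-\widehat{LP}_{j}\widehat{LP}_{k}.$$

Finally, in virtue of the skew-G density model in (1), we can estimate $f(x)$ as

$$\widehat{f}(x)=g(x)\,\widehat{d}(G(x);G,F). \tag{11}$$

Since each $Leg_{j}(u)$ is a polynomial function of the random variable $U$, each $\widehat{LP}_{j}$ estimate can be expressed as a linear combination of the first $j$ sample moments of $U$, e.g.,

$$\widehat{LP}_{2}=\frac{1}{n}\sum_{i=1}^{n}Leg_{2}(u_{i})=\sqrt{5}\bigl(6\widehat{\mu}_{2}-6\widehat{\mu}_{1}+1\bigr), \tag{12}$$

where $\widehat{\mu}_{2}=\frac{1}{n}\sum_{i=1}^{n}u_{i}^{2}$ and $\widehat{\mu}_{1}=\frac{1}{n}\sum_{i=1}^{n}u_{i}$. Therefore, the truncation point $M$ can be interpreted as the order of the highest moment considered to characterize the distribution of $U$. (The reader is directed to Section IV.3 for a discussion on the choice of $M$.)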
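The moment representation of $\widehat{LP}_{2}$ can be verified numerically. The sketch below uses an arbitrary sample on $[0,1]$ (a Beta(2, 3) draw, chosen only for illustration) and compares the direct average of $Leg_{2}(u_{i})$ with the expression in terms of the first two sample moments.

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.beta(2.0, 3.0, size=1_000)  # any sample on [0, 1] works for this identity

mu1 = u.mean()                      # first sample moment of U
mu2 = (u ** 2).mean()               # second sample moment of U

# Direct average of Leg_2(u_i) versus its expression through sample moments.
lp2_direct = np.mean(np.sqrt(5.0) * (6.0 * u ** 2 - 6.0 * u + 1.0))
lp2_moments = np.sqrt(5.0) * (6.0 * mu2 - 6.0 * mu1 + 1.0)

# The analogous identity for j = 1 only involves the first moment.
lp1_direct = np.mean(np.sqrt(12.0) * (u - 0.5))
lp1_moments = np.sqrt(12.0) * (mu1 - 0.5)
```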

II.4 The bias variance trade-off

In order to understand how good (9) is in estimating $d(u;G,F)$, we consider the Mean Integrated Squared Error (MISE) of $\widehat{d}(u;G,F)$, i.e.,

$$MISE=\sum_{j=1}^{M}\frac{\sigma_{j}^{2}}{n}+\sum_{j>M}LP_{j}^{2}, \tag{13}$$

where the first term in (13) corresponds to the integral of (10) over $[0,1]$, whereas the second term corresponds to the Integrated Squared Bias (IBS), i.e.,

$$IBS=\int_{0}^{1}\Bigl(E\bigl[\widehat{d}(u;G,F)\bigr]-d(u;G,F)\Bigr)^{2}\partial u. \tag{14}$$

Interestingly, the latter can also be specified as

$$IBS=\int_{0}^{1}\biggl(\frac{f(G^{-1}(u))-g(G^{-1}(u))}{g(G^{-1}(u))}\biggr)^{2}\partial u-\sum_{j=1}^{M}LP^{2}_{j} \tag{15}$$

(see derivations in Appendix B). The first term on the right hand side of (15) is particularly important in understanding the role played by $g$ in obtaining a reliable estimate of $f$. Specifically, the closer $g$ is to $f$, the lower the bias of $\widehat{d}(G(x);G,F)$ and $\widehat{f}(x)$ in (11).
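A short sketch of how the two expressions for the IBS are connected, using only (2), (4), (7) and the orthonormality of the basis (an informal argument, not a reproduction of Appendix B):

$$
\begin{aligned}
E\bigl[\widehat{d}(u;G,F)\bigr]-d(u;G,F) &= -\sum_{j>M}LP_{j}\,Leg_{j}(u)
\quad\Longrightarrow\quad IBS=\sum_{j>M}LP_{j}^{2},\\
\int_{0}^{1}\bigl(d(u;G,F)-1\bigr)^{2}\partial u &= \sum_{j\geq 1}LP_{j}^{2},
\qquad d(u;G,F)-1=\frac{f(G^{-1}(u))-g(G^{-1}(u))}{g(G^{-1}(u))}.
\end{aligned}
$$

Subtracting $\sum_{j=1}^{M}LP_{j}^{2}$ from the second line leaves exactly $\sum_{j>M}LP_{j}^{2}$, which is the IBS.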

Practical remarks. Equation (13) implies that larger values of $M$ do not necessarily lead to better estimates of $d(u;G,F)$. Specifically, when $n\rightarrow\infty$, the first term in (13) tends to zero. However, for large values of $M$, more and more terms contribute to it, and thus increasing $M$ may lead to a substantial inflation of the variance in (10). Conversely, the bias is not affected by the sample size and it can be controlled by choosing $g$ sufficiently close to $f$ (see (15)) and/or by increasing $M$ while preserving a good bias-variance trade-off.

Further remarks. Equation (6) implies that the estimator in (9) relies on the empirical distribution of the sample observed by means of the $\widehat{LP}_{j}$ estimates. Therefore, an estimator of the comparison density based entirely on the empirical cdf can be expressed by setting $M=n-1$ in (9). However, as discussed in this section, while this would reduce the bias, it would also increase the variance drastically. Therefore, for $M<n-1$, the estimator in (9) not only leads to a reduction of the variance but, in virtue of (15), its bias is mitigated when the postulated model $g$ is sufficiently close to the true pdf of the data $f$.
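The trade-off can be illustrated with a small Monte Carlo experiment. The sketch below assumes a toy comparison density $d(u)=2u$, for which $LP_{1}=1/\sqrt{3}$ and all higher coefficients vanish, and approximates the MISE for $M=0,\dots,8$ by averaging the integrated squared error, which by orthonormality equals $\sum_{j\leq M}(\widehat{LP}_{j}-LP_{j})^{2}+\sum_{j>M}LP_{j}^{2}$. The sample size, number of replicates and variable names are arbitrary choices made for illustration.

```python
import numpy as np
from numpy.polynomial import Legendre


def leg(j, u):
    """Normalized shifted Legendre polynomial Leg_j(u) on [0, 1]."""
    return np.sqrt(2 * j + 1) * Legendre.basis(j)(2.0 * np.asarray(u, dtype=float) - 1.0)


rng = np.random.default_rng(2)
n, reps, m_max = 200, 300, 8

true_lp = np.zeros(m_max)          # entries correspond to j = 1, ..., m_max
true_lp[0] = 1.0 / np.sqrt(3.0)    # d(u) = 2u has LP_1 = 1/sqrt(3), the rest are zero

ise = np.zeros((reps, m_max + 1))
for r in range(reps):
    u = rng.beta(2.0, 1.0, size=n)  # sample with density d(u) = 2u
    lp_hat = np.array([leg(j, u).mean() for j in range(1, m_max + 1)])
    sq_err = (lp_hat - true_lp) ** 2
    for M in range(m_max + 1):
        # estimation error of the retained terms + squared bias of the dropped terms
        ise[r, M] = sq_err[:M].sum() + (true_lp[M:] ** 2).sum()

mise = ise.mean(axis=0)  # Monte Carlo approximation of the MISE for M = 0, ..., 8
```

In this toy setting the bias vanishes as soon as $M\geq 1$, so `mise` drops sharply from $M=0$ to $M=1$ and then grows with $M$, as each additional term only contributes variance.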

III Data-driven corrections for misspecified background models

Let $\bm{x}_{\text{B}}=(x_{1},\dots,x_{N})$ be a sample of observations from control regions or the result of Monte Carlo simulations, where we expect no signal to be present. Hereafter, we refer to $\bm{x}_{\text{B}}$ as the source-free sample. Therefore, $\bm{x}_{\text{B}}$ can be used to “learn” the unknown pdf of the background, namely $f_{b}(x)$ , and obtain an estimate for it via (11).

Despite the true background model being unknown, suppose that a candidate pdf, namely $g_{b}(x)$ , is available. The candidate model $g_{b}(x)$ can be specified from previous experiments or theoretical results or can be obtained by fitting specific functions (e.g., polynomial, exponential, etc.) to $\bm{x}_{\text{B}}$ . If $g_{b}(x)$ does not provide an accurate description of $f_{b}(x)$ , the sensitivity of the experiment can be strongly affected.

Consider, for instance, a source-free sample of $N=5000$ observations whose true (unknown) distribution corresponds to the tail of a Gaussian with mean $55$ and width $15$ over the range $[0,50]$, i.e.,

$$f_{b}(x)=\frac{e^{-\frac{1}{2}\bigl(\frac{x-55}{15}\bigr)^{2}}}{k_{fb}}, \tag{16}$$

with $k_{fb}=\int_{0}^{50}e^{-\frac{1}{2}\bigl(\frac{x-55}{15}\bigr)^{2}}\partial x$. Suppose that a candidate model for the background is obtained by fitting a second-degree polynomial to the source-free sample and normalizing it in order to obtain a proper pdf, i.e.,

$$g_{b}(x)=\frac{9.52-2.22x+0.15x^{2}}{k_{gb}}, \tag{17}$$

with $k_{gb}=\int_{0}^{50}[9.52-2.22x+0.15x^{2}]\partial x$. For illustrative purposes, assume that the distribution of the signal is a Gaussian centered at $25$, with width $4.5$ and pdf

$$f_{s}(x)=\frac{e^{-\frac{1}{2}\bigl(\frac{x-25}{4.5}\bigr)^{2}}}{k_{fs}}, \tag{18}$$

with $k_{fs}=\int_{0}^{50}e^{-\frac{1}{2}\bigl(\frac{x-25}{4.5}\bigr)^{2}}\partial x$. The histogram of the source-free sample along with (16)-(18) is shown in Fig. 1. At the higher end of the spectrum, the postulated background (red dashed line) underestimates the true background distribution (green solid line). As a result, using (17) as background model increases the chance of false discoveries in this region. Conversely, at the lower end of the spectrum, $g_{b}(x)$ overestimates $f_{b}(x)$, reducing the sensitivity of the analysis. For the sake of comparison, a kernel density estimate (orange dot-dashed line) has been computed by selecting the bandwidth parameter as recommended in [sheater]. The latter exhibits substantial bias at the boundary and appears to overfit the data sample.

It is important to point out that the discrepancy of $f_{b}(x)$ from $g_{b}(x)$ is typically due to the fact that the specific functional form imposed (in our example, a second-degree polynomial) is not adequate for the data. Thus, changing the values of the fitted parameters (or assigning priors to them) is unlikely to solve the problem. However, it is possible to “repair” $g_{b}(x)$ and obtain a suitable estimate of $f_{b}(x)$ by means of (11). Specifically, $f_{b}(x)$ can be estimated via

$$\widehat{f}_{b}(x)=g_{b}(x)\,\widehat{d}(G_{b}(x);G_{b},F_{b}), \tag{19}$$

where $\widehat{d}(G_{b}(x);G_{b},F_{b})$ is the comparison density estimated via (9) on the sample $G_{b}(x_{1}),\dots,G_{b}(x_{N})$, whereas $F_{b}$ and $G_{b}$ are the true and the postulated background distributions, with pdfs as in (16) and (17), respectively.

In our example, choosing $M=2$ (see Section IV.3), we obtain

$$\widehat{d}(G_{b}(x);G_{b},F_{b})=1+0.063\,Leg_{1}[G_{b}(x)]-0.082\,Leg_{2}[G_{b}(x)], \tag{20}$$

where $Leg_{1}[G_{b}(x)]$ and $Leg_{2}[G_{b}(x)]$ are the first and second normalized shifted Legendre polynomials evaluated at $G_{b}(x)$. Notice that, by combining (19) and (20), we can easily write the background model using a series of shifted Legendre polynomials. This may be especially useful when dealing with complicated likelihoods for which a functional form is difficult to specify.
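The calibration of the background model in this toy example can be sketched as follows. The code simulates a source-free sample from the truncated Gaussian described above, uses the quadratic coefficients $9.52$, $-2.22$ and $0.15$ of the postulated model, and estimates the comparison density with $M=2$. All function and variable names are hypothetical, the random seed is arbitrary, and the estimated coefficients depend on the simulated sample, so they are not expected to reproduce the values reported in the text exactly.

```python
import numpy as np
from numpy.polynomial import Legendre
from scipy import stats

LOW, HIGH = 0.0, 50.0
C0, C1, C2 = 9.52, -2.22, 0.15  # coefficients of the postulated quadratic background


def leg(j, u):
    """Normalized shifted Legendre polynomial Leg_j(u) on [0, 1]."""
    return np.sqrt(2 * j + 1) * Legendre.basis(j)(2.0 * np.asarray(u, dtype=float) - 1.0)


def antiderivative(x):
    """Antiderivative of the unnormalized quadratic."""
    return C0 * x + C1 * x ** 2 / 2.0 + C2 * x ** 3 / 3.0


K_GB = antiderivative(HIGH) - antiderivative(LOW)  # normalizing constant k_gb


def g_b(x):
    """Postulated background pdf on [0, 50]."""
    return (C0 + C1 * x + C2 * x ** 2) / K_GB


def G_b(x):
    """Postulated background cdf on [0, 50]."""
    return (antiderivative(x) - antiderivative(LOW)) / K_GB


# Source-free sample from the true background: Gaussian(55, 15) truncated to [0, 50].
true_bkg = stats.truncnorm((LOW - 55.0) / 15.0, (HIGH - 55.0) / 15.0, loc=55.0, scale=15.0)
x_b = true_bkg.rvs(size=5000, random_state=np.random.default_rng(3))

M = 2
lp_hat = np.array([leg(j, G_b(x_b)).mean() for j in range(1, M + 1)])

grid = np.linspace(LOW, HIGH, 501)
d_hat = 1.0 + sum(lp * leg(j, G_b(grid)) for j, lp in enumerate(lp_hat, start=1))
f_b_hat = g_b(grid) * d_hat  # calibrated background estimate on the grid
```

Because the higher-order basis functions integrate to zero on $[0,1]$, the calibrated estimate `f_b_hat` integrates to one over $[0,50]$ whatever the estimated coefficients are.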

The upper panel of Fig. 1 shows the “calibrated” background model in (19) as a purple dot-dashed line; it matches the true background density in (16) (green solid line) almost exactly. The plot of $\widehat{d}(G_{b}(x);G_{b},F_{b})$ in the bottom panel of Fig. 1 provides important insights on the deficiencies of (17) as a candidate background model. Specifically, the magnitude and the direction of the departure of $\widehat{d}(G_{b}(x);G_{b},F_{b})$ from one correspond to the estimated departure of $f_{b}(x)$ from $g_{b}(x)$ for each value of $x$. Therefore, if $\widehat{d}(G_{b}(x);G_{b},F_{b})$ is below one in the region where we expect the signal to occur, using $\widehat{f_{b}}(x)$ in place of $g_{b}(x)$ increases the sensitivity of the analysis. Conversely, if $\widehat{d}(G_{b}(x);G_{b},F_{b})$ is above one outside the signal region, the use of $\widehat{f_{b}}(x)$ instead of $g_{b}(x)$ helps to prevent false discoveries.

Notice that in this article we only consider continuous data. In this respect, the goal is to learn the model of the background considered as a continuum and no binning is applied. Therefore, the histograms presented here are only a graphical tool used to display the data distribution and are not intended to represent an actual binning of the data.

IV LP-based inference

When discussing the skew-G density model in (1), we have seen that $f\equiv g$ if and only if $d(u;G,F)=1$ for all $u\in[0,1]$. Additionally, the graph of $\widehat{d}(u;G,F)$ provides an exploratory tool to understand the nature of the deviation of $f(x)$ from $g(x)$. This section introduces a novel inferential framework to test the significance of the departure of $f(x)$ from $g(x)$. Specifically, our goal is to test the hypotheses

$$H_{0}:d(u;G,F)=1\ \text{for all } u\in[0,1]\quad\text{vs}\quad H_{1}:d(u;G,F)\neq 1\ \text{for some } u\in[0,1]. \tag{21}$$

First, an overall test, namely the deviance test, is presented. The deviance test assesses if $f(x)$ deviates significantly from $g(x)$ anywhere over the range of $x$ considered. Second, adequate confidence bands are constructed in order to assess where significant departures occur.

IV.1 The deviance test

Recall that the $LP_{j}$ coefficients in (4) are defined as $LP_{j}=\int^{1}_{0}Leg_{j}(u)d(u;G,F)\partial u$. Consequently, by orthogonality of the $\{Leg_{j}(u)\}_{j>0}$ polynomials to $Leg_{0}(u)=1$, when $H_{0}$ in (21) is true all the $LP_{j}$ coefficients are equal to zero, including the first $M$ of them. We can then quantify the departure of $\widehat{d}(u;G,F)$ from one by means of the deviance statistic [LPFdr], which is defined as $\sum_{j=1}^{M}\widehat{LP}^{2}_{j}$. If the deviance is equal to zero, we may expect that $g$ is approximately equivalent to $f$; hence, we test

$$H_{0}:\sum_{j=1}^{M}LP_{j}^{2}=0\quad\text{vs}\quad H_{1}:\sum_{j=1}^{M}LP_{j}^{2}>0 \tag{22}$$

by means of the test statistic

$$D_{M}=n\sum_{j=1}^{M}\widehat{LP}_{j}^{2}. \tag{23}$$

It can be shown [LPmode] that, under $H_{0}$ and as $n\rightarrow\infty$,

$$\sqrt{n}\,\widehat{LP}_{j}\xrightarrow{d}N(0,1), \tag{24}$$

where $\xrightarrow{d}$ denotes convergence in distribution, and thus, under $H_{0}$, $D_{M}$ is asymptotically $\chi^{2}_{M}$-distributed. Hence, an asymptotic p-value for (22) is given by

$$P(D_{M}>d_{M})\xrightarrow[n\to\infty]{}P(\chi_{M}^{2}>d_{M}), \tag{25}$$

where $d_{M}$ is the value of $D_{M}$ observed on the data.
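The deviance test reduces to a few lines of code once the sample has been mapped to $u_{i}=G(x_{i})$. The sketch below is a minimal illustration with hypothetical helper names (`leg`, `deviance_test`); it computes $D_{M}$ and its asymptotic $\chi^{2}_{M}$ p-value for one sample generated under the null (uniform $u_{i}$) and one generated under an arbitrary alternative with comparison density $d(u)=2u$.

```python
import numpy as np
from numpy.polynomial import Legendre
from scipy import stats


def leg(j, u):
    """Normalized shifted Legendre polynomial Leg_j(u) on [0, 1]."""
    return np.sqrt(2 * j + 1) * Legendre.basis(j)(2.0 * np.asarray(u, dtype=float) - 1.0)


def deviance_test(u, M):
    """Deviance statistic D_M = n * sum_j LP_hat_j^2 and its asymptotic chi2_M p-value."""
    u = np.asarray(u, dtype=float)
    lp_hat = np.array([leg(j, u).mean() for j in range(1, M + 1)])
    d_m = u.size * np.sum(lp_hat ** 2)
    return d_m, stats.chi2.sf(d_m, df=M)


rng = np.random.default_rng(4)
d_null, p_null = deviance_test(rng.uniform(size=500), M=2)        # f == g
d_alt, p_alt = deviance_test(rng.beta(2.0, 1.0, size=500), M=2)   # d(u) = 2u
```

Under the alternative the statistic grows linearly with the sample size, so `p_alt` is vanishingly small, whereas `p_null` behaves approximately like a draw from a uniform distribution on $[0,1]$.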

Practical remarks. Notice that $H_{1}$ in (22) implies $H_{1}$ in (21). Similarly, $H_{0}$ in (21) implies $H_{0}$ in (22); however, the opposite is not true in general since there may be some non-zero $LP_{j}$ coefficients for $j>M$. Therefore, choosing $M$ too small may lead to conservative, yet still valid, inference.

IV.2 Confidence bands

The estimator in (9) only accounts for the first $M+1$ terms of the polynomial series in (4). Therefore, $\widehat{d}(u;G,F)$ is a biased estimator of $d(u;G,F)$ . Specifically, as discussed in Section II.4, the integrated bias is given by $\sum_{j>M}LP^{2}_{j}$ , whereas, as show in Appendix B, the bias at a given point $u$ is given by $\sum_{j>M}LP_{j}Leg(u)$ . It follows that, when the bias is large, confidence bands based on $\widehat{d}(u;G,F)$ are shifted away from the true density $d(u;G,F)$ .

Despite the bias cannot be easily quantified in the general setting, it follows from (8) that, when $H_{0}$ in (21) (and consequently $H_{0}$ in (22)) is true, both the bias at a point $u$ and the integrated bias are equal to zero. Thus, we can exploit this property to construct reliable confidence bands under the null. Specifically, the goal is to identify $c_{\alpha}$ , such that

[TABLE]

where $\alpha$ is the desired significance level. (Footnote 3: In astrophysics, the statistical significance $\alpha$ is often expressed in terms of the number of $\sigma$ -deviations from the mean of a standard normal. For instance, a 2 $\sigma$ significance corresponds to $\alpha=1-\Phi(2)=0.0227$ , where $\Phi(\cdot)$ denotes the cdf of a standard normal.)
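The correspondence between $\sigma$ -levels and tail areas used throughout can be sketched in a few lines of Python. The helper names below are hypothetical; the conversion is the one-sided rule $\alpha=1-\Phi(\cdot)$ stated in the footnote, evaluated through the complementary error function so that very small tail areas remain accurate.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and standard deviation 1

def sigma_to_alpha(n_sigma):
    """One-sided tail area alpha = 1 - Phi(n_sigma)."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

def alpha_to_sigma(alpha):
    """Number of sigma such that 1 - Phi(sigma) = alpha, for 0 < alpha < 1."""
    # By symmetry of the normal, the (1 - alpha) quantile equals minus the alpha quantile.
    return -_STD_NORMAL.inv_cdf(alpha)

for level in (2, 3, 5):
    print(level, sigma_to_alpha(level))
```

The same conversion can be applied to any adjusted p-value in $(0,1)$ to express it as a number of $\sigma$ .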

If the bias determines where the confidence bands are centered, the distribution and the variance of $\widehat{d}(u;G,F)$ determine their width. As discussed in Section II.3 (see (8)), under $H_{0}$ in (21), the $\widehat{LP}_{j}$ estimates have mean zero, variance $\frac{1}{n}$ and they are uncorrelated with one another. Therefore, when $f\equiv g$ , the standard error of $\widehat{d}(u;G,F)$ corresponds to the square root of (10) with $\sigma^{2}_{j}=1$ and $\sigma^{2}_{jk}=0$ , i.e.,

[TABLE]

Additionally, (24) implies that $\widehat{d}(u;G,F)$ is asymptotically normally distributed, hence

[TABLE]

as $n\rightarrow\infty$ , for all $u\in[0,1]$ , under $H_{0}$ .

We can then construct approximate confidence bands under $H_{0}$ which satisfy (26) by means of tube formulae (see [larry, Ch.5] and PL05 ), i.e.,

[TABLE]

where $c_{\alpha}$ is the solution of

[TABLE]

with $k_{0}=\sqrt{\sum^{M}_{j=1}[\frac{\partial}{\partial u}Leg_{j}(u)]^{2}}$ . If $\widehat{d}(u;G,F)$ is within the bands in (29) over the entire range $[0,1]$ , we conclude that there is no evidence that $f$ deviates significantly from $g$ anywhere over the range considered and at confidence level $1-\alpha$ . Conversely, we expect significant departures to occur in regions where $\widehat{d}(u;G,F)$ lies outside the confidence bands.
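A minimal sketch of how the null bands could be evaluated on a grid is given below. It assumes that the standard error under $H_{0}$ takes the form $\sqrt{\sum_{j=1}^{M}Leg_{j}^{2}(u)/n}$ (i.e., (10) with $\sigma^{2}_{j}=1$ and $\sigma^{2}_{jk}=0$ ) and that the bands are centered at one, the value of the comparison density when $f\equiv g$ . The constant $c_{\alpha}$ is taken as an input rather than solved for, and the value used in the example is only a placeholder; all function names are hypothetical.

```python
import math

def shifted_legendre(j, u):
    """Normalized shifted Legendre polynomial Leg_j(u) on [0, 1] (assumed form)."""
    x = 2.0 * u - 1.0
    if j == 0:
        return 1.0
    p_prev, p = 1.0, x
    for k in range(1, j):  # Bonnet recurrence for the Legendre polynomial P_{k+1}
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return math.sqrt(2 * j + 1) * p

def null_standard_error(u, M, n):
    """Assumed standard error of d_hat(u) under H0: sqrt(sum_j Leg_j(u)^2 / n)."""
    return math.sqrt(sum(shifted_legendre(j, u) ** 2 for j in range(1, M + 1)) / n)

def null_bands(u_grid, M, n, c_alpha):
    """(lower, upper) band around d = 1 at each grid point, for a supplied c_alpha."""
    bands = []
    for u in u_grid:
        half_width = c_alpha * null_standard_error(u, M, n)
        bands.append((1.0 - half_width, 1.0 + half_width))
    return bands

def outside_bands(d_hat_values, bands):
    """Flag grid points where the estimated comparison density leaves the bands."""
    return [not (lo <= d <= hi) for d, (lo, hi) in zip(d_hat_values, bands)]

u_grid = [i / 10 for i in range(11)]
bands = null_bands(u_grid, M=4, n=1300, c_alpha=2.0)  # c_alpha = 2.0 is a placeholder
print(bands[0], bands[5], bands[-1])
```

Because the polynomials attain their largest magnitude at the boundaries, bands built this way are widest near $u=0$ and $u=1$ .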

Practical remarks. Notice that, under $H_{0}$ in (22), the $\widehat{d}(u;G,F)$ is an unbiased estimator of ${d}(u;G,F)$ , regardless of the choice of $M$ . This implies that the confidence bands in (29) are only affected by the variance and asymptotic distribution of $\widehat{d}(u;G,F)$ under $H_{0}$ .

IV.3 Choice of $M$

The number of $\widehat{LP}_{j}$ estimates considered determines the level of “smoothness” of $\widehat{d}(u;G,F)$ , with smaller values of $M$ leading to smoother estimates. (Footnote 4: As an anonymous referee correctly pointed out, $\widehat{d}(u;G,F)$ is always smooth as it is constructed as a series of infinitely differentiable functions. In statistics, however, the word “smoothness” is often used to indicate the flexibility of the estimator considered or, in other words, its degrees of freedom. Often, this is quantified in terms of the magnitude of the second derivative of the function considered. Despite the abuse of terminology, throughout the manuscript we will refer to the latter definition of smoothness.) The deviance test can be used to select the value $M$ which maximizes the sensitivity of the analysis according to the following scheme:

i. Choose a sufficiently large value $M_{\max}$ .

ii. Obtain the estimates $\widehat{LP}_{1},\dots,\widehat{LP}_{M_{\max}}$ as in (5).

iii. For $m=1,\dots,M_{\max}$ : calculate the deviance test p-value as in (25), i.e.,

[TABLE]

with $d_{m}=n\sum_{j=1}^{m}\widehat{LP}^{2}_{j}$ .

iv. Choose $M$ such that

[TABLE]
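Steps i-iv can be sketched as follows. The snippet takes the $\widehat{LP}_{j}$ estimates as given and assumes that the rule in (32) selects the value of $m$ with the smallest deviance p-value, which is how the selection is described in the text ("maximizes the sensitivity"); the function names and the numerical inputs are hypothetical.

```python
import math

def chi2_sf(x, dof):
    """P(chi2_dof > x) for integer dof, via the closed-form incomplete-gamma series."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if dof % 2 == 0:
        term, total = math.exp(-h), 0.0
        for k in range(dof // 2):
            total += term
            term *= h / (k + 1)
        return total
    total = math.erfc(math.sqrt(h))
    term = math.exp(-h) * math.sqrt(h) / math.gamma(1.5)
    for k in range(1, (dof - 1) // 2 + 1):
        total += term
        term *= h / (k + 0.5)
    return total

def select_M(lp_hat, n):
    """lp_hat = [LP_1, ..., LP_Mmax]; return (selected M, list of p-values p(1..Mmax))."""
    p_values = []
    cumulative = 0.0
    for m, coef in enumerate(lp_hat, start=1):
        cumulative += coef * coef            # running sum of squared estimates
        p_values.append(chi2_sf(n * cumulative, m))  # d_m = n * sum_{j<=m} LP_j^2
    # Assumed reading of (32): pick the m with the smallest p-value.
    best_m = min(range(1, len(lp_hat) + 1), key=lambda m: p_values[m - 1])
    return best_m, p_values

lp_hat = [0.01, 0.12, 0.005, 0.004]  # made-up estimates, for illustration only
M, p_values = select_M(lp_hat, n=1000)
print(M, p_values)
```

In this made-up example only the second coefficient is sizeable, so adding further terms increases the degrees of freedom without a matching increase in the deviance.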

IV.3.1 Adjusting for post-selection

As with any data-driven selection process, the scheme presented above affects the distribution of (9) and can lead to overly optimistic inference xiaotong , potscher . Despite this aspect being often ignored in practical applications, correct coverage can only be guaranteed if adequate corrections are implemented.

The issues arising in the context of post-selection inference can be interpreted in terms of the look-elsewhere effect gv10 , meJINST , where one has to adjust the inference for the fact that, in practice, many different models have been considered and, consequently, many different tests have been conducted for the sake of assessing the goodness of fit.

In our setting, the number of models under comparison is typically small ( $M_{\max}\leq 20$ ); therefore, post-selection inference can be easily adjusted by means of Bonferroni’s correction bonferroni35 . Specifically, the adjusted deviance p-value is given by

[TABLE]

where $M$ is the value selected via (32), whereas confidence bands can be adjusted by substituting $c_{\alpha}$ in (29), with $c_{\alpha,M_{\max}}$ satisfying

[TABLE]
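The Bonferroni adjustment of the deviance p-value can be illustrated with the short sketch below. It assumes that the adjusted p-value is the raw p-value of the selected model multiplied by $M_{\max}$ and left uncapped, which is consistent with the adjusted p-values larger than one reported later in the paper; function names and inputs are hypothetical.

```python
from statistics import NormalDist

def bonferroni_adjust(p_raw, M_max):
    """Assumed adjustment: M_max times the raw deviance p-value, not capped at 1."""
    return M_max * p_raw

def sigma_significance(p):
    """One-sided number of sigma for a p-value in (0, 1); None otherwise."""
    if not 0.0 < p < 1.0:
        return None
    return -NormalDist().inv_cdf(p)

p_adj = bonferroni_adjust(2.435e-6, M_max=20)  # illustrative raw p-value
print(p_adj, sigma_significance(p_adj))
print(bonferroni_adjust(0.06, M_max=20), sigma_significance(1.2))
```

When the product exceeds one, no $\sigma$ -level is returned, mirroring the cases reported simply as "adjusted p-value $>1$ ".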

Practical remarks. As noted in Section II, the estimate (9) involves the first $M$ sample moments of $U$ ; therefore, $M_{\max}$ can be interpreted as the order of the highest moment which we expect to contribute in discriminating the distribution of $U$ from uniformity. Notice that, in addition to the inflation of the variance of (9), when $M$ is large, the computation of normalized shifted Legendre polynomials of higher order may face numerical instability (see Section VIII.2). Therefore, as a rule of thumb, $M_{\max}$ is typically chosen $\leq 20$ . Finally, Steps i-iv aim to select the approximant based on the first $M$ most significant moments, while excluding powers of higher order. A further note on model-denoising is given in Section VIII.

V A data-scientific approach to signal searches

The tools presented in Sections II and IV provide a natural framework to simultaneously

(a) assess the validity of the postulated background model and, if necessary, update it using the data (Section III);

(b) perform signal detection on the physics sample;

(c) characterize the signal when a model for it is not available.

Furthermore, if the model for the signal is known (up to some free parameters), it is possible to

(d) further refine the background or signal distribution;

(e) detect hidden signals from new unexpected sources.

Notice that, since Bonferroni’s correction leads to an upper bound for the overall significance, the resulting coverage will be higher than the nominal one. Alternatively, approximate post-selection confidence bands and inference can be constructed using Monte Carlo and/or resampling methods and repeating the selection process at each replicate.
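A Monte Carlo alternative to the Bonferroni correction can be sketched as follows. The snippet simulates samples under the null, where $U=G(X)$ is uniform on $[0,1]$ , repeats the selection over $m=1,\dots,M_{\max}$ at each replicate, and compares the smallest p-value observed on the data with its simulated null distribution. It assumes that the estimates in (5) are sample means of $Leg_{j}(u_{i})$ and that selection amounts to taking the smallest p-value; the names, sample sizes and number of replicates are illustrative only.

```python
import math
import random

def shifted_legendre(j, u):
    """Normalized shifted Legendre polynomial Leg_j(u) on [0, 1] (assumed form)."""
    x = 2.0 * u - 1.0
    if j == 0:
        return 1.0
    p_prev, p = 1.0, x
    for k in range(1, j):  # Bonnet recurrence for the Legendre polynomial P_{k+1}
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return math.sqrt(2 * j + 1) * p

def chi2_sf(x, dof):
    """P(chi2_dof > x) for integer dof, via the closed-form incomplete-gamma series."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if dof % 2 == 0:
        term, total = math.exp(-h), 0.0
        for k in range(dof // 2):
            total += term
            term *= h / (k + 1)
        return total
    total = math.erfc(math.sqrt(h))
    term = math.exp(-h) * math.sqrt(h) / math.gamma(1.5)
    for k in range(1, (dof - 1) // 2 + 1):
        total += term
        term *= h / (k + 0.5)
    return total

def min_p_value(u_values, M_max):
    """Smallest deviance p-value over m = 1, ..., M_max (the selection step)."""
    n = len(u_values)
    cumulative, best = 0.0, 1.0
    for m in range(1, M_max + 1):
        lp_m = sum(shifted_legendre(m, u) for u in u_values) / n
        cumulative += lp_m * lp_m
        best = min(best, chi2_sf(n * cumulative, m))
    return best

def monte_carlo_adjusted_p(u_obs, M_max, n_rep=200, seed=0):
    """Fraction of null replicates whose selected p-value is at most the observed one."""
    rng = random.Random(seed)
    n = len(u_obs)
    observed = min_p_value(u_obs, M_max)
    hits = 0
    for _ in range(n_rep):
        u_null = [rng.random() for _ in range(n)]  # U is uniform when f = g
        if min_p_value(u_null, M_max) <= observed:
            hits += 1
    return (hits + 1) / (n_rep + 1)  # add-one rule keeps the estimate strictly positive

rng_data = random.Random(11)
u_obs = [rng_data.random() ** 2 for _ in range(300)]  # clearly non-uniform toy sample
p_adj = monte_carlo_adjusted_p(u_obs, M_max=5)
print(p_adj)
```

The resolution of such an estimate is limited by the number of replicates, so very small adjusted p-values would require either many more simulations or an analytical bound such as Bonferroni's.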

Tasks (a)-(e) can be tackled in two main phases. In the first phase, the postulated background model is “calibrated” on a source-free sample in order to improve the sensitivity of the analysis and reduce the risk of false discoveries. The second phase focuses on searching for the signal of interest and involves both a nonparametric signal detection stage and a semiparametric stage for signal characterization. Both phases and respective steps are described in detail below and summarized in Algorithm 1.

V.1 Background calibration

As discussed in Section III, deviations of $\widehat{d}(G_{b}(x);G_{b},F_{b})$ from one suggest that a further refinement of the candidate background model $g_{b}$ is needed. However, as $M$ increases, the deviations of $\widehat{d}(G_{b}(x);G_{b},F_{b})$ from one may become more and more prominent while the variance inflates. Thus, it is important to assess if such deviations are indeed significant. In order to address this task, the analysis of Section III can be further refined in light of the inferential tools introduced in Section IV.

For the toy example discussed in Section III, we have seen that $g_{b}$ overestimates $f_{b}$ in the signal region and underestimates it at the higher end of the range considered (Fig. 1). We can now assess if any of these deviations are significant by implementing the deviance test in (23)-(25), whereas, to identify where the most significant departures occur, we construct confidence bands under the null model as in (29), i.e., assuming that no “update” of $g_{b}$ is necessary.

The results are collected in the comparison density plot or CD plot presented in Fig. 2. First, a value $M=2$ has been selected as in (32), and the respective deviance test (adequately adjusted via Bonferroni) indicates that the deviation of $f_{b}$ from $g_{b}$ is significant at a $6.430\sigma$ significance level (adjusted p-value of $6.397\cdot 10^{-11}$ ). Additionally, the estimated comparison density in (20) lies outside the $2\sigma$ confidence bands in the region $[0,50]$ where the signal specified in (18) is expected to occur. Hence, using (19) instead of (17) is recommended in order to improve the sensitivity of the analysis in the signal region.

Important remarks on the CD plot. When comparing different models for the background or when assessing if the data distribution deviates from the model expected when no signal is present, it is common practice to visualize the results of the analysis by superimposing the models under comparison to the histogram of the data observed on the original scale (e.g., upper panel of Fig. 1). This corresponds to a data visualization in the density domain. Conversely, the CD plot (e.g., Fig. 2) provides a representation of the data in the quantile domain, which offers the advantage of connecting the true density of the data with the quantiles of the postulated model (see (2)-(3)). Consequently, the most substantial departures of the data distribution from the expected model are magnified, and those due to random fluctuations are smoothed out (see, also, Section VII.2). Furthermore, the deviance tests and the CD plot together provide a powerful goodness-of-fit and exploratory tool which, unlike classical methods such as Anderson-Darling anderson and Kolmogorov-Smirnov darling , not only allows one to test whether the distributions under comparison differ, but also to assess how and where they differ. As a result, the CD plot can be used to characterize the unknown signal distribution (see Section V.2.2) and to identify exclusion regions (e.g., Case I in Section V.2.1).

As an additional advantage, the deviance test appears to enjoy higher detection power than classical approaches. This aspect is highlighted in Table 1, where several methods for goodness of fit or two-sample comparisons are implemented, along with the deviance test, for all the cases discussed in Section V.

Reliability of the calibrated background model. The size $N$ of the source-free sample plays a fundamental role in the validity of $\widehat{f}_{b}(x)$ as a reliable background model. Specifically, the randomness involved in (19) only depends on the $\widehat{LP}_{j}$ estimates. If $N$ is sufficiently large, by the strong law of large numbers,

[TABLE]

Therefore, despite the variance of $\widehat{f}_{b}(x)$ becoming negligible as $N\rightarrow\infty$ , one has to account for the fact that $\widehat{f}_{b}(x)$ leads to a biased estimate of $f_{b}(x)$ when $f_{b}\not\equiv g_{b}$ (see Section II.4). For sufficiently smooth densities, a visual inspection is often sufficient to assess if $\widehat{d}(u;G_{b},F_{b})$ (and, consequently, $\widehat{f}_{b}(x)$ ) provides a satisfactory fit for the data, whereas, for more complex distributions the effect of the bias can be mitigated considering larger values of $M$ and model-denoising (see Section VIII.1).

V.2 Signal search

V.2.1 Nonparametric signal detection

The background calibration phase allows the specification of a well tailored model for the background, namely $\widehat{f}_{b}(x)$ , which simultaneously integrates the initial guess, $g_{b}$ , and the information carried by the source-free data sample. Hereafter, we disregard the source-free data sample and focus on analyzing the physics sample.

Under the assumption that the source-free sample has no significant source contamination, we expect that, if the signal is absent, both the source-free and the physics sample follow the same distribution. Therefore, the calibrated background model, $\widehat{f}_{b}(x)$ , plays the role of the postulated distribution for the physics sample, i.e., the model that we expect the data to follow when no signal is present; hence, we set $g(x)=\widehat{f}_{b}(x)$ .

Let $f(x)$ be the (unknown) true pdf of the physics sample which may or may not carry evidence in favor of the source of interest. When no model for the signal is specified, it is reasonable to consider any significant deviation of $f$ from $g$ as an indication that a signal of unknown nature may be present. In this setting, similarly to the background calibration phase, we can construct deviance tests and CD plots to assess if and where significant departures of $f$ from $g$ occur. Two possible scenarios are considered – a physics sample which collects only background data (Case I) and a physics sample of observations from both background and signal (Case II).

Case I: background-only. Let $\bm{x}$ be a physics sample of $n=1300$ observations whose true (unknown) pdf $f(x)$ is equivalent to $f_{b}(x)$ in (16). We set

[TABLE]

where $g_{b}(x)$ and $\widehat{d}(G_{b}(x);G_{b},F_{b})$ are defined as in (17) and (20), respectively. The resulting CD plot and deviance test are reported in the left panel of Fig. 3.
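The construction of the calibrated background model can be illustrated with a toy example. The sketch below assumes that the calibrated density is the product $g_{b}(x)\,\widehat{d}(G_{b}(x);G_{b},F_{b})$ , with $\widehat{d}(u)=1+\sum_{j=1}^{M}\widehat{LP}_{j}Leg_{j}(u)$ and the $\widehat{LP}_{j}$ computed as sample means of $Leg_{j}(G_{b}(x_{i}))$ on the source-free sample. The postulated and true densities used here are hypothetical and unrelated to (16)-(17).

```python
import math
import random

def shifted_legendre(j, u):
    """Normalized shifted Legendre polynomial Leg_j(u) on [0, 1] (assumed form)."""
    x = 2.0 * u - 1.0
    if j == 0:
        return 1.0
    p_prev, p = 1.0, x
    for k in range(1, j):  # Bonnet recurrence for the Legendre polynomial P_{k+1}
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return math.sqrt(2 * j + 1) * p

def calibrate_background(sample, g_b, G_b, M):
    """Return (f_b_hat, lp_hat), where f_b_hat(x) = g_b(x) * d_hat(G_b(x))."""
    n = len(sample)
    u = [G_b(x) for x in sample]
    lp_hat = [sum(shifted_legendre(j, ui) for ui in u) / n for j in range(1, M + 1)]

    def d_hat(v):
        # Truncated series: 1 + sum_j LP_j * Leg_j(v)
        return 1.0 + sum(c * shifted_legendre(j, v) for j, c in enumerate(lp_hat, start=1))

    def f_b_hat(x):
        return g_b(x) * d_hat(G_b(x))

    return f_b_hat, lp_hat

# Hypothetical postulated background on [0, 1]: g_b(x) = 2(1 - x).
def g_b(x):
    return 2.0 * (1.0 - x)

def G_b(x):
    return 1.0 - (1.0 - x) ** 2

# Hypothetical source-free sample from the "true" background f_b(x) = 3(1 - x)^2,
# drawn by inverting its cdf 1 - (1 - x)^3.
rng = random.Random(7)
source_free = [1.0 - rng.random() ** (1.0 / 3.0) for _ in range(20000)]

f_b_hat, lp_hat = calibrate_background(source_free, g_b, G_b, M=3)
print(g_b(0.0), f_b_hat(0.0))  # the true density at x = 0 equals 3
```

Since the non-constant polynomials integrate to zero over $[0,1]$ , an estimate built this way integrates to one, although it is not guaranteed to be non-negative everywhere.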

When applying the scheme in Section IV.3 with $M_{\max}=20$ , none of the values of $M$ considered leads to significant results; therefore, for the sake of comparison with Case II below, we choose $M=4$ . Not surprisingly, the estimated comparison density approaches one over the entire range and lies entirely within the confidence bands. This suggests that the true distribution of the data does not differ significantly from the model which accounts only for the background. Similarly, the deviance test leads to very low significance (adjusted p-value $>1$ ); hence, we conclude that our physics sample does not provide evidence in favor of the new source.

Case II: background + signal. Let $\bm{x}$ be a physics sample of $n=1300$ observations whose true (unknown) pdf $f(x)$ is equal to $f_{bs}(x)$ in (36)

[TABLE]

with $f_{b}(x)$ and $f_{s}(x)$ defined as in (16) and (18) respectively, and $\eta=0.15$ . The histogram of the data and the graph of $f_{bs}(x)$ are plotted in Fig. 4. As in Case I, we set $g(x)$ as in (35).

The CD plot and deviance test in the right panel of Fig. 3 show a significant departure of the data distribution from the background-only model in (35). The maximum significance of the deviance is achieved at $M=4$ , leading to a rejection of the null hypothesis at a $11.611\sigma$ significance level (adjusted p-value $=1.799\cdot 10^{-31}$ ). The CD plot shows a prominent peak at the lower end of the spectrum; hence, we conclude that there is evidence in favor of the signal, and we proceed to characterize its distribution as described in Section V.2.2.

V.2.2 Semiparametric signal characterization

The signal detection strategy proposed in Section V.2.1 does not require the specification of a distribution for the signal. However, if a model for the signal is known (up to some free parameters), the analysis can be further refined by providing a parametric estimate of the comparison density and assessing if additional signals from new unexpected sources are present.

Case IIa: background + (known) signal. Assume that a model for the signal, $f_{s}(x,\bm{\theta}_{s})$ , is given, with $\bm{\theta}_{s}$ being a vector of unknown parameters. Since the CD plot in the right panel of Fig. 3 provides evidence in favor of the signal, we expect the data to be distributed according to the pdf

[TABLE]

where $\widehat{f}_{b}(x)$ is the calibrated background distribution in (35) and $\eta$ and $\bm{\theta}_{s}$ can be estimated via Maximum Likelihood (ML). Letting $\widehat{\eta}$ and $\widehat{\bm{\theta}}_{s}$ be the ML estimates of $\eta$ and $\bm{\theta}_{s}$ respectively, we specify

[TABLE]

as postulated model. For simplicity, let $f_{s}$ be fully specified as in (18); we construct the deviance test and the CD plot to assess if (38) deviates significantly from the true distribution of the data. The scheme in Section IV.3 has been implemented with $M_{\max}=20$ , and none of the values of $M$ considered led to significant results. The CD plot and deviance test for $M=4$ are reported in the upper left panel of Fig. 5. Both the large p-value of the deviance test (adjusted p-value $>1$ ) and the CD plot suggest that no significant deviations occur; thus, (38) is a reliable model for the physics sample.
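When the signal density is fully specified, the ML estimate of the signal fraction can be obtained with a simple fixed-point iteration. The sketch below assumes that the model being fitted is the two-component mixture $(1-\eta)\widehat{f}_{b}(x)+\eta f_{s}(x)$ and uses an EM update for the mixing weight, which is one convenient way to maximize the likelihood in $\eta$ and not necessarily the optimizer used by the author. The background and signal densities in the example are hypothetical stand-ins for (35) and (18).

```python
import math
import random

def fit_signal_fraction(sample, f_b, f_s, eta0=0.5, n_iter=200):
    """EM iterations for the weight eta in (1 - eta) * f_b + eta * f_s."""
    eta = eta0
    for _ in range(n_iter):
        # E-step: probability that each observation comes from the signal component.
        resp = [eta * f_s(x) / ((1.0 - eta) * f_b(x) + eta * f_s(x)) for x in sample]
        # M-step: the updated weight is the average responsibility.
        eta = sum(resp) / len(resp)
    return eta

# Hypothetical densities on [0, 1]: flat background and a narrow Gaussian signal.
def f_b(x):
    return 1.0

def f_s(x):
    return math.exp(-0.5 * ((x - 0.3) / 0.05) ** 2) / (0.05 * math.sqrt(2.0 * math.pi))

rng = random.Random(3)
true_eta = 0.15
sample = [rng.gauss(0.3, 0.05) if rng.random() < true_eta else rng.random()
          for _ in range(1300)]

eta_hat = fit_signal_fraction(sample, f_b, f_s)
print(eta_hat)
```

If the signal model also carried free parameters $\bm{\theta}_{s}$ , the M-step would additionally need to update them, for instance through a numerical optimizer.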

Moreover, we can use (38) to further refine our $\widehat{f}_{b}(x)$ or $f_{s}(x,\widehat{\bm{\theta}}_{s})$ distributions. Specifically, we first construct a semiparametric estimate of $d(G(x);G,F)$ , i.e.,

[TABLE]

and rewrite

[TABLE]

In the upper right panel of Fig. 5, the true comparison density (grey dashed line) of our physics sample is compared with its semiparametric estimate computed as in (39) (pink dashed line) with $f_{s}(x,\widehat{\bm{\theta}}_{s})=f_{s}(x)$ in (18). The graphs of two nonparametric estimates of $d(u;G,F)$ computed via (9) with $M=4$ and $M=9$ (blue dot-dashed line and black dotted line, respectively) are added to the same plot. Not surprisingly, incorporating the information available on the signal distribution drastically improves the accuracy of the analysis. The semiparametric estimate matches $d(u;G,F)$ almost exactly, whereas both nonparametric estimates show some discrepancies from the true comparison density. All the estimates suggest that there is only one prominent peak, located in the signal region.

When moving from the comparison density domain to the density domain in Fig. 4, the discrepancies between the nonparametric estimates and the true density $f(x)$ are substantially magnified. Specifically, when computing (9) and (11) with $M=4$ (blue dot-dashed line), the height of the signal peak is underestimated whereas, when choosing $M=9$ , the $\widehat{f}(x)$ exhibits high bias at the boundaries (dotted black line). (Footnote 5: Boundary bias is a common problem among nonparametric density estimation procedures [e.g., larry, Ch.5, Ch.8]. When aiming for a nonparametric estimate of the data density $f(x)$ , solutions exist to mitigate this problem [e.g., efromovich].)

Case IIb: background + (unknown) signal. When the signal distribution is unknown, the CD plot of $\widehat{d}(u;G,F)$ can be used to guide the scientist in navigating across the different theories on the astrophysical phenomenon under study and specify a suitable model for the signal, i.e., $f_{s}$ . The model proposed can then be validated, as in Case IIa, by fitting (38) and constructing deviance tests and CD plots.

At this stage, the scientist has the possibility to iteratively query the data and explore the distribution of the signal by assuming different models. A viable signal characterization is achieved when no significant deviations of $\widehat{d}(u;G_{bs},F)$ from one are observed (e.g., see upper left panel of Fig. 5). Notice that a similar approach can be followed also in the background calibration stage (Section V.1) to provide a parametric characterization of the background distribution.

Case III: background + (known) signal + unexpected source. The tools proposed so far can also be used to detect signals from unexpected sources whose pdfs are, by design, unknown.

Suppose that the physics sample $\bm{x}$ contains $n=1300$ observations whose true (unknown) pdf $f(x)$ is equal to $f_{bsh}(x)$

[TABLE]

where $f_{h}(x)$ is the pdf of the unexpected signal and assume its distribution to be normal with center at 37 and width 1.8. Let $f_{b}(x)$ and $f_{s}(x)$ be defined as in (16) and (18), respectively, and let $\eta_{1}=0.15$ and $\eta_{2}=0.1$ .

We can start with a nonparametric signal detection stage by setting $g(x)=\widehat{g}_{bs}(x)$ in (35), with $f_{s}$ defined as in (18) and $\widehat{\eta}$ estimated via MLE. The respective CD plot and deviance tests are reported in the bottom left panel of Fig. 5.

Choosing $M=9$ , as in (32), both the CD plot and deviance test indicate a significant departure from the expected background-only model, and a prominent peak is observed in correspondence with the signal of interest centered around 25. A second but weaker peak appears to be right on the edge of our confidence bands, suggesting the possibility of an additional source. At this stage, if $f_{s}$ were unknown, we could proceed with a semiparametric signal characterization as in Case IIb. Assuming instead that the distribution of the signal of interest is known and given by (18), we fit (38), aiming to capture a significant deviation in correspondence with the second bump. This is precisely what we observe in the bottom right panel of Fig. 5. Here the estimated comparison density deviates from (35) around 35, providing evidence in favor of an additional signal in this region. We can then proceed as in Case IIb by exploring the theories available and/or collecting more data to further investigate the nature and the cause of the unanticipated bump.

VI Signal detection without calibration sample and model selection

There are situations where a source-free sample is simply not available and thus the calibration phase in Section V.1 cannot be implemented. The tools described in Sections II and IV can, however, still be applied in order to perform signal detection and goodness-of-fit when a model for the signal is known, up to some free parameters. In this framework, we expect the data to either come only from the signal (with at most some negligible background contamination) or only from the background.

In order to illustrate how to proceed in this setting, we consider a dark matter search where the postulated model for dark matter $\gamma$ -ray emissions is the one of [bergstrom, Eq. 29], i.e.,

[TABLE]

with $y\in[0.5,5]$ teraelectronvolt (TeV), $M_{\chi}\in[0.5,5]$ TeV and $k_{M_{\chi}}$ a normalizing constant. The goal is to show that, when considering a background-only sample, the method proposed correctly rejects (42) as a suitable model for the data; whereas, when considering a dark matter sample, the dark matter model in (42) is “accepted”.

To further increase the complexity of the problem, we consider a situation where the background sample corresponds to $\gamma$ -ray emissions due to a pulsar, with distribution

[TABLE]

with $y_{0}=0.5$ , $y\in[0.5,5]$ TeV, $\tau>0$ and $k_{\tau}$ a normalizing constant. Notice that, as discussed in baltz , distinguishing $\gamma$ -ray emissions due to pulsars from those due to dark matter is a particularly challenging task. The histograms of the two datasets considered are shown in Figure 6; the overlapping curves correspond to the best fit of the models in (42) and (43) on each sample. Interestingly, for both samples, (42) and (43) provide a very similar fit to the data; hence the importance of correctly selecting the most adequate model or of excluding the dark matter hypothesis when observing emissions due to pulsars.

The upper panels of Figure 7 display the CD plots obtained by setting $g=g_{DM}$ in (42) as postulated model and comparing it with the distribution of the dark matter sample (upper left panel) and of the background pulsar sample (upper right panel). Remarkably, the CD plots and the adjusted deviance tests correctly lead to the conclusion that the distribution of the dark matter sample does not deviate significantly from (42), whereas the distribution of the pulsar sample does deviate substantially from (42) and the deviance test (adequately adjusted for post-selection inference) rejects the dark matter model with $3.897\sigma$ significance (adjusted p-value of $4.870\cdot 10^{-5}$ ). Notice that, in both cases, we are ignoring the information regarding the pulsar distribution and the only inputs considered are the data and the signal model in (42).

Finally, when incorporating the knowledge of the pulsar distribution in (43) into the analysis, one can select between the models in (42) and (43) by constructing additional CD plots and deviance tests for both samples and setting $g=g_{PS}$ in (43). The results are shown in the lower panels of Figure 7. As expected, the dark matter model is rejected (lower left panel) with $2.297\sigma$ significance (adjusted p-value of $0.0108$ ) whereas the pulsar model is “accepted” (lower right panel).

VII Background mismodelling due to instrumental noise and upper limits constructions

When conducting real data analyses one has to take into account that the data generating process is affected by both statistical and non-random uncertainty due to the instrumental noise. As a result, even when a model for the background is known, the data distribution may substantially deviate from it due to the smearing introduced by the detector [e.g., lyonsPHY]. In order to account for the instrumental error affecting the data, it is common practice to consider folded distributions where the errors due to the detector are often modelled assuming a normal distribution or estimated via non-parametric methods [e.g., PHY, PHY2]. In Section VII.1, it is shown how the same approach described in Sections V.1 and V.2.1 can be used to assess if the instrumental error is negligible and, when not, how to update the postulated background model in order to incorporate the instrumental noise. Section VII.2 discusses upper limits constructions by means of comparison distributions.

VII.1 Modelling the instrumental error

The data considered come from a simulated observation by the Fermi Large Area Telescope atwood with realistic representations of the effects of the detector and present backgrounds meJINST , meMNRAS . The Fermi-LAT is a pair-conversion $\gamma$ -ray telescope on board the earth-orbiting Fermi satellite. It measures energies and images $\gamma$ -rays between about 100 MeV and several TeV. The goal of the analysis is to assess if the data could result from the self-annihilation of a dark matter particle.

Let the distribution of the astrophysical background be a power-law, i.e.,

[TABLE]

where $k_{\phi}$ is a normalizing constant and $x\in[1,35]$ gigaelectronvolt (GeV). Equation (44) corresponds to the distribution we would expect the background to follow if there were no smearing by the detector. The left panel of Figure 8 shows the histogram of a source-free sample of 35,157 i.i.d. observations from a power-law distributed background source with index 2.4 (i.e., $\phi=1.4$ in (44)) and contaminated by instrumental errors of unknown distribution.

In order to assess if (44) is a suitable distribution for these data, we proceed by fitting (44) via maximum likelihood and setting it as postulated background distribution. The best fit of (44) is displayed on the left panel of Figure 8 as a black dashed line.

We proceed by estimating $d(G_{b}(x);G_{b},F_{b})$ and $f_{b}$ as in (9) and (19) respectively, with $M=4$ (chosen as in Section IV.3). The deviance test and CD plot are reported in the left panel of Figure 9 and suggest that significant departures from the fitted power-law model occur. This implies that the instrumental error is not negligible and thus, in order to account for it, we consider as “calibrated” background density the model in (45), i.e.,

[TABLE]

where $G_{b}(x)$ is the cdf of (44) and $\hat{\phi}=1.359$ is the ML estimate of $\phi$ in (44).

For the sake of comparison, the same analysis has been repeated considering $35,157$ i.i.d. observations from a power-law background source with index 2.4, without instrumental error. The respective CD plot and deviance test are shown on the right panel of Figure 9 and indicate that the power-law model in (44), with $\phi$ replaced by its MLE (i.e., $\widehat{\phi}=1.391$ ), provides a good fit for the data, i.e., the instrumental error is, in this case, absent or negligible.

VII.2 Signal detection and upper limit construction

Once a calibrated background distribution has been obtained, we proceed with the signal detection phase by setting $g(x)=\widehat{f}_{b}(x)$ in (45). Similarly to Section V.2.1, two physics samples are given: one containing 200 observations from the background source, distributed as in (44), and the other containing 200 observations from a dark matter emission. The signal distribution from which the data have been simulated is the pdf of $\gamma$ -ray dark matter energies in [bergstrom, Eq. 28] with $M_{\chi}=3.5$ . Both physics samples include the contamination due to the instrumental noise with unknown distribution. The respective histograms are shown in the right panel of Fig. 8.

The selection scheme in Section IV.3 suggests that no significant departure from (45) occurs on the background-only physics sample, whereas, for the signal sample, the strongest significance is observed at $M=3$ ; therefore, for the sake of comparison, we choose $M=3$ in both cases. The respective deviance tests and CD plots are reported in Fig. 10. As expected, the upper left panel of Fig. 10 shows a flat estimate of the comparison density on the background-only sample. Conversely, the upper right panel of Fig. 10 suggests that an extra bump is present over the $[2,3.5]$ region with $3.318\sigma$ significance (adjusted p-value = $4.552\cdot 10^{-4}$ ). As in (39), it is possible to proceed with the signal characterization stage (see Section V.2.2); however, in this setting, one has to account for the fact that also the signal distribution must include the smearing effect of the detector.

As an anonymous referee pointed out, it is important to discuss how upper limits and Brazil plots can be constructed via LP modelling and how they relate to the constructs discussed so far in this manuscript. Indeed, the confidence bands reported in the CD plots are themselves upper limits. Specifically, in the signal detection framework of Section V.2.1, the confidence bands in (29) are constructed assuming that there is no signal in the data. That is, they correspond to the regions where the comparison density estimator is expected to lie, at $1-\alpha$ confidence level, if the data include background-only events. Conversely, any deviation from the confidence bands characterizes the quantiles of the distribution where the data distribution does not conform with the one postulated under the assumption that no signal is present.

When the interest is in identifying areas of the search region where deviations from the background model occur, one can exploit the fact that $u=G(x)$ , and thus upper limits and classical “Brazil plots” based on the comparison density can be obtained by plotting (20) and the respective confidence bands in (29) as a function of $x$ . This is shown, for our Fermi-LAT example, in the bottom panels of Figure 10. Indeed, the upper and bottom panels in Figure 10 carry essentially the same information in two different domains. Specifically, the CD plots display the departure of $f$ from $g$ in the quantile domain whereas the Brazil plots show the same differences in the frequency domain. For signal detection purposes, the bottom panels may be preferred to identify the location where substantial deviations between the background and signal model occur. The CD plots, instead, are more suitable for goodness-of-fit purposes as they provide a simultaneous visualization of the differences occurring at each quantile of the distribution.

VIII Model-denoising

As discussed in Section II.4, the choice of $M$ affects the resulting estimator of $d(u;G,F)$ in terms of both bias and variance. When dealing with complex background distributions, a large value of $M$ may be necessary to reduce the bias of the estimated comparison density. At the same time, however, a large value of $M$ leads to an inflation of the variance. In other words, considering a basis of $M$ shifted Legendre polynomials may lead to overfitting.

Practically speaking, overfitting leads to wiggly (i.e., non-smooth) estimates and thus one may overcome this limitation by attempting to denoise the estimator in (9). Section VIII.1 reviews the model-denoising approach proposed by LPapproach , LPmode , whereas Section VIII.2 briefly discusses inference and model selection in this setting. Finally, Section VIII.3 compares the results obtained with a full and a denoised solution on the examples of Section V.

VIII.1 AIC denoising

Let $\widehat{LP}_{1},\dots,\widehat{LP}_{M}$ be the estimates of the first $M$ coefficients of the expansion in (4). The most “significant” ${LP}_{j}$ coefficients are selected by sorting the respective $\widehat{LP}_{j}$ estimates so that

$$\widehat{LP}^{2}_{(1)}\geq\widehat{LP}^{2}_{(2)}\geq\dots\geq\widehat{LP}^{2}_{(M)}\qquad(45)$$

and choosing the value $k=1,\dots,M$ for which $AIC(k)$ in (46) is maximum

[TABLE]

The AIC-denoised estimator of $d(u;G,F)$ is given by

$$\widehat{d}^{*}(u;G,F)=1+\sum_{j=1}^{k^{*}_{M}}\widehat{LP}_{(j)}Leg_{(j)}(u)\qquad(47)$$

where $\widehat{LP}_{(j)}$ is the estimate whose square is the $j^{\text{th}}$ largest among $\widehat{LP}^{2}_{1},\dots,\widehat{LP}^{2}_{M}$ , $Leg_{(j)}(u)$ is the respective shifted Legendre polynomial and

$$k^{*}_{M}=\underset{k}{\mathrm{argmax}}\;AIC(k).\qquad(48)$$

Practical remarks. Recall that the first $M$ coefficients $LP_{j}$ can be expressed as a linear combination of the first $M$ moments of $U$ . Thus, the AIC-denoising approach selects the $LP_{j}$ coefficients which carry all the “sufficient” information on the first $M$ moments of the distribution.
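The selection step of this section can be sketched as follows. This is only an illustration under stated assumptions: the penalty $AIC(k)=\sum_{j=1}^{k}\widehat{LP}^{2}_{(j)}-2k/n$ is an assumed form of the criterion, the convention that $k=0$ (no terms retained) scores zero is added so that a fully sparse solution is possible, and `aic_denoise` is a hypothetical name rather than a function of the LPBkg packages.

```python
def aic_denoise(lp_hat, n):
    """Return (k_star, denoised coefficients) for lp_hat = [LP_1, ..., LP_M]."""
    # Rank coefficient indices by decreasing squared magnitude.
    order = sorted(range(len(lp_hat)), key=lambda j: lp_hat[j] ** 2, reverse=True)
    best_k, best_aic, running = 0, 0.0, 0.0  # k = 0 (no terms) scores 0
    for k, j in enumerate(order, start=1):
        running += lp_hat[j] ** 2
        aic = running - 2.0 * k / n  # ASSUMED penalty of 2/n per retained term
        if aic > best_aic:
            best_k, best_aic = k, aic
    keep = set(order[:best_k])
    # Coefficients that are not selected are set to zero.
    denoised = [c if j in keep else 0.0 for j, c in enumerate(lp_hat)]
    return best_k, denoised

# Hypothetical estimates from n = 1000 observations.
k_star, lp_star = aic_denoise([0.30, 0.01, -0.20, 0.02], n=1000)
# Estimates compatible with noise: nothing is retained.
k_null, lp_null = aic_denoise([0.01, -0.02], n=1000)
```

Under this assumed penalty a sorted coefficient increases the criterion only if its square exceeds $2/n$, so in the first example the two large coefficients are kept and the two small ones are zeroed.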

VIII.2 Inference after denoising

The deviance test can be used, as in Section IV.3, to choose the size of the initial basis of $M$ polynomials among $M_{\max}$ possible models. Finally, the $k^{\star}_{M}$ largest coefficients are chosen by maximizing (46). This two-step procedure selects $\widehat{d}^{*}(u;G,F)$ in (47) from a pool of $M_{tot}=M_{\max}+\frac{M(M-1)}{2}$ possible estimators. Therefore, the Bonferroni-adjusted p-value of the deviance test is given by

[TABLE]

with $d_{k_{M}^{*}}=\sum_{j=1}^{k_{M}^{*}}\widehat{LP}^{2}_{(j)}$. Similarly, confidence bands can be constructed as

[TABLE]

where $c_{\alpha,M_{tot}}$ is the solution of

[TABLE]
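As a rough illustration of the adjustment for the two-step selection, the sketch below multiplies an unadjusted p-value by the pool size $M_{tot}$ stated above and caps the result at one. The cap and the value of $M_{\max}$ used in the example are assumptions made here for illustration; the unadjusted p-value and $M=18$ are those quoted for Case I in Section VIII.3.

```python
def bonferroni_adjusted_p(p_unadjusted, m_selected, m_max):
    """Bonferroni correction over the pool of candidate estimators."""
    # Pool size as stated in the text: M_tot = M_max + M (M - 1) / 2.
    m_tot = m_max + m_selected * (m_selected - 1) // 2
    # Capping at 1 is a common convention, assumed here.
    return min(1.0, m_tot * p_unadjusted), m_tot

# M_max = 20 is a hypothetical value chosen only for this example.
p_adj, m_tot = bonferroni_adjusted_p(5.096e-4, m_selected=18, m_max=20)
```

With these inputs the adjusted p-value is close to 0.09, so an unadjusted p-value that looks significant at the 0.001 level no longer does once the selection is accounted for.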

Practical remarks. Given the possibility of denoising our solution, one may legitimately wonder why not consider a large value of $M_{\max}$, e.g., $M_{\max}=100$, and then select $k^{\star}_{M_{\max}}$ directly. In other words, why should we first implement the procedure in Section IV.3 and, only afterwards, refine our estimator as in Section VIII.1, and not vice versa? There are two main reasons why such an approach is discouraged.

First of all, one has to take into account that, when ignoring the selection stage proposed in Section IV.3, there is no guarantee that the resulting $k^{\star}_{M_{\max}}$ would include all the $\widehat{LP}_{j}$ terms that provide the strongest evidence in favor of $H_{1}$ in (22). Therefore, the resulting p-value can in principle be lower than the one in (49). Indeed, the AIC criterion in (46) aims to improve the fit of the estimator to the data, whereas the deviance selection criteria in (32) aim to maximize the power of the inferential procedure.

Second, choosing $M_{\max}=100$ is computationally infeasible with most standard programming languages, such as R and Python, and the numerical computation of (9) may easily lead to divergent or inaccurate results.

VIII.3 Comparing full and denoised solution

Fig. 11 compares the fit of the estimators $\widehat{d}(u;G,F)$ and $\widehat{d}^{*}(u;G,F)$ for the examples in Section V. For all the cases considered, $M$ and $k^{*}_{M}$ have been selected as in (32) and (48) (see second column of Table 2). When no significance was achieved for any of the values of $M$ considered, a small basis of $M=3$ or $M=4$ polynomials was chosen for the full estimator $\widehat{d}(u;G,F)$ , which was then further denoised in order to obtain $\widehat{d}^{*}(u;G,F)$ . Table 2 shows the results of the deviance tests of the full and the denoised solution for the examples in Section V. The unadjusted p-values and the Bonferroni-adjusted p-values are reported in the second and third columns, respectively. In half of the cases, $k^{*}_{M}=M$ and the estimators $\widehat{d}(u;G,F)$ and $\widehat{d}^{*}(u;G,F)$ overlap over the entire range $[0,1]$ . The inferential results were also approximately equivalent in the majority of the situations considered.

The main differences are observed in the analysis of the background-only physics sample (Case I). In this case, the deviance-selection procedure leads to non-significant results for all the values of $M$ considered; the minimum p-value is observed at $M=18$ (unadjusted p-value = $0.2657$ ). In this setting, the denoising process leads to $k^{*}_{18}=2$ and the respective unadjusted p-value is $5.096\cdot 10^{-4}$ . This further emphasizes the importance of adjusting for model selection in order to avoid false discoveries. For modelling purposes and for the sake of comparison with the case where a signal is present, a basis of $M=4$ was selected. Since the true distribution of the data is the same as the postulated one, the denoising process sets all the coefficients equal to zero ( $k^{*}_{M}=0$ ).

For Case III, only $k^{*}_{M}=6$ out of $M=9$ coefficients are selected when denoising (see Table 2). Although the right panel of Fig. 11 shows that the full and the denoised solution are almost overlapping, the latter leads to an increased sensitivity (adjusted p-value = $2.496\cdot 10^{-28}$) compared to the full solution (adjusted p-value = $5.181\cdot 10^{-27}$).

These results suggest that the denoising approach can easily adapt to situations where a sparse solution is preferable (i.e., when only few of the $M$ coefficients $LP_{j}$ are non-zero) without enforcing sparsity when many of the $M$ coefficients considered are needed to adequately fit the data (e.g., bottom right panel of Fig. 11). From an inferential perspective, denoising can improve the sensitivity of the analysis; however, in order to avoid false discoveries, extra care needs to be taken when the deviance selection procedure leads to large p-values for all the $M_{\max}$ models considered.

IX An application to stacking experiments

In radio astronomical surveys, stacking techniques are often used to combine noisy images or “stacks” in order to increase the signal-to-noise ratio and improve the sensitivity of the analysis in detecting faint sources [e.g., lawrence, white, jeroen]. In polarized signal searches, for instance, a faint population of sources is considered when the median polarized intensity observed over control regions differs significantly from the median of the region where the sources are expected to be present. In this context, under simplifying assumptions, the distribution of the intensity of the source polarization is often assumed to have a Rice distribution, i.e.,

[TABLE]

where $\text{Bessel}(\cdot)$ denotes the Bessel function of the first kind of order zero and $k_{\nu\sigma^{2}}$ is a normalizing constant. Furthermore, (52) reduces to a Rayleigh pdf when no signal is present simmons, i.e., when $\nu=0$. Below, it is shown how the methods described in Sections V.1 and V.2.1 can be used to assess whether the Rayleigh distribution is a reliable model for the background and, when it is too simplistic, to investigate the impact of incorrectly assuming a Rayleigh distribution on the reliability of the analysis.

The data considered comes from the NRAO VLA Sky Survey (NVSS) NVSS. The NVSS is an astronomical survey of the Northern hemisphere carried out by the Very Large Array of the National Radio Astronomy Observatory. The NVSS has detected 1.8 million sources in total intensity, but only $14\%$ of these have reported a polarized signal peak greater than $3\sigma$ jeroen. The original source-free sample contained $29,915$ observations collected from four different control regions for each source with a brightness in total intensity between 0 and 0.0093 Jy/beam (see upper panel of Figure 12). However, such a sample appears to contain several outliers which affect the data distribution, making it far from Rayleigh. A better Rayleigh fit is obtained when removing the outliers (see bottom left panel of Figure 12). Here, as is customary in statistics, an observation $x_{i}$ is considered an outlier if $x_{i}<Q_{0.25}-1.5[Q_{0.75}-Q_{0.25}]$ or $x_{i}>Q_{0.75}+1.5[Q_{0.75}-Q_{0.25}]$, where $Q_{0.25}$ and $Q_{0.75}$ are the first and the third sample quartiles. Since understanding the cause of these anomalous observations is beyond the scope of this manuscript, we proceed by excluding them from the analysis and we focus on assessing the validity of the Rayleigh assumption on the remaining $28,739$ observations on the region $[0,0.0009]$ Jy/beam.

It has to be noted that the nominal noise in NVSS polarization is 0.00029 Jy/beam, and a reasonable threshold for the detection of one individual source may be expected to be three times the noise. Hence, a source sample of $6,220$ observations has been selected from positions where compact radio sources with a brightness in total intensity between 0 and 0.0009 Jy/beam are known to be present. Both source-free and source samples are assumed to be i.i.d. The histograms of the source-free and signal samples considered are shown in the right panel of Fig. 12.
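The outlier rule used above can be written in a few lines. The sketch below is illustrative only: the quartile convention (`method="inclusive"`) is an assumption, since different software packages compute sample quartiles slightly differently, and the toy data are made up.

```python
from statistics import quantiles

def remove_iqr_outliers(values):
    """Drop points outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR]."""
    # The quartile convention is an assumption; other conventions shift the fences slightly.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lower <= v <= upper]

# Toy data with one anomalous observation.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
cleaned = remove_iqr_outliers(sample)
```

For the toy data the quartiles are 3 and 7, the fences are $-3$ and $13$, and only the value 100 is discarded.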

As a first step, we fit a Rayleigh distribution (adequately truncated over the range $[0,0.0009]$) on the source-free sample, i.e.,

[TABLE]

where $k_{\widehat{\sigma}^{2}}$ is a normalizing constant, $\widehat{\sigma}=0.0003$ is the ML estimate of the unknown parameter $\sigma$, and $x\in[0,0.0009]$ Jy/beam. In order to assess if (53) provides a good fit for the data, we estimate the comparison density $d(G_{b}(x);G_{b},F_{b})$ by, first, selecting $M$ as in (32) and then applying the AIC-based denoising approach described in Section VIII.1. In this case, the denoised solution selects $k^{*}=9$ out of $M=10$ polynomial terms. The deviance tests and the CD plot in Fig. 13 suggest that, despite the fact that the median of the data coincides with the one of the Rayleigh model, overall, the latter does not provide a good fit for the distribution of the source-free sample. Specifically, the data distribution shows a higher right tail than the one expected under the Rayleigh assumption, whereas the first quantiles are overestimated by the Rayleigh. Therefore, the researcher can either decide to use a more refined parametric model for the background or consider the calibrated background distribution of the form in (19), which in our setting is specified as

[TABLE]

where $G_{b}(x)$ is the cdf of (53).
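The maximum likelihood fit of a truncated Rayleigh model such as (53) can be sketched as below. This is not the procedure used to obtain $\widehat{\sigma}$ in the paper: it assumes the pdf $\frac{x}{\sigma^{2}}e^{-x^{2}/(2\sigma^{2})}/\bigl(1-e^{-b^{2}/(2\sigma^{2})}\bigr)$ on $[0,b]$, uses a simple golden-section search, and is run on simulated data (not the NVSS sample). All function names are hypothetical.

```python
import math
import random

def truncated_rayleigh_nll(sigma, sum_sq, n, upper):
    """Negative log-likelihood, up to a term that does not depend on sigma."""
    s2 = sigma * sigma
    mass = -math.expm1(-upper * upper / (2.0 * s2))  # 1 - exp(-b^2 / (2 sigma^2))
    return 2.0 * n * math.log(sigma) + sum_sq / (2.0 * s2) + n * math.log(mass)

def fit_truncated_rayleigh(data, upper, iterations=100):
    """Golden-section search for the ML estimate of sigma on [upper/100, 10*upper]."""
    n, sum_sq = len(data), sum(x * x for x in data)
    nll = lambda s: truncated_rayleigh_nll(s, sum_sq, n, upper)
    lo, hi = upper / 100.0, 10.0 * upper
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iterations):
        a = hi - ratio * (hi - lo)
        b = lo + ratio * (hi - lo)
        if nll(a) < nll(b):
            hi = b  # the minimum lies in [lo, b]
        else:
            lo = a  # the minimum lies in [a, hi]
    return 0.5 * (lo + hi)

def sample_truncated_rayleigh(sigma, upper, size, rng):
    """Inverse-cdf sampling from the Rayleigh(sigma) truncated to [0, upper]."""
    mass = -math.expm1(-upper * upper / (2.0 * sigma * sigma))
    return [sigma * math.sqrt(-2.0 * math.log1p(-rng.random() * mass))
            for _ in range(size)]

# Simulated stand-in for a source-free sample (sigma and range mimic the text).
rng = random.Random(0)
simulated = sample_truncated_rayleigh(0.0003, 0.0009, 20000, rng)
sigma_hat = fit_truncated_rayleigh(simulated, upper=0.0009)
```

On simulated data of this size the estimate should land within a few percent of the value used to generate the sample.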

The strategy described in Section V.2.1 allows us to identify where significant differences between the control and source sample occur. In order to assess the effect of incorrectly assuming a Rayleigh background, we compare the distribution of the physics sample with both the Rayleigh and the calibrated background distribution in (54). Figure 14 reports deviance tests and CD plots obtained on the physics sample when setting $g(x)=g_{b}(x)$ in (53) (left panel) and $g(x)=\widehat{f}_{b}(x)$ in (54) (right panel). Both analyses provide strong evidence that the distribution of the physics sample differs significantly from the postulated models $g_{b}(x)$ and $\widehat{f}_{b}(x)$, and the most substantial discrepancies occur on the right tail of the distribution. However, since the Rayleigh model underestimates the right tail of the background distribution (see Fig. 13), it leads to an artificially enhanced sensitivity in this region. The differences between the two CD plots are less prominent around the median expected under $g_{b}(x)$ and $\widehat{f}_{b}(x)$ (i.e., at $u=0.5$ in both plots).

Fig. 14 suggests that, for these data, assuming a background Rayleigh distribution would not substantially affect the results of a comparison between the source-free and signal sample based on the median. However, focusing solely on the median can strongly limit the overall sensitivity of the analysis since the major differences occur at the higher quantiles of the distribution. On the other hand, assuming a Rayleigh distribution for the background would artificially inflate the evidence in favor of the source. Specifically, the sigma significance of the deviance test obtained under the Rayleigh background assumption is $23.655\sigma$ (adjusted p-value = $5.178\cdot 10^{-124}$), whereas the one obtained using (54) is $20.225\sigma$ (adjusted p-value = $2.948\cdot 10^{-91}$).
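The correspondence between the adjusted p-values and the sigma significances quoted above can be checked with the standard library. The sketch assumes the usual one-sided convention, in which the significance is the standard normal quantile $z$ such that $P(Z>z)=p$; this convention is an assumption here, although it reproduces the quoted values.

```python
from statistics import NormalDist

def sigma_significance(p_value):
    """One-sided Gaussian significance z such that P(Z > z) = p_value."""
    # By symmetry, the upper-tail quantile is minus the lower-tail quantile.
    return -NormalDist().inv_cdf(p_value)

z_rayleigh = sigma_significance(5.178e-124)   # Rayleigh background
z_calibrated = sigma_significance(2.948e-91)  # calibrated background in (54)
```

The two results agree with $23.655\sigma$ and $20.225\sigma$ up to rounding of the reported p-values.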

Conversely, the calibrated background model in (54) allows us to safely compare the entire distribution of the polarized intensity in the source and control regions via CD plots and deviance tests without affecting the sensitivity of the analysis.

X Discussion

This article proposes a unified framework for signal detection and characterization under background mismodelling. From a methodological perspective, the methods presented here extend LP modelling to the inferential setting.

The solution discussed is articulated in two main phases: a calibration phase where the background model is “trained” on a source-free sample and a signal search phase conducted on the physics sample collected by the experiment. If a model for the signal is given, the method proposed allows the identification of hidden signals from new unexpected sources and/or the refining of the postulated background or signal distributions. Furthermore, the tools presented in this manuscript can be easily extended to situations where a source-free sample is not available and the background is unknown (up to some free parameters). As discussed in Section VI, however, in this setting the signal distribution is required to be known, and the physics sample is expected to contain only signal-like events, i.e., the background is almost completely reduced.

The theory of Section II.4 and the analyses in Section V have highlighted that, although a fully nonparametric approach provides reliable inference, it may lead to unsatisfactory estimates when the postulated pdf $g$ is substantially different from the true density $f$. In this setting, a semiparametric stage can be performed in order to provide a reliable model for the data.

Each individual step in both the nonparametric and the semiparametric stage of Sections V.2.1 and V.2.2 provides useful scientific insights on the signal and background distribution. Hence, an automated implementation of the steps of Algorithm 1 based solely on the p-values of the deviance tests is discouraged, as it would lead to a substantial loss of scientific knowledge on the phenomena under study.

Finally, it is important to point out that, despite this article’s focus on one-dimensional searches on continuous data, all the constructs presented in Section II and the deviance test in Section IV.1 also apply to the discrete case when considering i.i.d. events. More work is needed to extend these results and those of Section IV.2 to searches in multiple dimensions and to Poisson events with functional mean. In the first case, the difficulty mainly lies in generalizing the constructs of Section IV to account for the dependence structure occurring across multiple dimensions. In the second case, the main challenge lies in identifying the equivalent of (1) to model the mean of the distribution, while incorporating the Poisson error.

Code availability

The LPBkg Python package python and the LPBkg R package rr allow the implementation of the methods proposed in this manuscript. Detailed tutorials on how to use the functions provided are also available at http://salgeri.umn.edu/my-research.

Acknowledgments

The author thanks Jeroen Stil, who provided the NVSS datasets used in Section IX, and Lawrence Rudnick, who first recognized the usefulness of the method proposed in the context of stacking experiments. Conversations with Subhadeep Mukhopadhyay have been of great help when this work was first conceptualized. Discussions and e-mail exchanges with Charles Doss and Chad Shafer are gratefully acknowledged. Finally, the author thanks an anonymous referee whose feedback has been substantial to improve the overall quality of the paper.

Appendix A Moments of the $\widehat{LP}_{j}$ estimates

Consider the general setting where $f\not\equiv g$ and thus $d(u;G,F)\neq 1$ over $[0,1]$ . It follows that each $u_{i}$ is independently and identically distributed with pdf $d(u;G,F)$ ; hence, all the expectations in $E[\widehat{LP}_{j}],V(\widehat{LP}_{j})$ and $Cov(\widehat{LP}_{j},\widehat{LP}_{k})$ are taken with respect to $d(u;G,F)$ . Specifically,

[TABLE]

where the second equality follows by the fact that each observed value $u_{i}$ is a realization of a random variable $U_{i}$ and each $U_{1},\dots,U_{n}$ is identically distributed as the random variable $U$ , whose pdf is given by the comparison density $d(u;G,F)$ . Notice that $d(u;G,F)=1$ implies that $\int_{0}^{1}Leg_{j}(u)d(u;G,F)\partial{u}=\int_{0}^{1}Leg_{j}(u)\partial{u}=0$ , from which the first equivalence in (8) follows. Moreover,

[TABLE]

where $V\bigl(Leg_{j}(U)\bigr)=\int_{0}^{1}(Leg_{j}(u)-LP_{j})^{2}d(u;G,F)\partial{u}=\sigma^{2}_{j}$. The second equality holds because of independence and identical distribution of each $u_{i}$. Notice that if $d(u;G,F)=1$, $\sigma^{2}_{j}=1$ by virtue of the orthonormality of the $Leg_{j}(u)$ polynomials. Hence the second equivalence in (8) holds. Finally,

[TABLE]

also in this case, the second equality follows by independence and identical distribution of each $u_{i}$ and

[TABLE]

Because of the orthogonality of the $Leg_{j}(u)$, $\sigma_{jk}=0$ when $d(u;G,F)=1$. Hence the third equivalence in (8) holds.
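The null-case moments discussed in this appendix can be checked by simulation. The sketch below assumes that the estimates take the sample-mean form $\widehat{LP}_{j}=\frac{1}{n}\sum_{i}Leg_{j}(u_{i})$ and draws the $u_{i}$ from a uniform distribution, i.e., the case $d(u;G,F)=1$; the sample sizes, seed, and function names are arbitrary choices made for illustration.

```python
import math
import random

def shifted_legendre(j, u):
    """Orthonormal shifted Legendre polynomial Leg_j(u) on [0, 1]."""
    if j == 0:
        return 1.0
    x = 2.0 * u - 1.0
    p_prev, p_curr = 1.0, x
    for k in range(1, j):
        # Bonnet recurrence: (k+1) P_{k+1} = (2k+1) x P_k - k P_{k-1}
        p_prev, p_curr = p_curr, ((2 * k + 1) * x * p_curr - k * p_prev) / (k + 1)
    return math.sqrt(2 * j + 1) * p_curr

def lp_estimates(u_values, m):
    """Assumed estimator: LP_hat_j = average of Leg_j(u_i), j = 1, ..., m."""
    n = len(u_values)
    return [sum(shifted_legendre(j, u) for u in u_values) / n for j in range(1, m + 1)]

# Monte Carlo under d(u; G, F) = 1, i.e. u_i uniform on [0, 1].
random.seed(1)
n, reps, m = 200, 2000, 3
draws = [lp_estimates([random.random() for _ in range(n)], m) for _ in range(reps)]
means = [sum(d[j] for d in draws) / reps for j in range(m)]
# n * Var(LP_hat_j) should be close to sigma_j^2 = 1 in the null case.
scaled_vars = [n * sum((d[j] - means[j]) ** 2 for d in draws) / (reps - 1)
               for j in range(m)]
```

In this setting the simulated means should be close to zero and the variances, once multiplied by $n$, close to one, in line with the first two equivalences in (8).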

Appendix B Bias, variance and MISE of $\widehat{d}(u;G,F)$

Given a point $u$ over $[0,1]$ , the bias of (9) at $u$ is

[TABLE]

where (57) follows from (4) and (9), whereas the integrated squared bias is

[TABLE]

where (62) holds because of orthonormality of the $Leg_{j}(u)$ polynomials. Notice that

[TABLE]

where (64) follows by Parseval’s identity whereas (65) follows from (2).

The variance of (9) at a given point $u$ is given by

[TABLE]

By orthonormality of the polynomials $Leg_{j}(u)$ , the integral of (69) over $[0,1]$ is

[TABLE]

also in this case, equality follows by orthonormality of the $Leg_{j}(u)$ . Finally, the MISE is

[TABLE]

where (72) holds because of the Fubini-Tonelli theorem, whereas the last equality follows by (62) and (70).
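For orientation, the decomposition derived in this appendix can be summarized in one line. The expression below is a sketch assembled from the steps described above, assuming $E[\widehat{LP}_{j}]=LP_{j}$ and $V(\widehat{LP}_{j})=\sigma^{2}_{j}/n$ as in Appendix A; it is not a verbatim copy of the numbered equations.

```latex
\mathrm{MISE}\bigl(\widehat{d}\,\bigr)
  = \underbrace{\frac{1}{n}\sum_{j=1}^{M}\sigma^{2}_{j}}_{\text{integrated variance}}
  \;+\;
  \underbrace{\sum_{j>M} LP_{j}^{2}}_{\text{integrated squared bias}}
```

The first term grows with $M$ while the second shrinks, which is the bias-variance trade-off that motivates the choice of $M$ and the denoising step.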

Bibliography

[1] Aprile, E., et al. Physical Review Letters 119.18 (2017): 181301. URL: https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.119.181301
[2] Agnese, R., et al. Physical Review D 99.6 (2019): 062001.
[3] Smith, R. and Thrane, E. Physical Review X, no. 2 (2018): 021019.
[4] Sirunyan, A.M., et al. Physics Letters B 793 (2019): 320-347.
[5] Yellin, S. Physical Review D 66.3 (2002): 032005.
[6] Priel, N., et al. Journal of Cosmology and Astroparticle Physics 2017.05 (2017): 013.
[7] Dauncey, P. D., et al. Journal of Instrumentation 10.04 (2015): P04015.
[8] Mukhopadhyay, S., and Parzen, E. arXiv preprint arXiv:1405.2601 (2014).
