Portraying Double Higgs at the Large Hadron Collider

Jeong Han Kim; Minho Kim; Kyoungchul Kong; Konstantin T. Matchev; Myeonghun Park

arXiv:1904.08549·hep-ph·September 19, 2019


TL;DR

This paper proposes a deep learning approach to enhance the detection of double Higgs production at the LHC, significantly improving signal sensitivity in a challenging final state with two b-jets, two leptons, and missing energy.

Contribution

It introduces a novel deep learning framework that leverages full kinematic information and jet images to improve double Higgs signal detection at the LHC.

Findings

01 Substantial increase in signal sensitivity over existing methods

02 Deep learning effectively captures correlations among input variables

03 Method adaptable to other processes with similar final states

Abstract

We examine the discovery potential for double Higgs production at the high luminosity LHC in the final state with two $b$-tagged jets, two leptons and missing transverse momentum. Although this dilepton final state has been considered a difficult channel due to the large backgrounds, we argue that it is possible to obtain sizable signal significance, by adopting a deep learning framework making full use of the relevant kinematics along with the jet images from the Higgs decay. For the relevant number of signal events we obtain a substantial increase in signal sensitivity over existing analyses. We discuss relative improvements at each stage and the correlations among the different input variables for the neural network. The proposed method can be easily generalized to the semi-leptonic channel of double Higgs production, as well as to other processes with similar final states.

Tables

Table 1: Signal and background cross sections in fb after baseline cuts (first row) and at different stages of analysis, using a combination of kinematic variables and jet images while requiring $N=20$ signal events. The significance $\sigma$ is calculated using the log-likelihood ratio for a luminosity of 3 ab$^{-1}$ at the 14 TeV LHC.

Selection	Signal	$t\bar{t}$	$t\bar{t}h$	$t\bar{t}V$	$\ell\ell bj$	$\tau\tau bb$	$tW+j$	$jj\ell\ell\nu\nu$	$\sigma$	$S/B$
Baseline cuts: ${\;/\!\!\!\!{P}_{T}}>20$ GeV, $p_{T\ell}>20$ GeV, $\Delta R_{\ell\ell}<1.0$, $p_{Tb}>30$ GeV, $\Delta R_{bb}<1.3$, $m_{\ell\ell}<65$ GeV, $95<m_{bb}<140$ GeV	0.01046	1.8855	0.0269	0.0179	0.0697	0.0250	0.2209	0.0113	0.38	0.0046
jet-image DL	0.00667	0.1855	0.0147	0.00731	0.0243	0.0128	0.0626	0.00786	0.65	0.021
10 low-level variables DL	0.00668	0.0738	0.0132	0.00529	0.0184	0.00842	0.0424	0.00516	0.89	0.040
16 variables DL	0.00668	0.0676	0.0109	0.00454	0.0163	0.00689	0.0376	0.00418	0.94	0.045
10 variables + jet-image DL	0.00667	0.0630	0.00964	0.00429	0.0194	0.00791	0.0343	0.00393	0.96	0.047
16 variables + jet-image DL	0.00668	0.0602	0.00914	0.00252	0.0133	0.00689	0.0299	0.00344	1.0	0.053
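The significance and $S/B$ columns of Table 1 can be cross-checked from the quoted cross sections. The following is a minimal sketch, assuming the Poisson log-likelihood ratio mentioned in the caption, which for $n=S+B$ observed events reduces to $\sqrt{2[(S+B)\ln(1+S/B)-S]}$. The function and variable names are illustrative and not taken from the paper.

```python
import math

LUMI_FB = 3000.0  # 3 ab^-1 expressed in fb^-1


def discovery_significance(s, b):
    """sqrt(-2 ln[L(B|S+B) / L(S+B|S+B)]) with L(x|n) = x^n e^{-x} / n!."""
    n = s + b
    return math.sqrt(2.0 * (n * math.log(n / b) - s))


# Cross sections in fb, copied from the "16 variables + jet-image DL" row.
signal_fb = 0.00668
background_fb = [0.0602, 0.00914, 0.00252, 0.0133, 0.00689, 0.0299, 0.00344]

S = signal_fb * LUMI_FB
B = sum(background_fb) * LUMI_FB
s_over_b = S / B
significance = discovery_significance(S, B)

print(f"S = {S:.1f}, B = {B:.1f}, S/B = {s_over_b:.3f}, sigma = {significance:.2f}")
```

With these inputs the sketch gives values close to the tabulated $S/B=0.053$ and $\sigma=1.0$; small differences are expected from rounding in the table.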

Equations

$$V=\frac{m_{h}^{2}}{2}h^{2}+\kappa_{3}\lambda_{3}^{SM}vh^{3}+\frac{1}{4}\kappa_{4}\lambda_{4}^{SM}h^{4},$$

$$\lambda_{3}^{SM}=\lambda_{4}^{SM}=\frac{m_{h}^{2}}{2v^{2}}$$

$$\Delta R_{ij}=\sqrt{(\Delta\phi_{ij})^{2}+(\Delta\eta_{ij})^{2}},$$

$${\;/\!\!\!\!\vec{P}_{T}}=-\Big(\sum\vec{p}_{T\ell}+\sum\vec{p}_{T\gamma}+\sum\vec{p}_{Tj}+\sum\vec{p}_{T(\text{track})}\Big).$$

$$\chi_{ij}^{2}$$

$$T$$

$$H$$

$$m_{W^{*}}^{peak}=\frac{1}{\sqrt{3}}\sqrt{2\left(m_{h}^{2}+m_{W}^{2}\right)-\sqrt{m_{h}^{4}+14m_{h}^{2}m_{W}^{2}+m_{W}^{4}}}\,.$$

$$\frac{d\sigma}{dm_{\nu\bar{\nu}}}\propto\int dm_{W^{*}}^{2}\,\lambda^{1/2}(m_{h}^{2},m_{W}^{2},m_{W^{*}}^{2})\,f(m_{\nu\bar{\nu}}),$$

$$f(m)\sim\left\{\begin{array}{ll}\eta\,m\,,&0\leq m\leq e^{-\eta}E,\\ m\ln(E/m)\,,&e^{-\eta}E\leq m\leq E,\end{array}\right.$$

$$E$$

$$\cosh\eta$$

$$\hat{s}_{min}^{({\rm v})}=m_{\rm v}^{2}+2\left(\sqrt{|\vec{P}_{T}^{\rm v}|^{2}+m_{\rm v}^{2}}\;|{\;/\!\!\!\!\vec{P}_{T}}|-\vec{P}_{T}^{\rm v}\cdot{\;/\!\!\!\!\vec{P}_{T}}\right),$$

$$M_{T2}(\tilde{m})\equiv\min\left\{\max\left[M_{TP_{1}}(\vec{p}_{T\nu},\tilde{m}),\;M_{TP_{2}}(\vec{p}_{T\bar{\nu}},\tilde{m})\right]\right\},$$

$$\sigma_{dis}\equiv\sqrt{-2\,\ln\bigg(\frac{L(B|S\!+\!B)}{L(S\!+\!B|S\!+\!B)}\bigg)}\;\;\;\;\;\text{with}\;\;\;L(x|n)=\frac{x^{n}}{n!}e^{-x}\,,$$

$$\Delta\chi^{2}=\left(\frac{\Big(S(\kappa_{3})+B\Big)-\Big(S(\kappa_{3}=1)+B\Big)}{\sqrt{S(\kappa_{3}=1)+B}}\right)^{2}\,.$$

$$\sigma_{gg\to hh}(\hat{s})$$

$$O[i]=\sum_{j=0}^{n_{I}-1}\big(W[i,j]\,I[j]+B[i]\big),\qquad i=0,\dots,n_{O}-1,$$

$$I[j]=\big(I[1,1]\,\dots\,I[1,n]\,\dots\,I[n,1]\,\dots\,I[n,n]\big),$$

$${\rm ReLU}(x[i])$$

$${\rm Sigmoid}(x[i])$$

$${\rm SoftMax}(x[i])$$

$$\text{Classification Error}=\frac{1}{n}\sum_{j=1}^{n}\delta\big[{\rm ArgMax}(\{t[i]\}_{j}),\,{\rm ArgMax}(\{x[i]\}_{j})\big],$$

$$O[i]$$

$$\hat{I}[i]$$

$$\mu[i]$$

$$\Big(\sigma[i]\Big)^{2}$$

$$O[i,j,k]=\sum^{n_{f}-1}_{\gamma=0}\sum_{\alpha,\beta}\big(\mathcal{W}[\alpha,\beta,\gamma,k]\,I[\alpha,\beta,\gamma]+\mathcal{B}[k]\big)\,,$$

$$I=\begin{pmatrix}\alpha&\gamma\\ \beta&\delta\end{pmatrix}\;\to\;O=\begin{pmatrix}0&0&0&0\\ 0&\alpha&\gamma&0\\ 0&\beta&\delta&0\\ 0&0&0&0\end{pmatrix},$$

$$O[i,j,k]={\rm Max\,(Average)}\big(\{I[\alpha,\beta,k]\}\big),$$


Full text

Jeong Han Kim (a), Minho Kim (b,c), Kyoungchul Kong (a), Konstantin T. Matchev (d), Myeonghun Park (c)

(a) Department of Physics and Astronomy, University of Kansas, Lawrence, KS 66045, USA

(b) Department of Physics, POSTECH, 77 Cheongam-ro, Nam-gu, Pohang, 37673, Korea

(c) Institute of Convergence Fundamental Studies and School of Liberal Arts, Seoultech, 232 Gongneung-ro, Nowon-gu, Seoul, 01811, Korea

(d) Institute for Fundamental Theory, Physics Department, University of Florida, Gainesville, FL 32611, USA


1 Introduction

The discovery of the Higgs boson Aad:2012tfa ; Chatrchyan:2012xdj jumpstarted the comprehensive program of precision measurements of all Higgs couplings. While the Higgs boson couplings to fermions and gauge bosons are in good agreement with the Standard Model (SM) predictions Khachatryan:2016vau , the Higgs self-couplings are difficult to measure experimentally ATL-PHYS-PUB-2017-001 ; ATL-PHYS-PUB-2016-024 ; Kim:2018uty ; Sirunyan:2018two ; CMS:2015nat ; CMS:2017cwx ; Baglio:2012np ; Sirunyan:2017guj ; Cepeda:2019klc . Yet, the knowledge of those couplings is crucial for understanding the exact mechanism of electroweak symmetry breaking and the origin of mass in our universe. It is also a guaranteed physics target which can be probed at the upgraded Large Hadron Collider (LHC) or at future colliders. The resulting experimental constraints on the Higgs self-couplings will have an immediate and long-lasting impact on model-building efforts beyond the SM.

We parameterize the Higgs self-interaction as follows:

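$$V=\frac{m_{h}^{2}}{2}h^{2}+\kappa_{3}\lambda_{3}^{SM}vh^{3}+\frac{1}{4}\kappa_{4}\lambda_{4}^{SM}h^{4}, \tag{1}$$

where $m_{h}$ is the mass of the SM Higgs boson ($h$), $v\approx 246$ GeV is the Higgs vacuum expectation value,

$$\lambda_{3}^{SM}=\lambda_{4}^{SM}=\frac{m_{h}^{2}}{2v^{2}}$$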

are the SM values for the Higgs self-couplings, while $\kappa_{3}$ and $\kappa_{4}$ parametrize the corresponding deviations from them. In order to access $\kappa_{3}$ ( $\kappa_{4}$ ), one has to measure the process of double (triple) Higgs boson production at the LHC, possibly with high luminosity (HL), or at future colliders.
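To make the parameterization concrete, the sketch below evaluates the SM self-coupling and the potential for given $\kappa_{3}$ and $\kappa_{4}$. It assumes $m_{h}=125$ GeV (a value not quoted in this section) and the standard $v=246$ GeV; the function names are illustrative only.

```python
M_H = 125.0   # GeV, assumed Higgs boson mass
V_EV = 246.0  # GeV, Higgs vacuum expectation value


def sm_self_coupling(m_h=M_H, v=V_EV):
    """lambda_3^SM = lambda_4^SM = m_h^2 / (2 v^2)."""
    return m_h**2 / (2.0 * v**2)


def higgs_potential(h, kappa3=1.0, kappa4=1.0, m_h=M_H, v=V_EV):
    """V(h) with the cubic and quartic terms rescaled by kappa3 and kappa4."""
    lam = sm_self_coupling(m_h, v)
    return (0.5 * m_h**2 * h**2
            + kappa3 * lam * v * h**3
            + 0.25 * kappa4 * lam * h**4)


lam_sm = sm_self_coupling()
print(f"lambda_SM = {lam_sm:.4f}")
```

Under these assumptions the SM value is about 0.13, and setting `kappa3` away from 1 rescales only the cubic term that drives double Higgs production.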

Double Higgs ( $hh$ ) production has been studied in many channels, including $b\bar{b}b\bar{b}$ ATLAS:2018combi ; Aaboud:2018knk ; CMS:2018smw ; deLima:2014dta ; Wardrope:2014kya ; Behr:2015oqq , $b\bar{b}\gamma\gamma$ Sirunyan:2018iwt ; Aaboud:2018ftw ; CMS-PAS-FTR-15-002 ; ATL-PHYS-PUB-2014-019 ; Kim:2018uty ; Kling:2016lay ; Baur:2003gp ; Baglio:2012np ; Huang:2015tdv ; Azatov:2015oxa ; Cao:2015oaa ; Cao:2016zob ; Alves:2017ued ; Barger:2013jfa ; Chang:2018uwu , $b\bar{b}\tau\tau$ CMS-PAS-FTR-15-002 ; Aaboud:2018sfw ; Sirunyan:2017djm ; Kim:2018uty ; Baur:2003gpa ; Goertz:2014qta ; Dolan:2012rv , $b\bar{b}W^{+}W^{-}/ZZ$ Aaboud:2018zhh ; CMS:2017ums ; CMS:2015nat ; CMS:2017cwx ; Kim:2018cxf ; Papaefstathiou:2012qe ; Huang:2017jws , $W^{+}W^{-}W^{+}W^{-}$ Aaboud:2018ksn , etc. Among the different possible final states, here we focus on $hh$ production at the HL-LHC in the final state with two $b$-tagged jets, two leptons and missing transverse momentum. The signal process is $(h\to b{\bar{b}})(h\to W^{\pm}W^{*\mp}\to\ell^{+}\nu_{\ell}{\ell^{\prime}}^{-}\bar{\nu}_{\ell^{\prime}})$ and it suffers from large SM backgrounds, primarily due to top quark pair production ( $t\bar{t}$ ). The few existing studies in this channel therefore employ sophisticated algorithms (neural network (NN) CMS:2015nat , deep neural network (DNN) Sirunyan:2017guj , boosted decision tree (BDT) Adhikary:2017jtu ; CMS:2017cwx , etc.) to increase the signal sensitivity, but show somewhat pessimistic results, with a significance no better than $1\sigma$ at the HL-LHC with 3 ab$^{-1}$ of luminosity.

The recent study in Ref. Kim:2018cxf introduced some new ideas for reducing the SM backgrounds in this channel. For example, the new variables Topness and Higgsness were designed to test whether the event kinematics is consistent with $t\bar{t}$ or $hh$ , respectively. The use of Topness and Higgsness already effectively reduced the $t\bar{t}$ background to a manageable level, and additional variables were then employed to handle the remaining SM background processes — e.g., the subsystem variable $M_{T2}^{(\ell)}$ is effective in eliminating background arising from $\tau$ decays. In this paper, we supplement the novel kinematic method from Ref. Kim:2018cxf with the analysis of the jet image in the $h\to b\bar{b}$ decay, where the basic idea is to treat the detector as a camera and the streams of jets as an image Bhattacherjee:2019fpt ; Gallicchio:2010sw ; Gallicchio:2010dq ; Hook:2011cq ; Cogan:2014oua ; deOliveira:2015xxd ; Lin:2018cin ; deOliveira:2017pjk . In our case, the collimated nature of the Higgs decay will hopefully differ from the patterns obtained in SM production processes. In addition, we adopt a deep learning framework in our main analysis, since it is known that modern deep learning algorithms trained on jet images provide improved signal-to-background discrimination Gallicchio:2010sw ; Gallicchio:2010dq ; Hook:2011cq ; Baldi:2014kfa ; Cogan:2014oua ; deOliveira:2015xxd ; Komiske:2016rsd ; Kasieczka:2017nvn ; Lin:2018cin .

The analysis presented in this paper contains a number of improvements in comparison to previous studies:

• Unlike the customized detector simulation performed in Ref. Kim:2018cxf , here we employ Delphes deFavereau:2013fsa to simulate detector effects such as detector resolution, reconstruction efficiency, etc., and Fastjet Cacciari:2011ma for jet reconstruction.

• We use a deep learning framework to optimize the cuts, which further increases the significance compared to the conventional cut-and-count approach of Ref. Kim:2018cxf .

• We exploit an enlarged set of relevant variables which consists of the 10 variables originally considered in Ref. Adhikary:2017jtu : $p_{T\ell_{1}}$ , $p_{T\ell_{2}}$ , ${\;/\!\!\!\!{P}_{T}}$ , $m_{\ell\ell}$ , $m_{bb}$ , $\Delta R_{\ell\ell}$ , $\Delta R_{bb}$ , $p_{Tbb}$ , $p_{T\ell\ell}$ , and $\Delta\phi_{bb,\ell\ell}$ , supplemented with the six recent variables from Ref. Kim:2018cxf : Topness, Higgsness, $M_{T2}^{(b)}$ , $M_{T2}^{(\ell)}$ , $\hat{s}_{min}^{(\ell\ell)}$ and $\hat{s}_{min}^{(bb\ell\ell)}$ .

• We include a SM background process, $tW$ production, which was missing from all previous discussions of this channel, yet turns out to be the next dominant background once the $t\bar{t}$ background is under control.

• The fact that the Higgs boson $h$ is a color-singlet allows us to use the jet image of the $h\to b\bar{b}$ decay for further background suppression Cogan:2014oua ; Gallicchio:2010sw ; Gallicchio:2010dq ; Hook:2011cq ; Lin:2018cin .

• We examine the effect of pile-up, which was missing from previous studies. The expected average number of pile-up interactions $\left<\mu\right>$ at the HL-LHC is $\mathcal{O}(200)$ per bunch crossing ATL-PHYS-PUB-2019-005 . Thus for any precision measurement, it is crucial to have a strategy in place to ensure that pile-up effects do not jeopardize the analysis. Here we choose to apply the Soft Drop algorithm Larkoski:2014wba for QCD analyses, which is a powerful pile-up mitigation technique. In order to reduce pile-up effects on the relevant kinematic variables, we adopt the ATLAS definition of the missing transverse momentum, which excludes contributions from soft neutral particles Aaboud:2018tkc .

Our results show that the dominant $t\bar{t}$ background can be significantly reduced until it is comparable to the other subdominant backgrounds, i.e., after all cuts, we find that all SM backgrounds contribute at similar levels. This reduction can be accomplished without sacrificing too much of the signal rate, which leads to an improved signal significance. Our study indicates that the dilepton channel from $hh\to b\bar{b}W^{+}W^{-}$ could contribute to the combined significance for $hh$ discovery on par with the other final states, making double Higgs production accessible sooner at the HL-LHC.

This paper is structured as follows. We begin our discussion of the SM backgrounds and present the details of our simulation in section 2. In the following two sections 3 and 4, we provide some basic information on the kinematic variables used later in the analysis and on jet images, respectively. Then in section 5 we discuss how we set up our analysis in a deep learning framework. Section 6 presents our results, while section 7 is reserved for the discussion and conclusions. We include a brief review on deep neural networks in Appendix A.

2 Event generation and detector simulation

Parton-level signal and background events were generated using MadGraph5_aMC@NLO v2.6 Alwall:2014hca with the default NNPDF2.3QED parton distribution functions Ball:2013hta at leading order QCD accuracy at the $\sqrt{s}=14$ TeV LHC. The default dynamical renormalization and factorization scales were used. We assume 3 ab$^{-1}$ of luminosity throughout this paper. Parton-level events were generated with the following cuts: $p_{Tj}>20$ GeV, $p_{Tb}>20$ GeV, $p_{T\gamma}>10$ GeV, $p_{T\ell}>10$ GeV, $|\eta_{j}|<5$, $|\eta_{b}|<5$, $|\eta_{\gamma}|<2.5$, $|\eta_{\ell}|<2.5$, $\Delta R_{bb}<1.8$, $\Delta R_{\ell\ell}<1.3$, 70 GeV $<m_{jj},m_{bb}<$ 160 GeV and $m_{\ell\ell}<75$ GeV. For the $jj\ell\ell\nu\bar{\nu}$ , $\ell\ell bj$ and $tW+j$ backgrounds, we additionally impose 5 GeV $<m_{\ell\ell}<75$ GeV. Here the angular distance $\Delta R_{ij}$ is defined by

$$\Delta R_{ij}=\sqrt{(\Delta\phi_{ij})^{2}+(\Delta\eta_{ij})^{2}}, \tag{2}$$

where $\Delta\phi_{ij}=\phi_{i}-\phi_{j}$ and $\Delta\eta_{ij}=\eta_{i}-\eta_{j}$ are respectively the differences of the azimuthal angles and rapidities between particles $i$ and $j$ .
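The angular distance can be sketched in a few lines, taking it to be the usual quadrature sum of the azimuthal and rapidity differences. One detail is an assumption not spelled out in the text: the azimuthal difference is wrapped into $(-\pi,\pi]$ so that two objects on either side of the $\phi=\pm\pi$ boundary are treated as close. The function names are illustrative.

```python
import math


def delta_phi(phi_i, phi_j):
    """Azimuthal difference phi_i - phi_j, wrapped into (-pi, pi]."""
    dphi = (phi_i - phi_j) % (2.0 * math.pi)
    if dphi > math.pi:
        dphi -= 2.0 * math.pi
    return dphi


def delta_r(eta_i, phi_i, eta_j, phi_j):
    """Angular distance in the (phi, eta) plane."""
    return math.hypot(delta_phi(phi_i, phi_j), eta_i - eta_j)


# Two objects separated by 0.3 in phi and 0.4 in eta.
print(delta_r(0.3, 0.0, -0.1, 0.3))
```

Without the wrapping step, a pair at $\phi=3.0$ and $\phi=-3.0$ would be assigned $|\Delta\phi|=6.0$ instead of about 0.28.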

The double Higgs production cross-section is normalized to $\sigma_{hh}=40.7$ fb, computed at next-to-next-to-leading order (NNLO) accuracy in QCD Grigo:2014jma . Considering all relevant branching fractions, we obtain the signal cross section $\sigma_{hh}\cdot 2\cdot\text{BR}(h\rightarrow b\overline{b})\cdot\text{BR}(h\rightarrow WW^{*}\rightarrow\ell^{+}\ell^{-}\nu\bar{\nu})=0.648$ fb, where $\ell$ denotes an electron or a muon, including leptons from tau decays. The major background is $t\overline{t}$ production, whose cross section is normalized to the NNLO QCD cross-section of 953.6 pb Czakon:2013goa . Another important background is $t\overline{t}h$ , which is normalized to the next-to-leading order (NLO) QCD cross-section of 611.3 fb Dittmaier:2011ti . For the $t\overline{t}V$ ( $V=W^{\pm},Z$ ) background, we apply an NLO k-factor of 1.54, resulting in a cross-section of 1.71 pb deFlorian:2016spz . We apply an NLO k-factor of 1.0 for the Drell-Yan type backgrounds $\ell\ell bj$ and $\tau\tau bb$ , where $j$ denotes partons in the five-flavor scheme. Note that a recent study indicates that ${\rm k}^{NNLO,DY}_{QCD\otimes QED}\approx 1$ deFlorian:2018wcj . The irreducible $jj\ell\ell\nu\nu$ background from the mixed QCD+EW process is included with ${\rm k}_{NLO}=2$. Finally, we generate $tW+j$ events with up to one additional matched jet (in the five-flavor scheme), whose cross-section turns out to be 0.51 pb (after the cuts) including all relevant branching fractions. As we try to reconstruct events, off-shell effects for the top quark and $W$ boson need to be taken care of properly. We generate parton level events with MadGraph5, which includes the proper treatment of the off-shell effects for the top quark and the $W$ boson for both signal and all backgrounds.
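To put these cross sections in perspective, the expected event yield is simply the cross section times the integrated luminosity. The sketch below applies this to the inclusive signal cross section quoted above and to two signal entries of Table 1 (0.01046 fb after baseline cuts and 0.00667 fb at the deep-learning working point); the helper name is illustrative.

```python
LUMI_FB = 3000.0  # 3 ab^-1 expressed in fb^-1


def expected_events(xsec_fb, lumi_fb=LUMI_FB):
    """Expected number of events N = sigma * L."""
    return xsec_fb * lumi_fb


n_inclusive = expected_events(0.648)    # signal before any selection
n_baseline = expected_events(0.01046)   # signal after baseline cuts (Table 1)
n_final = expected_events(0.00667)      # signal at the N = 20 working point

print(f"{n_inclusive:.0f} -> {n_baseline:.1f} -> {n_final:.1f} signal events")
```

The last number reproduces the $N=20$ signal events required in Table 1, which shows how small the usable signal sample is even at the full HL-LHC luminosity.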

Events are further processed for parton-shower/hadronization using Pythia 8.235 Sjostrand:2014zea . We use Delphes 3.4.1 deFavereau:2013fsa for simulating the detector effects and Fastjet 3.3.1 Cacciari:2011ma for jet reconstruction, with modified ATLAS settings as follows.

• Jets are clustered with the anti-$k_{T}$ algorithm Cacciari:2008gp with cone-size $\Delta R=0.4$ , where $\Delta R$ is the distance (2) in the ( $\phi$ , $\eta$ ) space. For the analysis, we consider jets with $p_{Tj}>20$ GeV and $|\eta_{j}|<2.5$ .

• We use a flat $b$-tagging efficiency, $\epsilon_{b\rightarrow b}=0.75$ , and flat mis-tagging rates for non-$b$ jets of $\epsilon_{c\rightarrow b}=0.1$ and $\epsilon_{j\rightarrow b}=0.01$ ATL-PHYS-PUB-2019-005 .

• For lepton isolation, we require $\frac{p_{T\ell}}{p_{T\ell}+\sum_{i}p_{Ti}}>0.7$ , where the sum is taken over the transverse momenta $p_{Ti}$ of all final-state particles $i$ , $i\neq\ell$ , with $p_{Ti}>0.5$ GeV and within $\Delta R_{i\ell}<0.3$ of the lepton candidate $\ell$ . Leptons are also required to have $p_{T\ell}>10$ GeV and $|\eta_{\ell}|<2.5$ .

• For photon isolation, we analogously require $\frac{\sum_{i}p_{Ti}}{p_{T\gamma}}<0.12$ for particles within $\Delta R_{i\gamma}<0.3$ of the photon candidate $\gamma$ . Photons are also required to have $p_{T\gamma}>25$ GeV and $|\eta_{\gamma}|<2.5$ .

• The missing transverse momentum ${\;/\!\!\!\!\vec{P}_{T}}$ is defined as the negative vector sum of the transverse momenta of the accepted leptons, photons, jets and soft tracks Aaboud:2018tkc :

$${\;/\!\!\!\!\vec{P}_{T}}=-\Big(\sum\vec{p}_{T\ell}+\sum\vec{p}_{T\gamma}+\sum\vec{p}_{Tj}+\sum\vec{p}_{T(\text{track})}\Big). \tag{3}$$

Here the last term is added to account for unused soft tracks. These tracks are required to have $p_{T}>0.4$ GeV, $|\eta|<2.5$ and transverse (longitudinal) impact parameter $|d_{0}|<1.5\,\textrm{mm}\,(|z_{0}\sin\theta|<1.5\,\textrm{mm})$ . To reduce effects from pile-up, we only use particles which have track information.
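Two of the object definitions above translate directly into code. The sketch below is a simplified illustration rather than the Delphes implementation: the lepton isolation helper assumes the caller has already selected the particles inside the $\Delta R<0.3$ cone, and the missing transverse momentum helper takes a flat list of `(pt, phi)` pairs for the accepted leptons, photons, jets and soft tracks. All names are hypothetical.

```python
import math


def lepton_isolation(lepton_pt, cone_pts, min_pt=0.5):
    """Ratio pT(l) / (pT(l) + sum of cone pT); isolated if the ratio exceeds 0.7.

    cone_pts: pT values (GeV) of other particles within Delta R < 0.3.
    Particles below min_pt are ignored, as in the text.
    """
    cone_sum = sum(pt for pt in cone_pts if pt > min_pt)
    return lepton_pt / (lepton_pt + cone_sum)


def missing_pt(objects):
    """Negative vector sum of the transverse momenta of the accepted objects.

    objects: iterable of (pt, phi) pairs. Returns (px, py) in GeV.
    """
    objects = list(objects)
    px = -sum(pt * math.cos(phi) for pt, phi in objects)
    py = -sum(pt * math.sin(phi) for pt, phi in objects)
    return px, py


iso = lepton_isolation(30.0, [5.0, 0.3, 5.0])  # the 0.3 GeV particle is skipped
met_x, met_y = missing_pt([(40.0, 0.0), (25.0, math.pi / 2)])
print(iso, math.hypot(met_x, met_y))
```

In this toy input the lepton passes the isolation requirement, and the missing transverse momentum balances the two visible objects.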

After particle reconstruction, we employ the following baseline selection cuts from Ref. Kim:2018cxf . [Footnote 1: For the motivation behind these cuts, see Fig. 1 (in which the cut values are indicated with vertical dotted lines) and the related discussion in Sec. 3 below.]

• the two leading jets must be $b$-tagged, each with $p_{T}>30$ GeV,

• exactly two isolated leptons of opposite sign, each with $p_{T\ell}>20$ GeV,

• ${\;/\!\!\!\!{P}_{T}}=|{\;/\!\!\!\!\vec{P}_{T}}|>20$ GeV for the reconstructed missing transverse momentum,

• proximity cut of $\Delta R_{\ell\ell}<1.0$ for the two leptons,

• proximity cut of $\Delta R_{bb}<1.3$ for the two $b$-tagged jets,

• $m_{\ell\ell}<65$ GeV for the two leptons,

• $95$ GeV $<m_{bb}<140$ GeV for the two $b$-tagged jets.
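The baseline selection listed above can be written as a single predicate. This is a sketch under the assumption that each reconstructed event is stored as a dictionary; the key names are hypothetical and chosen only for illustration, with all momenta and masses in GeV.

```python
def passes_baseline(event):
    """Return True if the event satisfies all baseline selection cuts."""
    leptons_ok = (
        len(event["lepton_pt"]) == 2  # exactly two isolated leptons
        and event["lepton_charge"][0] * event["lepton_charge"][1] < 0
        and min(event["lepton_pt"]) > 20.0
    )
    bjets_ok = (
        all(event["leading_jets_btagged"])  # two leading jets are b-tagged
        and min(event["bjet_pt"]) > 30.0
    )
    return (
        leptons_ok
        and bjets_ok
        and event["met"] > 20.0
        and event["dr_ll"] < 1.0
        and event["dr_bb"] < 1.3
        and event["m_ll"] < 65.0
        and 95.0 < event["m_bb"] < 140.0
    )


example = {
    "lepton_pt": [45.0, 28.0], "lepton_charge": [+1, -1],
    "leading_jets_btagged": [True, True], "bjet_pt": [90.0, 55.0],
    "met": 60.0, "dr_ll": 0.6, "dr_bb": 1.0, "m_ll": 35.0, "m_bb": 118.0,
}
print(passes_baseline(example))
```

Only events passing this predicate would enter the deep learning stage, where the 16 kinematic variables and jet images are built.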

For those events which passed the baseline cuts, we form 16 kinematic variables, as well as jet images. As we will see later, the jet images can capture additional features which are not already contained in the 16 standard kinematic variables. Therefore one can obtain better performance by combining kinematics and jet images, which is one of the main ideas of this paper.

3 Kinematics in signal and backgrounds

In this section we introduce the 16 kinematic variables used in this analysis. Their kinematic distributions (for signal and all relevant backgrounds) are shown in Fig. 1 and will be discussed shortly.

We begin with ten standard kinematic variables, which were previously considered in Refs. Adhikary:2017jtu ; CMS:2017cwx (their distributions are shown in the first ten panels of Fig. 1):

• $m_{bb}$ , the invariant mass of the two $b$-tagged jets (1st plot in the 1st row). This is expected to be a good variable, since for signal events, the two $b$-jets originate from the decay of a narrow resonance (the Higgs boson) and would therefore reconstruct to the Higgs mass, up to resolution effects: $m_{bb}\sim m_{h}$ . This justifies the baseline cut of $95$ GeV $<m_{bb}<140$ GeV, as indicated with the vertical dotted lines. In contrast, no such correlation exists for background events: the two $b$-jets either originate from different decay chains and are uncorrelated (as in the case of $t\bar{t}$ , for example), or they reconstruct to the mass of a $Z$-boson or an off-shell gluon, with a mass lower than $m_{h}$ . The plot in Fig. 1, while confirming those expectations, also shows that the total background happens to peak at a value of $m_{bb}$ which, unfortunately, is not too far away from $m_{h}$ , providing the motivation to explore other variables.

• $m_{\ell\ell}$ , the invariant mass of the two leptons (2nd plot in the 1st row). For the case of the signal, the two leptons ultimately originate from the Higgs boson decay, and therefore their invariant mass $m_{\ell\ell}$ is bounded from above, hence the baseline cut of $m_{\ell\ell}<65$ GeV. Note that the $m_{\ell\ell}$ distribution (which is observable) should be the same as the distribution of $m_{\nu\nu}$ (which is unobservable).

• $\Delta R_{bb}$ , the angular separation (2) between the two $b$-tagged jets (3rd plot in the 1st row). Given the relatively low Higgs mass, the two Higgs particles in $hh$ production have sizable transverse momentum and their respective decay products (e.g., the two $b$-quarks) tend to go in the same direction. This and the next four variables try to exploit this kinematic property of the signal. For example, the Higgs boost implies that $\Delta R_{bb}$ is relatively small for signal events, and this motivates the proximity cut of $\Delta R_{bb}<1.3$ .

• $\Delta R_{\ell\ell}$ , the angular separation (2) between the two leptons (4th plot in the 1st row). Here the same arguments apply as in the case of $\Delta R_{bb}$ just discussed. The corresponding plot in Fig. 1 confirms that the signal $\Delta R_{\ell\ell}$ distribution peaks well below most of the background processes, prompting the baseline cut of $\Delta R_{\ell\ell}<1.0$ .

• $\Delta\phi_{bb,\ell\ell}$ , the azimuthal angle in the transverse plane between the two $b$-jet system and the two lepton system (1st plot in the 2nd row). This is yet another way to capture the back-to-back boost of the two Higgs bosons in double Higgs production. Fig. 1 shows that the signal peaks at $\Delta\phi_{bb,\ell\ell}=\pm\pi$ more sharply than the background, which could be exploited later in the neural network analysis. However, no baseline cut was applied in this case, since $\Delta\phi_{bb,\ell\ell}$ is expected to be largely correlated with $\Delta R_{bb}$ and $\Delta R_{\ell\ell}$ .

• $p_{Tbb}$ , the transverse momentum of the two $b$-jet system (2nd plot in the 2nd row). Like the previous three variables, this variable is motivated by the significant boost of the Higgs bosons in the signal, but no baseline cut was applied.

• $p_{T\ell\ell}$ , the transverse momentum of the two lepton system (3rd plot in the 2nd row). This variable behaves similarly to $p_{Tbb}$ , but to a lesser extent, since the two leptons come from separate $W$ bosons, while the two $b$-quarks are direct decay products of the Higgs boson.

• ${\;/\!\!\!\!{P}_{T}}=|{\;/\!\!\!\!\vec{P}_{T}}|$ , the magnitude of the missing transverse momentum (4th plot in the 2nd row). A ${\;/\!\!\!\!{P}_{T}}$ cut is routinely applied in order to fight the QCD backgrounds (not shown in Fig. 1). Following Ref. CMS:2017cwx , here we use a baseline cut of ${\;/\!\!\!\!{P}_{T}}>20$ GeV.

• $p_{T\ell_{1}}$ , the transverse momentum of the hardest lepton (1st plot in the 3rd row).

• $p_{T\ell_{2}}$ , the transverse momentum of the next-hardest lepton (2nd plot in the 3rd row). As shown in Fig. 1, the individual transverse momenta of the two leptons are similar for both signal and backgrounds. Therefore, the lepton $p_{T}$ 's may be good for triggering purposes, but not for background rejection.

We note that for the signal, many of these 10 variables are strongly correlated with each other. [Footnote 2: The strong correlation arises due to the very nature of double Higgs production: the two Higgs particles are produced with a sizable transverse momentum, which restricts the kinematics of their decay products.] This implies that cutting on one variable significantly reduces the power of other variables. At the same time, while these 10 variables are among the most commonly used in high energy physics, it is not guaranteed that they fully capture all kinematic differences between signal and background. This is why we introduce six additional variables Kim:2018cxf : Topness, Higgsness, $\sqrt{\hat{s}}^{(bb\ell\ell)}_{min}$ , $\sqrt{\hat{s}}^{(\ell\ell)}_{min}$ , $M_{T2}^{(b)}$ and $M_{T2}^{(\ell)}$ , shown in the last six panels of Fig. 1, which are meant to take full advantage of the kinematic differences between the signal and background event topologies.

The Topness variable measures the degree of consistency of a given event with the kinematics of dilepton $t\bar{t}$ production, where there are 6 unknowns (the three-momenta of the two neutrinos, $\vec{p}_{\nu}$ and $\vec{p}_{\bar{\nu}}$ ) and four on-shell constraints, $m_{t}$ , $m_{\bar{t}}$ , $m_{W^{+}}$ and $m_{W^{-}}$ . Here $m_{t}=m_{\bar{t}}$ is the mass of the top or antitop quark, and $m_{W^{\pm}}=m_{W}$ is the mass of the $W$ boson. The neutrino momenta can then be fixed by minimizing the following quantity

[TABLE]

subject to the missing transverse momentum constraint, ${\;/\!\!\!\!\vec{P}_{T}}=\vec{p}_{T\nu}+\vec{p}_{T\bar{\nu}}$ . The parameters $\sigma_{t}$ and $\sigma_{W}$ are indicative of the corresponding experimental resolutions and intrinsic particle widths. In principle, they can be treated as free parameters and one can tune them using NN, BDT, etc. In our numerical study, we shall use $\sigma_{t}=5$ GeV and $\sigma_{W}=5$ GeV. Since there is a twofold ambiguity in the pairing of a $b$-quark and a lepton, Topness is defined as the smaller of the two $\chi^{2}$ values Kim:2018cxf ,

$$T\equiv\min\left(\chi^{2}_{12},\,\chi^{2}_{21}\right). \tag{5}$$

The Topness distributions for both signal and backgrounds before baseline cuts are shown in Fig. 1 (3rd plot in the 3rd row). We observe that, as expected, $T$ tends to have smaller values for the main background ( $t\bar{t}$ ) than for signal.

In our signal of $hh$ production, the two $b$ -quarks arise from a Higgs decay ( $h\to b\bar{b}$ ), and therefore their invariant mass $m_{bb}$ can be used as a first cut to enhance the signal sensitivity. For the decay of the other Higgs boson, $h\to W^{\pm}W^{*\mp}$ , Higgsness is defined as follows Kim:2018cxf

[TABLE]

It tests whether the neutrino kinematics can be compatible with having the Higgs boson and one of the $W$ -bosons on-shell, while at the same time being consistent with the invariant mass distributions expected for the off-shell $W$ -boson, $W^{\ast}$ , and the neutrino pair, $\nu\bar{\nu}$ . The invariant mass $m_{W^{*}}$ is bounded by $0\leq m_{W^{*}}\leq m_{h}-m_{W}$ and the peak of its distribution is at

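$$m_{W^{*}}^{peak}=\frac{1}{\sqrt{3}}\sqrt{2\left(m_{h}^{2}+m_{W}^{2}\right)-\sqrt{m_{h}^{4}+14m_{h}^{2}m_{W}^{2}+m_{W}^{4}}}\,. \tag{7}$$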

The left panel of Fig. 2 shows the unit-normalized invariant mass distribution of the proper lepton-neutrino system ( $m_{\ell\nu}$ ). The distribution has a bimodal shape — the narrow peak on the right near 80 GeV corresponds to the on-shell $W$ -boson resonance, while the broader hump to the left is due to the off-shell $W^{\ast}$ , with a clear end-point at $m_{h}-m_{W}=45$ GeV and a maximum near $m_{W^{*}}^{peak}=40$ GeV in accordance with (7).

The definition of Higgsness (6) also includes a term which tests for consistency with the expected invariant mass distribution $\frac{\textrm{d}\sigma}{\textrm{d}m_{\nu\bar{\nu}}}$ for the neutrino pair, which is shown in the right panel of Fig. 2. [Footnote 3: In the limit of massless leptons, the distribution $\frac{\textrm{d}\sigma}{\textrm{d}m_{\nu\bar{\nu}}}$ is the same as the dilepton mass distribution $\frac{\textrm{d}\sigma}{\textrm{d}m_{\ell^{+}\ell^{-}}}$ , which is directly observable and therefore more commonly discussed in the literature Han:2009ss ; Han:2012nr ; Han:2012nm ; Cho:2012er .] The red solid curve gives the pure phase space prediction

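$$\frac{d\sigma}{dm_{\nu\bar{\nu}}}\propto\int dm_{W^{*}}^{2}\,\lambda^{1/2}(m_{h}^{2},m_{W}^{2},m_{W^{*}}^{2})\,f(m_{\nu\bar{\nu}}), \tag{8}$$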

where $\lambda(x,y,z)=x^{2}+y^{2}+z^{2}-2xy-2yz-2zx$ is the two-body phase space function and $f(m)$ is the invariant mass distribution of the antler topology with $h\to WW^{*}\to\ell^{+}\ell^{-}\nu\bar{\nu}$ :

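$$f(m)\sim\left\{\begin{array}{ll}\eta\,m\,,&0\leq m\leq e^{-\eta}E,\\ m\ln(E/m)\,,&e^{-\eta}E\leq m\leq E,\end{array}\right. \tag{9}$$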

where the endpoint $E$ and the parameter $\eta$ are defined in terms of the particle masses as

[TABLE]
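The two building blocks of the phase space prediction, the function $\lambda(x,y,z)$ and the antler shape $f(m)$, can be sketched as follows. The endpoint $E$ and the parameter $\eta$ are treated here as plain inputs, since their expressions in terms of the particle masses are not reproduced in this sketch; the values $m_{h}=125$ GeV and $m_{W}=80$ GeV are assumptions chosen to match the quoted endpoint $m_{h}-m_{W}=45$ GeV.

```python
import math


def kallen(x, y, z):
    """lambda(x, y, z) = x^2 + y^2 + z^2 - 2xy - 2yz - 2zx."""
    return x * x + y * y + z * z - 2.0 * (x * y + y * z + z * x)


def antler_shape(m, endpoint, eta):
    """Unnormalized f(m): linear rise, then m * ln(E/m) up to the endpoint E."""
    if m < 0.0 or m > endpoint:
        return 0.0
    if m <= math.exp(-eta) * endpoint:
        return eta * m
    return m * math.log(endpoint / m)


M_H, M_W = 125.0, 80.0  # GeV, assumed masses

# The phase space factor vanishes at the kinematic endpoint m_W* = m_h - m_W.
print(kallen(M_H**2, M_W**2, (M_H - M_W) ** 2))
```

The two branches of `antler_shape` agree at $m=e^{-\eta}E$, so the shape is continuous, and it falls to zero at $m=E$.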

Note that by allowing one of the $W$-bosons to be on-shell, eqs. (8-13) generalize the results previously derived in Refs. Han:2009ss ; Han:2012nr ; Han:2012nm ; Cho:2012er for the purely on-shell case. The blue histogram in the right panel of Fig. 2 shows the actual $m_{\nu\bar{\nu}}$ distribution, whose shape is slightly different from the pure phase space result (8), due to a helicity suppression in the $W$-$\ell$-$\nu$ vertex. In particular, we observe that the actual peak is at $m_{\nu\bar{\nu}}^{peak}\approx 30$ GeV, which is the value that we shall use in the definition of Higgsness (6). [Footnote 4: We note that other variants of Higgsness are also possible. For example, instead of penalizing the function $H$ by the distances to the peaks in the corresponding distributions, one can introduce penalty terms which take advantage of the knowledge of the exact probability distributions (the blue histograms in Fig. 2).]

The definition of Higgsness (6) contains some additional resolution parameters: $\sigma_{h}$ for the reconstructed mass of the Higgs boson, $\sigma_{W^{\ast}}$ for the reconstructed mass of the off-shell $W$ boson, and $\sigma_{\nu}$ for the $m_{\nu\bar{\nu}}$ resolution. In what follows, we shall take $\sigma_{W^{\ast}}=5$ GeV, $\sigma_{h_{\ell}}=2$ GeV, and $\sigma_{\nu}=10$ GeV.$^{5}$

$^{5}$ We have checked that our results are not very sensitive to these choices.

The Higgsness distributions for both signal and backgrounds before baseline cuts are shown in Fig. 1 (fourth plot in the third row). The two-dimensional map of (Higgsness, Topness) on a log-log scale is depicted in Fig. 3. The Higgsness and Topness distributions in Fig. 1 are projections of this two-dimensional scatter plot onto the $x$ -axis and $y$ -axis, respectively. Although the signal and the backgrounds do not exhibit a very clean separation in the individual one-dimensional projections in Fig. 1, their two-dimensional correlation plots show some visible differences. We note that even after employing the baseline cuts, one can still see a difference in the two-dimensional correlation of Higgsness and Topness (bottom row plots).

Along with Higgsness and Topness, we also consider two versions of the $\hat{s}_{min}$ variable Konar:2008ei ; Konar:2010ma , which is defined as

[TABLE]

where $({\rm v})$ represents a set of visible particles under consideration, while $m_{\rm v}$ and $\vec{P}_{T}^{\rm v}$ are their invariant mass and transverse momentum, respectively. The variable (14) characterizes the system consisting of the visible particles $({\rm v})$ and the invisible particles (here assumed to be massless) which are responsible for the measured missing transverse momentum ${\;/\!\!\!\!\vec{P}_{T}}$ . It provides the minimum value of the Mandelstam invariant mass $\hat{s}$ for the system which is consistent with the observed visible 4-momentum vector. We shall apply (14) to the whole event, where ${\rm v}=\{bb\ell\ell\}$ , or to the subsystem resulting from the decay $h\to W^{\pm}W^{*\mp}\to\ell^{+}\ell^{-}\nu\bar{\nu}$ , where ${\rm v}=\{\ell\ell\}$ . The distributions of the resulting variables $\hat{s}_{min}^{(bb\ell\ell)}$ and $\hat{s}_{min}^{(\ell\ell)}$ are shown in the left two panels on the fourth row of Fig. 1. The $\hat{s}_{min}^{(bb\ell\ell)}$ variable represents the minimum energy required to produce the two original parent particles (the two Higgs bosons in the case of the signal and the two top quarks in the case of the major $t\bar{t}$ background). This is why one would expect the distribution to peak around the parent mass threshold, $2m_{h}$ for the signal and $2m_{t}$ for the background Konar:2008ei . However, the first panel in the fourth row of Fig. 1 shows that while the background $\hat{s}_{min}^{(bb\ell\ell)}$ distribution peaks near $2m_{t}$ , which is expected, the signal $\hat{s}_{min}^{(bb\ell\ell)}$ distribution peaks around 400 GeV, which is substantially higher than $2m_{h}$ . This implies that the two top quarks are produced more or less at rest, while the two Higgs bosons have a sizable boost. Similarly, the variable $\hat{s}_{min}^{(\ell\ell)}$ is the minimum energy required to produce the two $W$ bosons. For the $t\bar{t}$ background, where both $W$ bosons are on-shell, the peak is expected to occur around $2m_{W}$ . On the other hand, the signal distribution should be softer, since one of the $W$ bosons is off-shell, and furthermore, the peak should be located slightly below the Higgs boson mass. These kinematic differences are illustrated in the second plot on the fourth row of Fig. 1, and motivate the use of $\hat{s}_{min}^{(\ell\ell)}$ as an analysis variable.
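As a rough numerical illustration of the variable discussed above, the sketch below assumes the commonly quoted form of $\hat{s}_{min}$ for massless invisible particles from Konar:2008ei , $\hat{s}_{min}^{({\rm v})}=m_{\rm v}^{2}+2\big(\sqrt{|\vec{P}_{T}^{\rm v}|^{2}+m_{\rm v}^{2}}\;|{\;/\!\!\!\!\vec{P}_{T}}|-\vec{P}_{T}^{\rm v}\cdot{\;/\!\!\!\!\vec{P}_{T}}\big)$ . The function name, the input convention (transverse components in GeV) and the numerical inputs are hypothetical and serve only as an example.

```python
import math

def s_hat_min(m_vis, pt_vis, met):
    """Return s_hat_min in GeV^2, assuming massless invisible particles.

    m_vis  : invariant mass of the visible system (GeV)
    pt_vis : (px, py) of the visible system (GeV)
    met    : (px, py) of the missing transverse momentum (GeV)
    """
    # Transverse energy of the visible system
    et_vis = math.sqrt(m_vis**2 + pt_vis[0]**2 + pt_vis[1]**2)
    met_mag = math.hypot(met[0], met[1])
    dot = pt_vis[0] * met[0] + pt_vis[1] * met[1]
    return m_vis**2 + 2.0 * (et_vis * met_mag - dot)

# Hypothetical dilepton system with the missing momentum recoiling against it
sqrt_s_ll = math.sqrt(s_hat_min(40.0, (30.0, 10.0), (-25.0, -5.0)))
```

The same function would be called with the $bb\ell\ell$ system as the visible input to obtain the whole-event version of the variable.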

The last two panels in the fourth row of Fig. 1 show distributions of the subsystem $M_{T2}$ variable Burns:2008va — first when it is applied to the $b\bar{b}$ visible system resulting from the $t\to bW$ decays ( $M_{T2}^{(b)}$ ), and then when it is applied to the $\ell^{+}\ell^{-}$ visible system resulting from the $W\to\ell\nu$ decays ( $M_{T2}^{(\ell)}$ ). In principle, $M_{T2}$ is defined as Lester:1999tx

[TABLE]

where the minimization over the transverse masses of the parent particles $M_{TP_{i}}$ ( $i=1,2$ ) is performed over the transverse neutrino momenta $\vec{p}_{\nu T}$ and $\vec{p}_{\bar{\nu}T}$ , subject to the ${\;/\!\!\!\!\vec{P}_{T}}$ constraint.$^{6}$ The parameter $\tilde{m}$ in (15) is the test mass for the daughter particle: in the case of $M_{T2}^{(\ell)}$ one should use $\tilde{m}=m_{\nu}=0$ , while in the case of $M_{T2}^{(b)}$ , the daughter particles are the $W$ bosons, and $\tilde{m}=m_{W}=80$ GeV, which leads to the lower bound $m_{W}\leq M_{T2}^{(b)}$ visible in the plot. By construction, the $M_{T2}$ variables are bounded by the mass of the corresponding parent particle. Indeed, the $M_{T2}^{(b)}$ distribution for $t\bar{t}$ production shows a sharp drop around $M_{T2}^{(b)}=m_{t}$ , while the signal distribution extends well above $m_{t}$ . Similarly, the $M_{T2}^{(\ell)}$ distribution for $t\bar{t}$ drops around $m_{W}$ , as expected. In addition, it exhibits a peak structure in the first bin, which is due to leptonic tau decays. This suggests that $M_{T2}^{(\ell)}$ can be effective in eliminating backgrounds with $\tau$s.

$^{6}$ See Refs. Barr:2011xt ; Kim:2017awi ; Cho:2014naa ; Konar:2009wn ; Konar:2009qr ; Baringer:2011nh ; Kim:2015uea ; Goncalves:2018agy ; Debnath:2017ktz for more information and other variants of $M_{T2}$ .
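To make the minimization concrete, the following sketch assumes the standard definition $M_{T2}(\tilde{m})=\min_{\vec{q}_{1}+\vec{q}_{2}={\;/\!\!\!\!\vec{P}_{T}}}\max\{M_{TP_{1}},M_{TP_{2}}\}$ from Lester:1999tx and evaluates it by a brute-force scan over the transverse momentum of the first invisible particle. This is not how the variable is computed in practice (dedicated algorithms are much faster and more accurate); the function names, grid size and scan range are hypothetical choices, and a finite grid can only return an upper estimate of the true minimum.

```python
import math

def transverse_mass(m_vis, p_vis, m_inv, q_inv):
    """Transverse mass of a parent decaying to one visible and one invisible daughter."""
    e_vis = math.sqrt(m_vis**2 + p_vis[0]**2 + p_vis[1]**2)
    e_inv = math.sqrt(m_inv**2 + q_inv[0]**2 + q_inv[1]**2)
    mt_sq = m_vis**2 + m_inv**2 + 2.0 * (
        e_vis * e_inv - p_vis[0] * q_inv[0] - p_vis[1] * q_inv[1])
    return math.sqrt(max(mt_sq, 0.0))  # guard against tiny negative round-off

def mt2_grid(vis1, vis2, met, m_test, half_width=300.0, n=121):
    """Approximate M_T2 by scanning the first invisible momentum on an n x n grid.

    vis1, vis2 : (mass, px, py) of the two visible systems (GeV)
    met        : (px, py) of the missing transverse momentum (GeV)
    m_test     : test mass of the invisible daughters (GeV)
    """
    step = 2.0 * half_width / (n - 1)
    best = float("inf")
    for i in range(n):
        qx = -half_width + i * step
        for j in range(n):
            qy = -half_width + j * step
            # The second invisible momentum is fixed by the missing-momentum constraint
            q2 = (met[0] - qx, met[1] - qy)
            m1 = transverse_mass(vis1[0], vis1[1:], m_test, (qx, qy))
            m2 = transverse_mass(vis2[0], vis2[1:], m_test, q2)
            best = min(best, max(m1, m2))
    return best
```

Calling `mt2_grid` with the two $b$-jets and `m_test=80.0` mimics $M_{T2}^{(b)}$ , while the two leptons with `m_test=0.0` mimic $M_{T2}^{(\ell)}$ .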

This concludes our discussion of the 16 kinematic variables depicted in Fig. 1. The newly introduced 6 variables (Topness, Higgsness, $\hat{s}_{min}^{(bb\ell\ell)}$ , $\hat{s}_{min}^{(\ell\ell)}$ , $M_{T2}^{(b)}$ and $M_{T2}^{(\ell)}$ ) typically require a few extra steps to compute them, thus we shall refer to them as high-level kinematic variables, while the remaining 10 traditional variables will be called low-level kinematic variables. We will perform two independent analyses — one with and one without the high-level kinematic variables, in order to estimate the performance benefit from adding the additional 6 variables.

4 Color flow in signal and backgrounds

We note that the two $b$ -quarks in the signal result from the decay of a single non-colored object, the Higgs boson. In contrast, the two $b$ -quarks in $t\bar{t}$ production (which is the dominant background) arise from the decays of top quarks, which in turn are produced via the strong interactions from a gluon-gluon initial state. This distinction is pictorially illustrated in Fig. 4. The different color-flow Maltoni:2002mq will lead to different hadronization patterns, which can be used to discriminate a color singlet particle from a color octet (or triplet) at hadron colliders such as the LHC. Since the quarks which originate from a color singlet particle are color-connected to each other, their hadronization will not involve the initial state partons. On the contrary, the quarks which originate from a color octet particle are color-connected to the annihilating partons in the initial state, and consequently their hadronization is correlated with these initial state partons, see Fig. 4.

The difference in color flow will be reflected in the resulting hadron distributions. Hadrons coming from a color-singlet object will tend to be closer to the direction of the original mother particle, and as a result, the soft radiation will tend to populate the region between the two $b$ quarks. On the other hand, hadrons from the decay of a color-octet particle will not be so narrowly focused, due to the influence of the initial state partons. These features are illustrated in Fig. 5, where we show the cumulative $p_{T}$ distributions in the $(\eta,\phi)$ plane after showering the same partonic event 10,000 times. In the left panel we used a signal event, while in the right panel we used an event from $t\bar{t}$ production. We see that the b-jet clusters in the right panel tend to be better defined and more isolated, since they are not color-correlated among themselves. On the other hand, in the left panel we observe quite a bit of soft radiation in the region between the two $b$ jets, due to the existing color connection between them.

Of course, the results in Fig. 5 are only valid in the statistical sense, since we took the same parton-level event and hadronized it multiple times. In reality, only one instance of this hadronization will be realized, as illustrated in Fig. 6. The top row of plots shows the hadronization patterns for charged particles (left panel) and neutral particles (right panel) in the case of one signal event, while the bottom row shows the same, but for one $t\bar{t}$ event. The parton-level event information is quoted (in GeV) to the right of each row of panels, and then each event is translated in the ( $\eta,\phi$ ) plane until the origin is aligned with the direction of the $b$ -quark pair. The color scheme indicates the total $p_{T}$ in each pixel, while the dotted circles represent the $\Delta R=0.4$ cones for reconstruction of the corresponding $b$ jets.

As an alternative to Fig. 5, in Figs. 7 and 8, we illustrate the effects of color-connection by showing the average of the jet images for the signal and the different background processes before and after the baseline cuts, respectively (some basic generation-level cuts were imposed on the events in Fig. 7). The origin of the ( $\eta,\phi$ ) plane is taken to be the center of the $b$ quark pair and the color scheme indicates the total $p_{T}$ in each pixel. The black dotted line delineates the region $-1.6\leq\eta\leq 1.6$ and $-2.01\leq\phi\leq 2.01$ used in the analysis. One can observe a striking difference in density between signal and background events in Fig. 7: the two $b$ quarks tend to be more collimated in the signal and more spread out in the background.

Unfortunately, after imposing the baseline cuts introduced in Section 2, this distinction tends to be washed out and the backgrounds start mimicking the signal: one can see a similar structure emerging in all panels in Fig. 8, albeit with some subtle differences. Although one may find it difficult to discriminate signal from backgrounds simply by looking at a particular event, the patterns in the average jet images are different, and have been used actively for signal versus background separation Gallicchio:2010sw ; Hook:2011cq ; Aaboud:2018ibj . In this paper, instead of quantifying the difference (e.g., with a pull vector Gallicchio:2010sw ) we will feed the images themselves into deep neural networks (DNNs), along with the 16 kinematic variables introduced in the previous section.

5 Analysis using deep learning

DNNs are known to be very efficient and powerful in image recognition NIPS2012_4824 ; 0483bd9444a348c8b59d54a190839ec9 and the particle physics community has used them for various applications.$^{7}$ For instance, one can map the information about the direction and the energy (or transverse momentum) of a particle onto a pixel in an image. A DNN then provides excellent classification between signal and background in the jet image Cogan:2014oua ; Komiske:2016rsd ; Kasieczka:2017nvn ; Lin:2018cin . It also shows performance gains in multivariate analyses over traditional cut-and-count analyses or BDTs Baldi:2014kfa ; Baldi:2016fql . In this section, we describe how we organize our analysis in a DNN framework. In the following three subsections, we address the issues of data pre-processing, DNN architecture and training of the NN.

$^{7}$ The use of neural networks for data analysis in high energy physics can be traced back to the pioneering work by R. Field and his students in the mid-nineties Field:1996rw ; Field:1996nq ; RFtalk .

5.1 Data pre-processing

In order to improve the DNN learning performance and to minimize the error, it is important to properly process signal and background events before feeding them into a DNN framework. For each event passing the baseline cuts, the jet images are processed as follows.

1. Input data: we use the particle flow for our input data CMS-PAS-PFT-09-001 .

2. Particle classification: we divide the particle flow into two groups: neutral particles and charged particles. Neutral particles include photons and neutral hadrons, while charged particles include charged hadrons.

3. Lepton removal: if there is a lepton, we remove it.

4. Shift: we shift all particle coordinates in the ( $\eta,\phi$ ) plane with respect to the center of the reconstructed $b$ -quark pair, i.e., we set $(\frac{\eta_{b}+\eta_{\bar{b}}}{2},\frac{\phi_{b}+\phi_{\bar{b}}}{2})$ as the new origin, (0,0).

5. Pixelization: we discretize the rectangular region in the ( $\eta,\phi$ ) plane defined by $-2.5\leq\eta\leq 2.5$ and $-\pi\leq\phi\leq\pi$ into a grid of $50\times 50$ pixels for each particle classification (charged particle set and neutral particle set). In each pixel, we record the total transverse momentum as the pixel’s intensity (in case of more than one particle, we add the transverse momenta and record the total sum). We refer to this $50\times 50$ discrete image as the jet image Cogan:2014oua .

6. Normalization: we rescale each jet image intensity as $I^{ij}\rightarrow I^{ij}/I^{\textrm{max}}$ , where $i,j=1,2,\ldots,50$ , and $I^{ij}$ represents the intensity value in the $(i,j)$ pixel. $I^{\textrm{max}}$ is defined to be the largest value of pixel intensity found in the two $50\times 50$ pixel images.

7. Cropping: we crop the jet image to $32\times 32$ pixels, by further restricting to the ( $\eta,\phi$ ) rectangular range of $-1.6\leq\eta\leq 1.6$ and $-2.01\leq\phi\leq 2.01$ .
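The pre-processing steps listed above can be sketched in a few lines of plain Python. This is only an illustrative reading of the procedure, not the code used in the analysis: the function names, the input tuple layout, the wrapping of $\Delta\phi$ into $[-\pi,\pi)$ and the choice of the central 32 pixels as the cropped window are all assumptions made for the example.

```python
import math

N_PIX, N_CROP = 50, 32            # full and cropped image sizes quoted in the text
ETA_MAX, PHI_MAX = 2.5, math.pi   # half-widths of the pixelized region

def wrap_phi(phi):
    """Map an azimuthal angle difference into [-pi, pi) (assumed convention)."""
    return (phi + math.pi) % (2.0 * math.pi) - math.pi

def jet_images(particles, b1, b2):
    """Return [charged, neutral] images, each a 32x32 nested list.

    particles : iterable of (pt, eta, phi, charge, is_lepton)
    b1, b2    : (eta, phi) of the two reconstructed b-jets
    """
    # Shift: the midpoint of the b-jet pair becomes the origin
    eta0 = 0.5 * (b1[0] + b2[0])
    phi0 = 0.5 * (b1[1] + b2[1])
    images = [[[0.0] * N_PIX for _ in range(N_PIX)] for _ in range(2)]
    for pt, eta, phi, charge, is_lepton in particles:
        if is_lepton:                      # lepton removal
            continue
        deta = eta - eta0
        dphi = wrap_phi(phi - phi0)
        if not -ETA_MAX <= deta < ETA_MAX:
            continue
        # Pixelization: accumulate pT in the pixel containing the particle
        i = min(int((deta + ETA_MAX) / (2.0 * ETA_MAX) * N_PIX), N_PIX - 1)
        j = min(int((dphi + PHI_MAX) / (2.0 * PHI_MAX) * N_PIX), N_PIX - 1)
        images[0 if charge != 0 else 1][i][j] += pt
    # Normalization: divide by the largest intensity found in the two 50x50 images
    i_max = max(v for img in images for row in img for v in row)
    scale = 1.0 / i_max if i_max > 0 else 1.0
    # Cropping: keep the central 32x32 pixels
    lo = (N_PIX - N_CROP) // 2
    hi = lo + N_CROP
    return [[[v * scale for v in row[lo:hi]] for row in img[lo:hi]]
            for img in images]
```

With a pixel width of 0.1 in $\eta$ and $2\pi/50$ in $\phi$ , the central 32 pixels span $|\eta|\leq 1.6$ and $|\phi|\leq 2.01$ , which matches the cropped range quoted in the last step.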

The final jet image has dimension $2\times 32\times 32$ and is comprised of one charged particle channel with dimension $1\times 32\times 32$ and a neutral particle channel with dimension $1\times 32\times 32$ . This pre-processed jet-image is the input to the DNN. We note that Figs. 7 and 8 showed the combined $1\times 50\times 50$ jet-image obtained by adding the neutral and charged particle layers. The black dotted rectangular area in those figures showed the restricted $1\times 32\times 32$ pixel area.

5.2 DNN architecture

Our DNN architecture consists of three sub-architectures, which will merge later, as illustrated in Fig. 9. Combined deep learning (DL) is not yet very common,$^{8}$ but recently there have been several studies in particle physics Lin:2018cin , as well as in other areas 7821017 ; 7477589 , which showed improved results over simple DL.

$^{8}$ Combined DL is similar to ensemble learning Dietterich:2000:EMM:648054.743935 .

In this subsection, we provide some details of our DNN layer architecture as follows:

0. Initialization. Since DNN has a lot of parameters, it is important to give non-biased initial values for the (weight, bias) before running DNN with all input data. We use the He uniform initialization method as in Ref. He:2015dtg , among several other algorithms for parameter initialization pmlr-v9-glorot10a ; DBLP:journals/corr/SaxeMG13 ; He:2015dtg .

1. Jet images. They are represented by the top panel in Fig. 9.

   (a) Input data: we use pre-processed jet images as inputs.

   (b) Convolutional neural networks layers: we use three layers of convolutional neural networks (CNN). Each layer has a $32\times 2\times 2$ filter with no stride and no padding. We proceed with the batch normalization process after filtering DBLP:journals/corr/IoffeS15 , using the ReLU function as our activation function pmlr-v15-glorot11a . After activation, we introduce the max pooling layer which has a $2\times 2$ shape with $2\times 2$ strides and padding.

   (c) Dense layers: we feed the output of the CNN into two fully connected $1\times 64$ dense layers, using ReLU as the activation function.

2. The 6 high level variables. Those are illustrated by the middle panel in Fig. 9.

   (a) Input data: $\sqrt{\hat{s}}_{min}^{(bb\ell\ell)}$ , $\sqrt{\hat{s}}_{min}^{(\ell\ell)}$ , $M_{T2}^{(b)}$ , $M_{T2}^{(\ell)}$ , Higgsness and Topness.

   (b) Dense layers: we introduce four fully connected $1\times 64$ dense layers with the ReLU activation function. All four layers have the batch normalization process before activation.

3. The 10 low level variables. Those are illustrated by the bottom panel in Fig. 9.

   (a) Input data: $p_{T\ell_{1}}$ , $p_{T\ell_{2}}$ , ${\;/\!\!\!\!{P}_{T}}$ , $m_{\ell\ell}$ , $m_{bb}$ , $\Delta R_{\ell\ell}$ , $\Delta R_{bb}$ , $p_{Tbb}$ , $p_{T\ell\ell}$ , $\Delta\phi_{bb,\ell\ell}$ .

   (b) Dense layers: we follow the same procedure as in the case with the 6 high level variables above.

4. Combination.

   (a) Merge: we apply three single ( $1\times 1$ ) dense layers to the jet image, the 6 high level variables and the 10 low level variables. These layers are denoted as $\alpha$ , $\beta$ , and $\gamma$ , respectively, as shown in Fig. 9. To merge the three sub-architectures, we introduce the final dense layer of dimension $1\times 3$ without an activation function.

   (b) Final output: to distinguish signal from backgrounds, we apply a layer of dimension $1\times 2$ without an activation function.
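To get a feeling for the size of the convolutional branch described above, one can trace the image dimensions through the three CNN blocks. The sketch below rests on several assumptions that are not spelled out in the text: "no stride" is read as unit stride, a pooling layer is taken to follow each of the three convolutional layers, and padded pooling is taken to round odd sizes up. The helper names are hypothetical.

```python
def conv_out(n, kernel=2, stride=1):
    """Output size along one axis of an unpadded convolution."""
    return (n - kernel) // stride + 1

def pool_out(n, stride=2):
    """Output size along one axis of padded max pooling (ceiling division)."""
    return -(-n // stride)

def cnn_feature_shape(size=32, n_filters=32, n_layers=3):
    """Trace a size x size image through n_layers blocks of (2x2 conv, 2x2 pool)."""
    for _ in range(n_layers):
        size = pool_out(conv_out(size))
    return (n_filters, size, size)

shape = cnn_feature_shape()                 # feature-map shape fed to the dense layers
n_features = shape[0] * shape[1] * shape[2]
```

Under these assumptions the image side evolves as 32, 31, 16, 15, 8, 7, 4, so the two dense layers of the jet image branch would receive $32\times 4\times 4$ feature maps.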

5.3 DNN training

We now proceed with deep learning on the DNN architecture described in Sec. 5.2, using the pre-processed input data. We use Microsoft CNTK cntk as the main DNN library on GPU with an Nvidia CUDA platform. We use the Adam optimizer DBLP:journals/corr/KingmaB14 with the softmax cross-entropy loss function and the classification error as the evaluation function. The sizes of the training data set and the testing data set are about 40k and 17k, respectively. The mini-batch size is 128 and the number of epochs is 30.

For each event, we prepare the jet images and the 16 variables. The dimension of the final output is $1\times 2$ , ( $\mathcal{P}_{\textrm{sig}}$ , $\mathcal{P}_{\textrm{bknd}}=1-\mathcal{P}_{\textrm{sig}}$ ). If the deep learning score is equal to 1, i.e., $\mathcal{P}_{\textrm{sig}}=1$ ( $\mathcal{P}_{\textrm{bknd}}=0$ ), the corresponding event is taken to be a signal event. If $\mathcal{P}_{\textrm{sig}}=0$ ( $\mathcal{P}_{\textrm{bknd}}=1$ ), the event is considered to be background.
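Since the final $1\times 2$ layer carries no activation function and the training uses a softmax cross-entropy loss, the pair ( $\mathcal{P}_{\textrm{sig}}$ , $\mathcal{P}_{\textrm{bknd}}$ ) is presumably obtained by applying a softmax to the two raw outputs. The following minimal sketch shows that step and the per-event loss; it is written in plain Python rather than CNTK, and the function names are hypothetical.

```python
import math

def softmax2(z_sig, z_bknd):
    """Convert two raw network outputs into (P_sig, P_bknd) with P_sig + P_bknd = 1."""
    m = max(z_sig, z_bknd)              # subtract the maximum for numerical stability
    e_sig = math.exp(z_sig - m)
    e_bknd = math.exp(z_bknd - m)
    total = e_sig + e_bknd
    return e_sig / total, e_bknd / total

def cross_entropy(p_sig, is_signal):
    """Per-event cross-entropy loss for the true label."""
    p_true = p_sig if is_signal else 1.0 - p_sig
    return -math.log(max(p_true, 1e-12))  # clip to avoid log(0)

p_sig, p_bknd = softmax2(2.0, 0.0)      # hypothetical raw outputs for one event
loss = cross_entropy(p_sig, is_signal=True)
```

The value `p_sig` plays the role of the deep learning score used in the following section.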

6 Results

In this section we present our results. First we validate our framework by repeating the analysis performed in Ref. Kim:2018cxf under similar assumptions.$^{9}$ We obtained consistent results for the conventional cut-and-count method with Delphes detector simulation. When we added deep learning, the signal significance improved slightly by 5-10%.

$^{9}$ Our current analysis has several notable improvements over the one carried out in Ref. Kim:2018cxf . First, the detector simulation is different: in the current study, we use Delphes, which assumes (on average) $\sim 90$ % ( $\sim 80$ %) reconstruction efficiency for leptons ( $b$ -jets), while Ref. Kim:2018cxf assumed 100% reconstruction efficiency for both. In addition, the Delphes detector resolution itself is slightly different from the one used in Ref. Kim:2018cxf . In particular we find that the resolution of the missing transverse momentum is worse in Delphes and hence our current results are more conservative (if not more realistic). Finally, as mentioned earlier, we are now including $tW+j$ production, which turned out to be the next dominant background, yet was missing from all previous studies. These effects should be kept in mind when comparing our results here to previous results in the literature.

Now considering all relevant backgrounds and using all 16 variables and jet images, we show the deep learning score for the signal and the individual background processes in Fig. 10. The signal should peak near $\mathcal{P}_{\textrm{sig}}=1$ by construction, and indeed this is what is observed in the figure. Note that the $t\bar{t}$ and $tW$ processes are well separated from the signal and both peak near $\mathcal{P}_{\textrm{sig}}=0$ . This is a direct consequence of the improvements made in our analysis, namely the introduction of the proper kinematic variables and jet images, which were meant to target the dominant background ( $t\bar{t}$ production), as evidenced in Figs. 1, 3 and 8. Although the subdominant backgrounds are also reduced in this process, they remain rather flat in Fig. 10.

The deep learning score shown in Fig. 10 can now be used as a signal-to-background discriminator. By placing a lower cut and counting the number of surviving signal and background events, one obtains the efficiency curve (also known as a receiver operating characteristic (ROC)) shown in the left panel of Fig. 11. The curve contains several independent runs of deep learning and shows the signal efficiency ( $\epsilon_{\rm signal}$ ) versus the fraction of rejected background events, i.e., $1-\epsilon_{\rm bknd}$ , where $\epsilon_{\rm bknd}$ is the background efficiency. The efficiency corresponding to the results in Fig. 10 is shown with the red solid curve labeled “with jetimage DNN”. The other two solid lines show the efficiencies which would be obtained if we were to remove the jet images from the analysis: the purple solid curve (labeled “10var only DNN”) is obtained with the help of the 10 low-level kinematic variables, while the blue solid curve (labeled “16var only DNN”) shows the improvement when we add the 6 high-level variables and use the full set of 16 variables from Section 3, but still without jet images. The black dotted curve (labeled “jet image only DNN”) shows the result when we use jet images alone, with no help from any of the 16 kinematic variables. Finally, the blue dashed line (labeled “10var with jetimage DNN”) shows the result from an analysis combining jet images with the 10 low-level kinematic variables only. The corresponding signal significances are shown as a function of the number of events in the right panel of Fig. 11. Note that the right panel contains an additional curve (the purple dashed line labeled “10var only BDT”) where we use the 10 low-level variables and adopt a BDT algorithm using the TMVA tool kit Hocker:2007ht . The comparison of the latter line against the “10var only DNN” result (purple solid line) reveals the relative performance of DNN versus BDT.
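The construction of such an efficiency curve from the deep learning scores can be sketched as follows. The function name and the toy scores are hypothetical and only illustrate the bookkeeping: for each lower cut on the score, one counts the surviving fraction of signal and background events.

```python
def roc_points(sig_scores, bkg_scores, thresholds):
    """Return (eps_signal, 1 - eps_bknd) for each lower cut on the score."""
    points = []
    for cut in thresholds:
        eps_sig = sum(s >= cut for s in sig_scores) / len(sig_scores)
        eps_bkg = sum(b >= cut for b in bkg_scores) / len(bkg_scores)
        points.append((eps_sig, 1.0 - eps_bkg))
    return points

# Toy scores for illustration only (not taken from the analysis)
toy_sig = [0.90, 0.80, 0.40, 0.95]
toy_bkg = [0.10, 0.20, 0.70, 0.05]
curve = roc_points(toy_sig, toy_bkg, [0.0, 0.5, 1.0])
```

A tighter cut moves along the curve towards lower signal efficiency and higher background rejection; the cut is then chosen so that a fixed number of signal events survives.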

In order to examine the effects of pile-up, we use two methods as follows. In the first method, we use the Soft Drop algorithm Larkoski:2014wba to remove soft jet activity which is exacerbated by pile-up. We set $\beta=0$ and $z_{\textrm{cut}}=0.1$ with $R=1.2$ anti- $k_{T}$ clustered fatjets. Then we select the closest fatjet to the $b\bar{b}$ momentum in the $\eta$ - $\phi$ plane and replace the particle flow data with the charged and neutral jet constituents of the selected fatjet. Soft Drop does not affect the jet images, which retain the same shapes as in Fig. 8. In the second method, we remove the neutral jet image layer in the analysis. Unlike charged particles, which can be cleaned up from pile-up relatively easily by checking the longitudinal vertex information Bertolini:2014bba , neutral particles cannot be treated the same way and suffer from non-removable pile-up effects. The corresponding results with these two pile-up mitigation methods are also shown in Fig. 11 with the red dotted line labeled “16var with jetimage DNN, SoftDrop” and the red dashed line labeled “16var with jetimage DNN, no neutral layer”, respectively.
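For orientation, the Soft Drop criterion of Larkoski:2014wba applied at each declustering step is $\min(p_{T1},p_{T2})/(p_{T1}+p_{T2})>z_{\textrm{cut}}\,(\Delta R_{12}/R_{0})^{\beta}$ ; if it fails, the softer branch is dropped and the procedure continues with the harder one. The sketch below encodes only this single test with the parameter values quoted above as defaults. It is not a full grooming implementation, and the function name is hypothetical.

```python
def passes_soft_drop(pt1, pt2, delta_r, z_cut=0.1, beta=0.0, r0=1.2):
    """Test the Soft Drop condition for two subjets of a declustered jet.

    pt1, pt2 : transverse momenta of the two subjets (GeV)
    delta_r  : their separation in the (eta, phi) plane
    """
    z = min(pt1, pt2) / (pt1 + pt2)          # momentum fraction of the softer subjet
    return z > z_cut * (delta_r / r0) ** beta

balanced = passes_soft_drop(80.0, 20.0, 0.6)    # z = 0.20, kept
asymmetric = passes_soft_drop(95.0, 5.0, 0.6)   # z = 0.05, softer branch dropped
```

For $\beta=0$ the angular factor equals one, so the test reduces to a pure cut on the momentum fraction, $z>z_{\textrm{cut}}$ .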

We also examine the performance of the DNN with four-momentum information as input. The corresponding results are shown in Fig. 11, where the green-dashed (green-solid) curve represents the significance with four-momentum information only (four-momentum information plus jet images). The inputs are 18 real numbers, i.e., the four-momenta of the two leptons and the two $b$ -tagged jets ( $4\times 4=16$ components) and the two components of the missing transverse momentum. For this exercise, we use a 4 $\times$ 128 dense layer instead of a 4 $\times$ 64 dense layer. We notice that the DNN performance with kinematic variables is better. This is because, in general, the use of four-momenta requires a large training sample in order to be effective, while the kinematic variables already perform efficiently with a smaller data set. If the architecture is deep enough with a large amount of data, the DNN performance with four-momentum information would be comparable to (or better than) that with kinematic variables only. This exercise illustrates the importance of the appropriate use of kinematic variables.

In summary, Fig. 11 demonstrates that jet images (which capture the effects of color flow) can improve performance over the baseline selection cuts. At the same time, DL with jet image substructure alone does not show the best performance, and becomes fully effective (and still stable under pile-up) only when it is combined with the full set of 16 variables, including the high-level ones.

Table 1 summarizes the signal and background cross sections in fb at different stages of the analysis for the case of $N=20$ signal events. The last two columns show the signal significance $\sigma$ and the signal-to-background ratio $S/B$ . The significance is calculated using the log-likelihood ratio for a luminosity of 3 $\rm{ab}^{-1}$ at the 14 TeV LHC.
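As a hedged illustration of how such a significance is obtained from cross sections, the sketch below assumes the widely used closed form of the Poisson log-likelihood ratio, $\sigma=\sqrt{2\big[(S+B)\ln(1+S/B)-S\big]}$ , and converts cross sections in fb into event counts at 3 $\rm{ab}^{-1}$ . The function names and the background count in the example are hypothetical and are not taken from Table 1.

```python
import math

def expected_events(xsec_fb, lumi_ab=3.0):
    """Expected number of events for a cross section in fb and a luminosity in ab^-1."""
    return xsec_fb * lumi_ab * 1000.0      # 1 ab^-1 = 1000 fb^-1

def llr_significance(s, b):
    """Significance from the Poisson log-likelihood ratio for s signal, b background events."""
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

# N = 20 signal events at 3 ab^-1 correspond to a signal cross section of 20/3000 fb
n_sig = expected_events(20.0 / 3000.0)
z = llr_significance(n_sig, 100.0)         # hypothetical background of 100 events
```

For $S\ll B$ the expression reduces to the familiar $S/\sqrt{B}$ .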

In order to understand the correlation between jet images and the 16 kinematic variables, we performed two independent runs with “jet images only; no kinematic variables” and “16 kinematic variables; no jet images”. The corresponding results are shown in Fig. 12. Since the two DLs are trained separately, both the $x$ -axis and the $y$ -axis are normalized to unity. As expected, Fig. 12 reveals a degree of correlation between the jet images and the 16 kinematic variables, which is somewhat stronger for the signal and less so for the background.

In our main analysis, we performed simultaneous runs as shown in the deep learning architecture in Fig. 9. Before calculating our final deep learning score, we obtain three intermediate values, $\alpha$ , $\beta$ , and $\gamma$ , which represent the DL scores for the respective substructure corresponding to the jet images, the 6 high level variables and the 10 low level variables. The first 6 panels in Fig. 13 show the pair-wise correlations between these three intermediate scores for the signal (top row) and the background (middle row). The bottom three panels in the figure show the one-dimensional distributions of the intermediate scores for signal (blue histograms) and background (red histograms). We observe that the score from jet images ( $\alpha$ ) is relatively uncorrelated to the kinematic variables scores $\beta$ and $\gamma$ , which motivates the simultaneous training on jet images and kinematic variables together.

Finally, in Fig. 14 we scan over different values of the triple Higgs coupling $\kappa_{3}$ and show the discovery significance (left panel) and precision (middle panel) as a function of $\kappa_{3}$ . Both the significance $\sigma$ and the precision $\Delta\chi^{2}$ are calculated fixing DL cuts that would give a certain number of signal events ( $N=15,20,25,30$ ) for the SM at $\kappa_{3}=1$ (marked with the dotted vertical line). For the significance, we used the log-likelihood-ratio

[TABLE]

where $S$ and $B$ are the expected number of signal and background events, respectively. We define $\Delta\chi^{2}$ as

[TABLE]

The shape of the significance roughly follows the ratio of the cross section at $\kappa_{3}\neq 1$ to the cross section at $\kappa_{3}=1$ . This is illustrated in the rightmost panel of Fig. 14, which shows the cross section scaled as ${\sigma(\kappa_{3})}/{{\rm min}\big{(}\sigma(\kappa_{3})\big{)}}$ , i.e., normalized with respect to the minimum cross section for each curve. The blue curve represents the double Higgs production cross section before cuts, and in this case we find the minimum of the cross section somewhere between $\kappa_{3}=2$ and $\kappa_{3}=3$ . After baseline cuts (the red solid line), the minimum shifts to around $\kappa_{3}\sim 4$ , and after DL cuts (the green solid line), the minimum shifts even further out to around $\kappa_{3}\sim 5$ . In the latter case, we observe that the signal cross sections for $\kappa_{3}=1$ and $\kappa_{3}=8$ are numerically very close, as indicated by the two vertical dotted lines in the right panel. This provides an explanation for the double dip structure seen in the middle panel of Fig. 14.

As demonstrated in the rightmost panel of Fig. 14, the analysis cuts modify the signal cross section so that the location of its minimum shifts to higher values of $\kappa_{3}$ . This can be understood as follows. At leading order, the Higgs pair production cross section is given by

[TABLE]

before convoluting with the parton distribution functions Glover:1987nx ; Borowka:2016ypz . Here $F_{1}$ represents a parity-even triangle and box diagram contribution, while $F_{2}$ is a parity-odd box diagram contribution. Now $F_{1}$ can be rewritten as $F_{1}=\kappa_{3}F_{\triangle}+F_{\Box}$ , where $F_{\triangle}$ is the triangle diagram contribution and $F_{\Box}$ is the box diagram contribution. Therefore the cross section can be parameterized as a quadratic function of $\kappa_{3}$ , where the $c$ coefficients are related to contributions from $\triangle$ and $\Box$ diagrams.

The observation that the baseline cuts and the DL cut shift the minimum cross section to a larger $\kappa_{3}$ value implies that the effects of the cuts are stronger on $c_{\triangle}$ than $c_{\triangle,\Box}$ . In other words, our cuts are more likely to affect the triangle diagram which contains the triple Higgs coupling. Unlike the box diagram, the triangle diagram includes an off-shell Higgs in the $s$ -channel. Since it is harder to produce a Higgs pair from an $s$ -channel off-shell Higgs, the Higgs pair generated from the triangle diagram is not as energetic as the one coming from the box diagram, and will therefore tend to have lower transverse momentum. As discussed in Section 4, several of the cuts on our kinematic variables, namely, $\Delta R_{bb}$ , $\Delta R_{\ell\ell}$ , $\Delta\phi_{bb,\ell\ell}$ , $p_{Tbb}$ and $p_{T\ell\ell}$ , rely on the fact that the Higgs bosons are produced with a significant boost. Consequently, the effect of the cuts will be to suppress the $c_{\triangle}$ term and enhance the box diagram contribution, which in turn shifts the location of the minimum to a larger value of $\kappa_{3}$ .
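The shift of the minimum can be made explicit with a short worked equation. We assume here that the quadratic parameterization takes the form below, where $c_{\Box}$ is our label for the $\kappa_{3}$ -independent coefficient (an assumption, since only $c_{\triangle}$ and $c_{\triangle,\Box}$ are named in the text):

```latex
\sigma(\kappa_3) = c_{\triangle}\,\kappa_3^{2} + c_{\triangle,\Box}\,\kappa_3 + c_{\Box},
\qquad
\frac{\textrm{d}\sigma}{\textrm{d}\kappa_3} = 0
\;\;\Longrightarrow\;\;
\kappa_3^{\rm min} = -\frac{c_{\triangle,\Box}}{2\,c_{\triangle}} .
```

Since the minimum lies at positive $\kappa_{3}$ and $c_{\triangle}>0$ , the interference coefficient $c_{\triangle,\Box}$ must be negative. If the cuts reduce $c_{\triangle}$ by a larger factor than $|c_{\triangle,\Box}|$ , the ratio $|c_{\triangle,\Box}|/c_{\triangle}$ grows and $\kappa_{3}^{\rm min}$ moves to larger values. In addition, a parabola is symmetric about its minimum, so two values of $\kappa_{3}$ with equal cross sections are equidistant from $\kappa_{3}^{\rm min}$ . The near equality of the cross sections at $\kappa_{3}=1$ and $\kappa_{3}=8$ after the DL cut would then correspond to $\kappa_{3}^{\rm min}\approx(1+8)/2=4.5$ , in line with the minimum around $\kappa_{3}\sim 5$ seen in Fig. 14.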

Note that the results for the significance and the precision in Fig. 14 do not change dramatically when we require a different number of signal events at the SM point. This means that the dependence on the DL cut is relatively mild, since the kinematics remains similar when we vary $\kappa_{3}$ , so that the dependence on the cross section is more important. This is illustrated in Fig. 15, which shows the cross section (in pb) after cutting on the DL score, $\sigma_{DL}(pp\to hh\to bb\ell^{+}\ell^{-}\nu\bar{\nu})$ , (left panel) and the ratio $\sigma_{DL}/\sigma_{\rm baseline}$ between the cross section $\sigma_{DL}$ after the DL cut and the cross section $\sigma_{\rm baseline}$ after baseline cuts (right panel).

7 Discussion

In this paper, we investigated double Higgs production in the $hh\to bbWW^{*}\to bb\ell\ell+{\;/\!\!\!\!\vec{P}_{T}}$ final state. It is known to be one of the difficult channels due to the large backgrounds, ${\sigma_{\rm bknd}}/{\sigma_{\rm hh}}\sim 10^{5}$ . We performed a detailed analysis by adopting a deep learning framework and successfully combining new kinematic variables and jet image information. As a result, we obtained a sizable increase in signal sensitivity and an improved signal-to-background ratio compared to the existing analyses.

Our results showed that the dominant $t\bar{t}$ background can be brought down to the level of the other remaining backgrounds, without sacrificing too much in the signal rate. This is mostly due to the use of Higgsness, Topness and the subsystem variable $M_{T2}^{(b)}$ . Other backgrounds like $bb\tau\tau$ can be reduced further by the use of $M_{T2}^{(\ell)}$ . Finally, additional improvements are possible with the use of jet images. After all cuts, we find that all backgrounds contribute at similar levels.

Recent CMS and ATLAS analyses with 36 fb$^{-1}$ of LHC data at 13 TeV report a 95% confidence level observed (expected) upper limit on the production cross section of 22.2 (12.8) times the Standard Model value Sirunyan:2018two for CMS and 6.7 (10.4) times the Standard Model value ATLAS:2018otd for ATLAS. The leading channel in CMS is $bb\gamma\gamma$ followed by $bb\tau\tau$, while the leading channels in ATLAS are $bb\tau\tau$ and $bbbb$, followed by $bb\gamma\gamma$. The main difference arises due to the superior $b$-tagging efficiency for the ATLAS detector Aaboud:2018knk. In both studies, the $bbWW^{*}$ channel was largely overlooked due to the expected poor significance. However, our study suggests that double Higgs production may be probed in the dilepton $bbWW^{*}$ channel as well, and that this channel would contribute to the combined analysis on par with the other final states, increasing the overall significance. For example, in Ref. ATLAS:2018combi, the ATLAS collaboration showed that the combined significance of $hh\to bbbb$, $hh\to bb\tau\tau$, and $hh\to bb\gamma\gamma$ is 3.5 (3.0) without (with) systematic uncertainties at the 14 TeV LHC with 3 ab$^{-1}$. The individual significances are 1.4, 2.5 and 2.1 (0.61, 2.1 and 2.0), respectively, without (with) systematics. The $hh\to bbWW^{*}$ channel was not included in that combination, but a naive estimate shows that including our channel would raise the combined significance to about 3.7.
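
The arithmetic behind such a naive estimate can be sketched as follows. The sketch assumes that the channels are statistically independent and that their significances add in quadrature, which is a simplification and not the likelihood-based procedure used in the ATLAS combination. The function name `combine_in_quadrature` is introduced here for illustration only.

```python
import math

def combine_in_quadrature(significances):
    """Naive combination: treat channels as independent, add in quadrature."""
    return math.sqrt(sum(z * z for z in significances))

# Individual ATLAS projections quoted in the text: bbbb, bbtautau, bbgammagamma.
no_syst = [1.4, 2.5, 2.1]
with_syst = [0.61, 2.1, 2.0]

z_no_syst = combine_in_quadrature(no_syst)      # compare with the quoted 3.5
z_with_syst = combine_in_quadrature(with_syst)  # compare with the quoted 3.0

# Significance that one extra independent channel would need in order to lift
# 3.5 to 3.7 under the same assumption (an inference, not a quoted number).
z_extra = math.sqrt(3.7 ** 2 - 3.5 ** 2)

print(f"{z_no_syst:.2f} {z_with_syst:.2f} {z_extra:.2f}")
```

The quadrature sums land close to the quoted combined values, which suggests that the approximation is adequate for a rough estimate of what an additional channel adds.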

We urge the experimental collaborations to consider the ideas presented in this paper and test them in the LHC data. We would also like to mention that the proposed method can be easily generalized to the semi-leptonic channel from $hh\to bbWW^{*}$ production, as well as to other processes with similar final states.

Acknowledgments

This work is supported in part by the United States Department of Energy (DE-SC0010296, DE-SC0017988, DE-SC0017965, DE-SC0019474) and Korea NRF-2018R1C1B6006572. We thank KIAS for providing computing resources. We thank Georgios Anagnostou for valuable comments and Anja Butter, Tilman Plehn and Chris Rogan for general discussion on machine learning. KK is grateful to the Mainz Institute for Theoretical Physics, which is part of the DFG Cluster of Excellence PRIMA+ (Project ID 39083149), for its hospitality and its partial support during the completion of this work.

Appendix A Deep Neural Network

The artificial neural network (ANN) is one of the most popular approaches to pattern recognition in machine learning. The structure of an ANN is defined by a succession of non-linear and linear transformations between nodes or artificial neurons, which are located on input, output or hidden layers. A hidden layer formed from an ordinary one-dimensional array of neurons is called a dense layer or a fully connected layer.

The linear operation is defined by weights and a bias:

$$O[i]=\sum_{j=0}^{n_{I}-1}\mathcal{W}[i,j]\,I[j]+\mathcal{B}[i],$$

where $I[j]$ is the value of the $j$-th neuron (input) in the prior layer, $O[i]$ the value of the $i$-th neuron (output) in the subsequent layer, $\mathcal{W}[i,j]$ are the weights, and $\mathcal{B}[i]$ the bias. The index $i$ ($j$) takes the values $0,\cdots,n_{O}-1$ ($0,\cdots,n_{I}-1$) and $n_{O}$ ($n_{I}$) is the dimension of the output (input). The input may initially have more than one dimension. For example, if the input results from a convolution and has dimension $n\times n$, it may be rearranged as follows:

[TABLE]

where the corresponding dimension of the input would be $n_{I}=n^{2}$ .
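
A minimal sketch of the dense layer described above, assuming the standard affine form $O[i]=\sum_{j}\mathcal{W}[i,j]\,I[j]+\mathcal{B}[i]$ and a row-by-row rearrangement of the two-dimensional input. The row-major ordering and the function names `flatten` and `dense` are assumptions made for illustration.

```python
def flatten(image):
    """Rearrange an n x n array into a list of length n*n, row by row."""
    return [value for row in image for value in row]

def dense(inputs, weights, bias):
    """Compute O[i] = sum_j W[i][j] * I[j] + B[i] for every output neuron i."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, inputs)) + b_i
            for row, b_i in zip(weights, bias)]

image = [[1.0, 2.0],
         [3.0, 4.0]]                   # n = 2, so n_I = n^2 = 4
weights = [[1.0, 0.0, 0.0, 1.0],       # n_O = 2 rows, n_I = 4 columns
           [0.5, 0.5, 0.5, 0.5]]
bias = [0.0, -1.0]

outputs = dense(flatten(image), weights, bias)
print(outputs)
```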

The non-linear transformation is often called the activation function, which imitates the action potential of biological neurons. Similar to how each neuron adjusts how much signal it needs to deliver to the next neuron using an electric action potential, the activation function determines the output of a particular neuron for a set of given inputs from neurons on the previous layer, and the output is then used as input for the next artificial neuron. The commonly used activation functions are

[TABLE]

where $x[i]$ represents the value of the $i$ -th neuron.
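
For illustration, three widely used activation functions are sketched below: the rectified linear unit, the sigmoid and the softmax. Whether these are exactly the functions listed in the text is an assumption. Each function acts on the list of neuron values $x[i]$.

```python
import math

def relu(x):
    """Rectified linear unit: max(0, x[i]) for each neuron."""
    return [max(0.0, v) for v in x]

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x[i])), mapping each value into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def softmax(x):
    """exp(x[i]) / sum_k exp(x[k]); the outputs are positive and sum to one."""
    shift = max(x)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

x = [-1.0, 0.0, 2.0]
print(relu(x), sigmoid(x), softmax(x))
```

The softmax is a natural choice for the output layer of a classifier because its outputs can be read as class probabilities.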

If the neural network has sufficiently many hidden layers, the network is called a deep neural network (DNN). A DNN can learn from the input data to obtain the desired output by adjusting the parameters in the hidden layers. We note that the proper normalization of the input data helps improve convergence during training. The goal of the training is to determine the parameters (weights and biases) by minimizing the loss, which represents the difference between the target output and the actual DNN output. There are various algorithms for optimization of the parameters DBLP:journals/corr/KingmaB14 ; Duchi:2011:ASM:1953048.2021068 ; rmsprop . Some well known loss functions are

[TABLE]

where $\{t[i]\}$ is the true answer (either 1 or 0 in our current study), $\{x[i]\}$ is the DNN final output, and $n$ is the number of neurons in the output layer.
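
Two common loss functions, the mean squared error and the categorical cross entropy, are sketched below in terms of the true answer $\{t[i]\}$ and the DNN output $\{x[i]\}$. Whether these match the text's list, and the $1/n$ normalization of the squared error, are assumptions. The small constant `eps` is a hypothetical guard against $\log 0$.

```python
import math

def mean_squared_error(t, x):
    """(1/n) * sum_i (t[i] - x[i])^2 over the n output neurons."""
    n = len(t)
    return sum((t_i - x_i) ** 2 for t_i, x_i in zip(t, x)) / n

def cross_entropy(t, x, eps=1e-12):
    """-sum_i t[i] * log(x[i]); eps protects against log(0)."""
    return -sum(t_i * math.log(max(x_i, eps)) for t_i, x_i in zip(t, x))

t = [1.0, 0.0]   # true answer: the event belongs to the first class
x = [0.8, 0.2]   # DNN output, e.g. after a softmax

mse = mean_squared_error(t, x)
ce = cross_entropy(t, x)
print(mse, ce)
```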

Instead of feeding the entire data set into the DNN all at once, one splits the input data into several randomly selected subsets and takes one subset, called a mini-batch, for a given iteration, which helps avoid the over-fitting problem pmlr-v40-Ge15 ; DBLP:journals/corr/abs-1804-07612 . One pass over the full training set is called an epoch, and one uses several epochs to obtain a well-trained DNN. When training a DNN with a mini-batch, the corresponding loss is defined by the sum of all losses over the mini-batch or by their average.
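
The mini-batch bookkeeping can be sketched as follows, under the assumption that the mini-batches within one epoch are disjoint and together cover the full training set. The helper names `mini_batches` and `batch_loss` are hypothetical, and the actual parameter update is omitted.

```python
import random

def mini_batches(samples, batch_size, rng):
    """Yield randomly selected, non-overlapping mini-batches for one epoch."""
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

def batch_loss(per_event_losses, average=True):
    """Loss of a mini-batch: the average (or the sum) of the per-event losses."""
    total = sum(per_event_losses)
    return total / len(per_event_losses) if average else total

rng = random.Random(0)          # fixed seed so the sketch is reproducible
samples = list(range(10))       # stand-in for ten training events
for epoch in range(3):          # several epochs, each a full pass over the data
    for batch in mini_batches(samples, batch_size=4, rng=rng):
        pass                    # a training step on this mini-batch would go here
```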

Once the training is over, one tests the DNN on a data set different from the one used in the training, so that any over-fitting to the training data does not bias the evaluation. In order to test the trained DNN model one can use either the loss function or the classification error function. If the number of test events is $n$, the classification error function is defined by

[TABLE]

where ArgMax($\{y[i]\}$) gives the position $i_{\textrm{max}}$ where the value of $\{y[i]\}$ is maximized, the index $j$ labels the $j$-th test event, and $\delta$ is the Kronecker delta.
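
A minimal sketch of the classification error, assuming that it is the fraction of test events for which the ArgMax of the DNN output differs from the ArgMax of the true answer, i.e. one minus the average of the Kronecker delta over the $n$ test events. The sample values are made up for illustration.

```python
def argmax(values):
    """Position i_max at which the list of values is maximized."""
    return max(range(len(values)), key=lambda i: values[i])

def classification_error(outputs, targets):
    """1 - (1/n) * sum_j delta(ArgMax(outputs[j]), ArgMax(targets[j]))."""
    n = len(outputs)
    correct = sum(1 for y, t in zip(outputs, targets) if argmax(y) == argmax(t))
    return 1.0 - correct / n

# Four hypothetical test events with two output neurons each.
outputs = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.2, 0.8]]
targets = [[1, 0], [1, 0], [1, 0], [0, 1]]

error = classification_error(outputs, targets)
print(error)
```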

Often one takes additional steps such as dropout for reducing over-fitting in neural networks Srivastava:2014:DSW:2627435.2670313 and batch normalization for improving the performance and stability of artificial neural networks DBLP:journals/corr/IoffeS15 . Dropout randomly drops units (both hidden and visible) in a neural network and is considered an efficient way of performing model averaging. The batch normalization procedure normalizes the input layer by adjusting and scaling the activations:

[TABLE]

where $\alpha$ labels the $\alpha$-th input in a mini-batch and $n$ is the size of the mini-batch. The dimensions of input and output are the same. Note that ($\gamma$, $\beta$) are parameters learned during the training and $\epsilon$ is a parameter added to avoid a divergence in the denominator. Batch normalization allows each layer of a network to learn more independently of the other layers.
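
The batch normalization step can be sketched for a single neuron as follows, assuming the standard form in which the mini-batch mean is subtracted, the result is divided by $\sqrt{\sigma^{2}+\epsilon}$, and the output is then scaled by $\gamma$ and shifted by $\beta$. The default values of `gamma`, `beta` and `eps` are placeholders; in training, $\gamma$ and $\beta$ are learned.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize the n values of one neuron across a mini-batch."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n
    # eps keeps the denominator finite when the variance vanishes
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

batch = [1.0, 2.0, 3.0, 4.0]   # values of one neuron for n = 4 inputs
normalized = batch_norm(batch)
print(normalized)
```

The output has the same length as the input, with approximately zero mean and unit variance before the scale and shift are applied.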

A convolutional neural network (CNN) is a class of DNN, most commonly used to analyze images. A CNN utilizes filters made of a set of neurons with a fixed size. The values of the parameters in each filter are learned during the training process. By varying the position of the filters on the input and learning the values of different filters, a CNN can find local features of the input data. This process is called convolution and a hidden layer which uses convolution is called a convolutional layer. With $n_{f}^{\prime}$ filters whose size is $(n_{fs}\times n_{fs})$, the convolution is defined as follows

[TABLE]

where the dimension of the input is $n_{f}\times(n\times n)$ and the dimension of the output is $n_{f}^{\prime}\times(n^{\prime}\times n^{\prime})$. The corresponding ranges of the parameters are $k=\{0,\cdots,n_{f}^{\prime}-1\}$, $\alpha=\{i,\cdots,i+n_{fs}\}$, $\beta=\{j,\cdots,j+n_{fs}\}$, $i,j=\{0,n_{s},2n_{s},\cdots,n^{\prime}\}$, $n^{\prime}=n/n_{s}-n_{fs}+n_{s}$, and $n_{s}$ is called the stride.
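
A minimal sketch of a convolutional layer without padding is given below. It assumes that each of the $n_{f}^{\prime}$ filters spans all $n_{f}$ input channels and carries one bias, and it uses the standard output size $\lfloor (n-n_{fs})/n_{s}\rfloor+1$, which coincides with the expression in the text for unit stride. The function name `conv2d` and the data layout are illustrative choices.

```python
def conv2d(inputs, filters, biases, stride=1):
    """inputs: n_f x n x n; filters: n_f' x n_f x n_fs x n_fs; one bias per filter."""
    n = len(inputs[0])
    n_fs = len(filters[0][0])
    n_out = (n - n_fs) // stride + 1
    output = []
    for kernel, bias in zip(filters, biases):        # loop over the n_f' filters
        feature_map = []
        for i in range(0, n_out * stride, stride):   # filter position (rows)
            row = []
            for j in range(0, n_out * stride, stride):
                acc = bias
                for c, channel in enumerate(inputs):  # sum over input channels
                    for a in range(n_fs):
                        for b in range(n_fs):
                            acc += kernel[c][a][b] * channel[i + a][j + b]
                row.append(acc)
            feature_map.append(row)
        output.append(feature_map)
    return output

inputs = [[[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0],
           [7.0, 8.0, 9.0]]]            # n_f = 1 channel, n = 3
filters = [[[[1.0, 0.0],
             [0.0, 1.0]]]]              # n_f' = 1 filter, n_fs = 2
biases = [0.0]

feature_maps = conv2d(inputs, filters, biases)
print(feature_maps)
```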

Since each filter has a finite size, the output size decreases after applying the convolution (32) on the input or on the output from a previous layer. In order to prevent the size reduction, a CNN incorporates the padding process:

[TABLE]

which increases the size of the original input by adding zeros around it. Usually the padding is used before applying convolution or pooling.
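
Zero padding of a single $n\times n$ input can be sketched as follows. The symmetric padding width `pad` is an assumed parameter; choosing it appropriately for the filter size keeps the output of a unit-stride convolution at the original size.

```python
def zero_pad(image, pad):
    """Surround an n x n array with `pad` rows and columns of zeros."""
    n = len(image)
    width = n + 2 * pad
    padded = [[0.0] * width for _ in range(pad)]                         # top rows
    padded += [[0.0] * pad + list(row) + [0.0] * pad for row in image]   # middle
    padded += [[0.0] * width for _ in range(pad)]                        # bottom rows
    return padded

image = [[1.0, 2.0],
         [3.0, 4.0]]
padded = zero_pad(image, pad=1)
for row in padded:
    print(row)
```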

A CNN may include local or global pooling layers (often called sub-sampling), which combine the output of several neurons at one layer into a single neuron in the next layer. For example, max (average) pooling takes the maximum (average) value from a set of neurons at the previous layer and passes it to the next layer. For a pooling dimension $n_{p}$, the relation between the output with dimension $n_{f}\times(n^{\prime}\times n^{\prime})$ and the input with dimension $n_{f}\times(n\times n)$ is given by

[TABLE]

where $\alpha=\{i,\cdots,i+n_{p}\}$ , $\beta=\{j,\cdots,j+n_{p}\}$ , $k=\{0,\cdots,n_{f}-1\}$ , $i,j=\{0,n_{s},2n_{s},\cdots,n^{\prime}\}$ , $n^{\prime}=n/n_{s}-n_{p}+n_{s}$ , and $n_{s}$ is the stride.
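
Max and average pooling can be sketched in the same style. The sketch assumes that the pooling acts on each of the $n_{f}$ channels separately and uses the standard output size $\lfloor (n-n_{p})/n_{s}\rfloor+1$. The function name `pool2d` and the `mode` argument are illustrative.

```python
def pool2d(inputs, n_p, stride, mode="max"):
    """inputs: n_f x n x n; returns n_f x n' x n' after pooling each channel."""
    reduce_window = max if mode == "max" else (lambda w: sum(w) / len(w))
    n = len(inputs[0])
    n_out = (n - n_p) // stride + 1
    output = []
    for channel in inputs:
        pooled = []
        for i in range(0, n_out * stride, stride):
            row = []
            for j in range(0, n_out * stride, stride):
                # collect the n_p x n_p window whose top-left corner is (i, j)
                window = [channel[i + a][j + b]
                          for a in range(n_p) for b in range(n_p)]
                row.append(reduce_window(window))
            pooled.append(row)
        output.append(pooled)
    return output

inputs = [[[1.0, 2.0, 3.0, 4.0],
           [5.0, 6.0, 7.0, 8.0],
           [9.0, 10.0, 11.0, 12.0],
           [13.0, 14.0, 15.0, 16.0]]]   # n_f = 1, n = 4

max_pooled = pool2d(inputs, n_p=2, stride=2, mode="max")
avg_pooled = pool2d(inputs, n_p=2, stride=2, mode="average")
print(max_pooled, avg_pooled)
```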

Another beneficial feature of a CNN is the reduction of the number of parameters via convolution and pooling, which effectively increases the learning speed in deep neural networks. A typical DNN architecture consists of a combination of convolutional layers and dense layers, which provides better performance compared to a NN with only one type of layer Krizhevsky:2012:ICD:2999134.2999257 .

Bibliography

(1) ATLAS collaboration, G. Aad et al., Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC, Phys. Lett. B 716 (2012) 1–29, [1207.7214].
(2) CMS collaboration, S. Chatrchyan et al., Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC, Phys. Lett. B 716 (2012) 30–61, [1207.7235].
(3) ATLAS, CMS collaboration, G. Aad et al., Measurements of the Higgs boson production and decay rates and constraints on its couplings from a combined ATLAS and CMS analysis of the LHC pp collision data at $\sqrt{s}=7$ and 8 TeV, JHEP 08 (2016) 045, [1606.02266].
(4) ATLAS collaboration, Study of the double Higgs production channel H($\to bb$)H($\to\gamma\gamma$) with the ATLAS experiment at the HL-LHC, ATL-PHYS-PUB-2017-001.
(5) ATLAS collaboration, Projected sensitivity to non-resonant Higgs boson pair production in the bbbb final state using proton proton collisions at HL-LHC with the ATLAS detector, ATL-PHYS-PUB-2016-024.
(6) J. H. Kim, Y. Sakaki and M. Son, Combined analysis of double Higgs production via gluon fusion at the HL-LHC in the effective field theory approach, Phys. Rev. D 98 (2018) 015016, [1801.06093].
(7) CMS collaboration, A. M. Sirunyan et al., Combination of searches for Higgs boson pair production in proton-proton collisions at $\sqrt{s}=13$ TeV, Phys. Rev. Lett. 122 (2019) 121803, [1811.09689].
(8) CMS collaboration, Higgs pair production at the High Luminosity LHC, CMS-PAS-FTR-15-002.