Automating the Construction of Jet Observables with Machine Learning

Kaustuv Datta; Andrew Larkoski; Benjamin Nachman

arXiv:1902.07180·hep-ph·November 20, 2019

TL;DR

This paper presents a machine learning-based method to automatically construct simple, analytic jet observables that fully describe M-body phase space, improving particle classification tasks in high-energy physics.

Contribution

It introduces a novel automated procedure for building jet observables that specify M-body phase space, enabling tailored and more effective particle tagging.

Findings

01

Validated on distinguishing H→bb̄ from g→bb̄ with M=3.

02

Designed observables for boosted Z' search with M=4.

03

Outperformed standard 2-prong tagging methods.

Tables

Table 1: Summary of parameters for the product observables for ungroomed $H\rightarrow b\overline{b}$ discrimination as proposed in Ref. Datta and Larkoski (2018) and as constructed via the procedures presented in this work (Figs. 2a and 2b).

Observable	$a$	$b$	$c$	$d$	$e$	AUC
$β_{3}$	2.0	0.0	0.0	0.5	-1.0	0.823
$β_{3, H \to b \bar{b}}^{ML}$	1.87	-0.02	-0.14	0.66	-0.98	0.823
${\hat{β}}_{3, H \to b \bar{b}}^{ML}$	-0.11	-0.58	0.09	-0.25	0.51	0.824

Table 2: Summary of parameters for $\beta_{4}^{\text{ML}}$ for ungroomed $Z^{\prime}$ vs. QCD discrimination at 3 mass points.

$m_{Z^{'}}$ [GeV]	$a$	$b$	$c$	$d$	$e$	$f$	$g$	$h$
50	2.72	-3.78	0.63	-2.77	1.54	0.20	2.36	-0.28
90	0.90	-2.87	0.18	-1.78	-0.72	1.79	2.48	-0.44
130	1.69	-2.98	0.75	-0.89	-0.38	0.77	1.37	0.30

Table 3: Summary of parameters for $\hat{\beta}_{4}^{\text{ML}}$ for ungroomed $Z^{\prime}$ vs. QCD discrimination at 3 mass points.

$m_{Z^{'}}$ [GeV]	$a$	$b$	$c$	$d$	$e$	$f$	$g$	$h$
50	1.06	-1.11	0.25	-0.56	0.43	-0.07	0.22	-0.01
90	1.02	-1.06	0.22	-0.27	0.15	0.00	0.18	0.02
130	-1.09	-0.43	0.25	-0.97	0.37	0.12	0.60	0.19

Table 4: Area under the ROC curve (AUC), from Fig. 5, of the standard observables and the $\beta_{4}^{\mathrm{ML}}$ observables, optimized for the corresponding signal, for ungroomed $Z^{\prime}$ vs. QCD discrimination at 3 $m_{Z^{\prime}}$ points. The ROC curves are calculated using the full datasets, with $\sim$ 500,000 events passing the mass cut for each value of $m_{Z^{\prime}}$.

$m_{Z^{'}}$ [GeV]	${\hat{β}}_{4}^{ML}$	$β_{4}^{ML}$	$N_{2}^{(1)}$	$D_{2}^{(1)}$	$τ_{2, 1}^{(1)}$
50	0.864	0.858	0.843	0.778	0.817
90	0.873	0.866	0.848	0.837	0.827
130	0.842	0.838	0.809	0.812	0.797

Table 5: Summary of parameters for the product observables for groomed $H\rightarrow b\overline{b}$ discrimination as proposed in Ref. Datta and Larkoski (2018) and as constructed via the procedure presented in this work (Fig. 7a).

Observable	$a$	$b$	$c$	$d$	$e$	AUC
$β_{3}^{(g)}$	-2.0	0.0	0.0	-2.0	2.0	0.745
$β_{3, H \to b \bar{b}}^{ML (g)}$	0.67	-1.65	0.01	-1.90	2.07	0.744
${\hat{β}}_{3, H \to b \bar{b}}^{ML (g)}$	-1.54	1.01	-0.17	-0.15	0.16	0.758

Table 6: Summary of parameters for $\beta_{4}^{\text{ML(g)}}$ for mMDT groomed $Z^{\prime}$ vs. QCD discrimination at 3 mass points.

$m_{Z^{'}}$ [GeV]	$a$	$b$	$c$	$d$	$e$	$f$	$g$	$h$
50	2.6	-0.41	-2.94	-2.79	0.20	0.93	-0.66	2.43
90	2.3	-1.35	-2.05	-1.64	-0.81	0.89	2.03	-0.44
130	0.80	-1.74	-0.28	-1.01	-0.38	0.56	0.82	0.69

Table 7: Summary of parameters for $\hat{\beta}_{4}^{\text{ML(g)}}$ for mMDT groomed $Z^{\prime}$ vs. QCD discrimination at 3 mass points.

$m_{Z^{'}}$ [GeV]	$a$	$b$	$c$	$d$	$e$	$f$	$g$	$h$
50	-0.35	0.35	0.56	1.05	-0.17	-0.24	-0.34	0.51
90	0.26	-0.41	-0.39	-0.68	-0.15	0.11	0.25	0.42
130	1.28	0.54	0.35	1.09	0.09	-0.38	-1.06	-0.48

Table 8: Area under the ROC curve (AUC), from Fig. 10, of standard observables and the $\beta_{4}^{\mathrm{ML(g)}}$ observables, optimized for the corresponding signal, for mMDT groomed $Z^{\prime}$ vs. QCD discrimination at 3 $m_{Z^{\prime}}$ points. The ROC curves are calculated using the full datasets, with $\sim$ 300,000 events passing the mass cut for each value of $m_{Z^{\prime}}$.

$m_{Z^{'}}$ [GeV]	${\hat{β}}_{4}^{ML(g)}$	$β_{4}^{ML(g)}$	$N_{2}^{(2)}$	$D_{2}^{(2)}$	$τ_{2, 1}^{(2)}$
50	0.830	0.826	0.796	0.803	0.780
90	0.822	0.821	0.780	0.796	0.763
130	0.814	0.811	0.769	0.791	0.751

Equations

$$\left\{\tau_{1}^{(0.5)},\tau_{1}^{(1)},\tau_{1}^{(2)},\ldots,\tau_{M-2}^{(0.5)},\tau_{M-2}^{(1)},\tau_{M-2}^{(2)},\tau_{M-1}^{(1)},\tau_{M-1}^{(2)}\right\},$$

$$\tau_{N}^{(\beta)}=\frac{1}{\sum_{i\in\text{jet}}p_{T,i}R^{\beta}}\sum_{i\in\text{jet}}p_{T,i}\min_{\text{axes }j}\left(\Delta R_{j,i}\right)^{\beta},$$

$$\beta_{M}^{\text{ML}}=\left(\tau_{1}^{(0.5)}\right)^{a}\left(\tau_{1}^{(1)}\right)^{b}\left(\tau_{1}^{(2)}\right)^{c}\left(\tau_{2}^{(1)}\right)^{d}\cdots.$$

Full text

Kaustuv Datta

Department of Physics, ETH Zürich, 8093 Zürich, Switzerland

Andrew Larkoski

Physics Department, Reed College, Portland, OR 97202, USA

Benjamin Nachman

Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA

Abstract

Machine-learning assisted jet substructure tagging techniques have the potential to significantly improve searches for new particles and Standard Model measurements in hadronic final states. Techniques with simple analytic forms are particularly useful for establishing robustness and gaining physical insight. We introduce a procedure to automate the construction of a large class of observables that are chosen to completely specify $M$ -body phase space. The procedure is validated on the task of distinguishing $H\rightarrow b\bar{b}$ from $g\rightarrow b\bar{b}$ , where $M=3$ and previous brute-force approaches to construct an optimal product observable for the $M$ -body phase space have established the baseline performance. We then use the new method to design tailored observables for the boosted $Z^{\prime}$ search, where $M=4$ and brute-force methods are intractable. The new classifiers outperform standard $2$ -prong tagging observables, illustrating the power of the new optimization method for improving searches and measurements at the LHC and beyond.

I Introduction

Effective identification of hadronic decays of boosted heavy particles like the top quark or $W$ , $Z$ and Higgs ( $H$ ) bosons is essential for analyses at the Large Hadron Collider (LHC). Jet substructure observables that identify specific discriminating information in the radiation pattern of jets originating from different particles are now necessary, both in the search for new physics and precision Standard Model (SM) measurements. As a result, there is an extensive literature developing observables and techniques for identifying boosted topologies to increase the efficacy of LHC analyses probing extreme regions of phase space Larkoski et al. (2017); Asquith et al. (2018).

Modern machine learning (ML) methods have emerged as useful tools for automating the creation of optimal observables for classification. These methods are particularly powerful for high-dimensional, low-level inputs such as fixed-length sets of four-vectors Pearkes et al. (2017), variable-length sets of four-vectors Komiske et al. (2018a), physics-inspired bases Komiske et al. (2018b); Erdmann et al. (2018); Datta and Larkoski (2017, 2018); Butter et al. (2018), images Cogan et al. (2015); Almeida et al. (2015); de Oliveira et al. (2016); Komiske et al. (2017); Barnard et al. (2017); Kasieczka et al. (2017); Dreyer et al. (2018); Lin et al. (2018); Fraser and Schwartz (2018); Chien and Kunnawalkam Elayavalli (2018); Macaluso and Shih (2018), sequences Guest et al. (2016); Egan et al. (2017); Andreassen et al. (2019); Fraser and Schwartz (2018), trees Cheng (2018); Louppe et al. (2019), and graphs Henrion et al. (2017). Some deep learning-based tagging schemes have already been demonstrated using collider data as well as with full detector simulations for top quark tagging Aaboud et al. (2018); CMS Collaboration (2017a), boson tagging Aaboud et al. (2018); CMS Collaboration (2018a), quark/gluon tagging ATLAS Collaboration (2017a); CMS Collaboration (2017b), and $b$ -jet tagging ATLAS Collaboration (2017b, c); CMS Collaboration (2018b); Sirunyan et al. (2018a). In addition to improving classification performance, ML techniques may also be able to make jet tagging more independent from simulation and robust to differences between simulation and data as well as between sideband and signal regions Metodiev and Thaler (2018); Komiske et al. (2018c); Metodiev et al. (2017); Dery et al. (2017); Cohen et al. (2018); Louppe et al. (2016); Shimmin et al. (2017); ATLAS Collaboration (2018). These and related techniques have also been proposed as more model-agnostic approaches to new particle searches Collins et al. (2018, 2019); Heimel et al. (2018); Farina et al. (2018); Hajer et al. (2018).

One of the key challenges with ML taggers is to identify what information the machine is using for classification. Understanding the origin of discrimination can lead to robustness when taggers are applied outside the region in which they were trained, can result in new theoretical insight for other applications, and may produce new simple observables that capture most of the information. While there are many proposals for ML metacognition Cohen et al. (2018); de Oliveira et al. (2016); Lin et al. (2018); Komiske et al. (2018b, a); Datta and Larkoski (2018, 2017), one particularly powerful approach is to identify simple product observables that capture most of the information from an ML algorithm trained on the full phase space Datta and Larkoski (2018). This approach results in analytically tractable observables that can capture nearly all of the power of a more complicated algorithm, but are also very robust and insightful. One of the most challenging aspects of the approach presented in Ref. Datta and Larkoski (2018) is the fitting process for picking the optimal simple product observable.

In this paper, we describe a new procedure based on ML for automating the feature extraction originally presented in Ref. Datta and Larkoski (2018). This method is applied to derive an optimal product observable for discriminating $H\rightarrow b\overline{b}$ vs. $g\rightarrow b\overline{b}$ and the outcome is compared to the result of Ref. Datta and Larkoski (2018), which used a brute-force approach. Having validated the method, a new classifier is developed to distinguish a $Z^{\prime}$ from generic quark and gluon jets. The phase space scan required in this latter tagging task is too big for the brute-force approach and therefore the automated method is required to find the optimal tagger. The resulting classifier has a simple form and is competitive with a tagger using high-dimensional, low-level inputs. Apart from Ref. Shimmin et al. (2017), this is the only study of the dependence on the mass of the new boson, which is timely given new searches for light boosted bosons Sirunyan et al. (2017); Sirunyan et al. (2018b); Aaboud et al. (2019).

This paper is organized as follows. The method for constructing product observables is described in Sec. II and the machine learning approaches are detailed in Sec. III. Results for both the Higgs and $Z^{\prime}$ classification tasks are presented in Sec. IV. The paper ends with conclusions and future outlook in Sec. V.

II $N$ -subjettiness Product Observables

The information about the kinematic phase space of $M$ -subjets in a jet is resolved with a set of $(3M-4)$ $N$ -subjettiness Stewart et al. (2010); Thaler and Van Tilburg (2011, 2012) observables. By increasing $M$ , one can identify the number of subjets required to saturate the classification performance based on the spanning set of $N$ -subjettiness observables Datta and Larkoski (2017):

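$$\left\{\tau_{1}^{(0.5)},\tau_{1}^{(1)},\tau_{1}^{(2)},\ldots,\tau_{M-2}^{(0.5)},\tau_{M-2}^{(1)},\tau_{M-2}^{(2)},\tau_{M-1}^{(1)},\tau_{M-1}^{(2)}\right\},$$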

where

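$$\tau_{N}^{(\beta)}=\frac{1}{\sum_{i\in\text{jet}}p_{T,i}R^{\beta}}\sum_{i\in\text{jet}}p_{T,i}\min_{\text{axes }j}\left(\Delta R_{j,i}\right)^{\beta},$$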

for some choice of $N$ axes within the jet; $R$ is the jet radius parameter, and $(\Delta R)^{2}=(\Delta\phi)^{2}+(\Delta\eta)^{2}$ . Given the minimal $M$ , one can posit an ansatz for a simple product observable that captures most of the information contained in a neural network trained on the entire spanning set. (Footnote 1: The product form may not be flexible enough to capture the full discrimination power. We find that it can capture a significant portion of the classification performance, but Appendix E indicates that further information can be useful.) The ansatz is:

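$$\beta_{M}^{\text{ML}}=\left(\tau_{1}^{(0.5)}\right)^{a}\left(\tau_{1}^{(1)}\right)^{b}\left(\tau_{1}^{(2)}\right)^{c}\left(\tau_{2}^{(1)}\right)^{d}\cdots.\tag{2}$$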
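The $N$ -subjettiness definition above can be sketched in a few lines of plain Python. This is only an illustration: the constituent format `(pt, eta, phi)`, the function names, and the hand-picked toy axes are assumptions made here, whereas the paper computes the axes that minimize $\tau_{N}^{(\beta)}$ with a jet clustering package.

```python
import math

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance, with the azimuthal difference wrapped into [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)

def n_subjettiness(constituents, axes, beta, jet_radius=0.8):
    """tau_N^(beta) for constituents [(pt, eta, phi), ...] and N axes [(eta, phi), ...].

    The axes are taken as given (an assumption of this sketch); no minimization
    over axis positions is performed.
    """
    pt_sum = sum(pt for pt, _, _ in constituents)
    weighted = sum(
        pt * min(delta_r(eta, phi, ax_eta, ax_phi) for ax_eta, ax_phi in axes) ** beta
        for pt, eta, phi in constituents
    )
    return weighted / (pt_sum * jet_radius ** beta)

# Hypothetical three-particle jet, purely for illustration.
toy_jet = [(100.0, 0.0, 0.0), (50.0, 0.3, 0.0), (10.0, 0.0, 0.4)]
tau_1 = n_subjettiness(toy_jet, axes=[(0.0, 0.0)], beta=1.0)
tau_2 = n_subjettiness(toy_jet, axes=[(0.0, 0.0), (0.3, 0.0)], beta=1.0)
print(tau_1, tau_2)
```

Adding a second axis on the second-hardest particle lowers the value, which is the usual behaviour: $\tau_{N}$ is small when the radiation is aligned with the $N$ chosen axes.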

For distinguishing $H\rightarrow b\overline{b}$ vs. $g\rightarrow b\overline{b}$ jets, Ref. Datta and Larkoski (2018) showed that the useful information for classification is saturated by $M=3$ and $\beta_{3}^{\text{ML}}$ has nearly the same tagging performance as the full $3$ -body phase space. The parameters $a,b,c,d,e$ that specify $\beta_{3}^{\text{ML}}$ were identified by randomly scanning the five-dimensional phase space and exploiting minimal correlations between some of the parameters. This becomes intractable when the optimal $M$ is bigger than $3$ .
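As a small illustration of the counting and of the product form, the sketch below enumerates the $(3M-4)$ basis labels and evaluates a product observable for one jet. The helper names and the toy $\tau$ values are hypothetical; the exponents are those of $\beta_{3}$ in Table 1.

```python
def nsub_basis(m_body):
    """(N, beta) labels of the 3M-4 observables that span M-body phase space."""
    labels = []
    for n in range(1, m_body - 1):
        # N = 1 .. M-2 each contribute three angular exponents.
        labels += [(n, 0.5), (n, 1.0), (n, 2.0)]
    # N = M-1 contributes only two.
    labels += [(m_body - 1, 1.0), (m_body - 1, 2.0)]
    return labels

def product_observable(taus, exponents):
    """Product of tau_k ** exponent_k over the basis elements."""
    result = 1.0
    for tau, power in zip(taus, exponents):
        result *= tau ** power
    return result

basis_3 = nsub_basis(3)   # 5 labels
basis_4 = nsub_basis(4)   # 8 labels

# Exponents (a, b, c, d, e) of beta_3 from Table 1, applied to made-up tau values.
beta3_exponents = [2.0, 0.0, 0.0, 0.5, -1.0]
toy_taus = [0.4, 0.3, 0.2, 0.16, 0.04]
toy_beta3 = product_observable(toy_taus, beta3_exponents)
print(len(basis_3), len(basis_4), toy_beta3)
```

Each additional body adds three exponents to scan, which is one way to see why a random scan that is feasible in five dimensions becomes expensive in eight.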

In this paper, we explore methods to overcome the difficulties of extending this procedure to higher dimensions. In one approach, we replace the random sampling segment of the procedure with a combination of neural networks carrying out regression from the parameter space to the distributions of the product observable for individual jets. Off-the-shelf minimization routines can then be used to optimize any metric of the classifier performance. A complementary and simpler approach is to directly use the form in Eq. 2 in the machine learning optimization, where the learnable parameters are the exponents $\{a,b,c,...\}$ . Further details are described in the next sections.

III Machine Learning Implementation

III.1 Dataset

Proton-proton collisions with $Z^{\prime}\rightarrow\text{hadrons}$ , $H\rightarrow b\bar{b}$ , and generic quark and gluon jets (QCD) at $\sqrt{s}=13$ TeV are generated using Pythia 8.226 Sjostrand et al. (2006, 2015). For the $H\rightarrow b\bar{b}$ case, the background is enriched in $g\rightarrow b\bar{b}$ as in Ref. Alwall et al. (2014) by generating the gluon splitting matrix element in MadGraph 5 v2.5.4 Alwall et al. (2014). All detector-stable particles excluding neutrinos and muons are clustered into jets using the anti- $k_{t}$ algorithm Cacciari et al. (2008) with $R=0.8$ as implemented in Fastjet Cacciari et al. (2012). Jets are groomed by reclustering the constituents using the Cambridge-Aachen algorithm Dokshitzer et al. (1997); Wobisch and Wengler (1998) and applying the soft drop algorithm Larkoski et al. (2014a) with $\beta=0$ and $z_{\text{cut}}=0.1$ (equivalent to modified mass drop tagging or mMDT Dasgupta et al. (2013)). The $N$ -subjettiness observables are computed using the axes that minimize $\tau_{N}^{(\beta)}$ , using the exclusive $k_{t}$ algorithm Catani et al. (1993); Ellis and Soper (1993) with standard $E$ -scheme recombination Blazey et al. (2000). For comparison with other state-of-the-art two-prong tagging techniques, the $D_{2}$ Larkoski et al. (2014b), $N_{2}$ Moult et al. (2016) observables, and $\tau_{21}^{(\beta)}$ with winner-take-all (WTA) recombination Bertolini et al. (2014); Larkoski et al. (2014c); Larkoski and Thaler (2014), are also computed from the jet constituents.

III.2 Construction of optimized product observables

Using the approach followed in Ref. Datta and Larkoski (2018), the point of saturation of discrimination power is first identified using a deep neural network (DNN) classifier. For $Z^{\prime}$ vs. QCD and $H\rightarrow b\overline{b}$ vs. $g\rightarrow b\overline{b}$ discrimination, we note that discrimination power saturates at 4-body (8-dimensional) and 3-body phase space (5-dimensional), respectively. Then it is simple to form the product observable from the elements of the $M$ -body basis corresponding to saturation.

We examine two approaches for finding the optimal product observable. The first approach follows a similar method as the brute-force algorithm. Neural networks approximate signal and background probability distributions conditioned on the parameters $\{a,b,c,...\}$ and then any automated optimization procedure can be used to identify the best exponents. For each task, the product observable is calculated for 25,000 signal and background jets for different values of the parameters [ $a-e$ ] ( $H\rightarrow b\overline{b}$ ) or [ $a-h$ ] ( $Z^{\prime}$ ), in the range $[-5,5]$ . These distributions are then stored to generate training sets for the neural networks used to carry out regression from the parameter space to the calculation of $\beta_{M}^{\text{ML}}$ with those exponents.
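The construction of one regression training set can be sketched as follows. The uniform sampling of the exponents, the function names, and the random toy jets are assumptions of this sketch; the text only states that the exponents lie in $[-5,5]$ and that the product observable is evaluated on a fixed set of jets for each parameter point.

```python
import math
import random

def sample_exponents(n_params, rng, low=-5.0, high=5.0):
    """Draw one parameter point; uniform sampling is an assumption of this sketch."""
    return [rng.uniform(low, high) for _ in range(n_params)]

def log_product_observable(log_taus, exponents):
    """log(beta_M) = sum_k exponent_k * log(tau_k) for a single jet."""
    return sum(p * lt for p, lt in zip(exponents, log_taus))

def build_training_set(jets_log_taus, n_points, rng):
    """Pair each sampled parameter point with log(beta_M) evaluated on every jet."""
    n_params = len(jets_log_taus[0])
    training_set = []
    for _ in range(n_points):
        exponents = sample_exponents(n_params, rng)
        targets = [log_product_observable(lt, exponents) for lt in jets_log_taus]
        training_set.append((exponents, targets))
    return training_set

rng = random.Random(0)
# Toy stand-in for measured jets: 8 log(tau) values per jet, as for the 4-body basis.
toy_jets = [[math.log(rng.uniform(0.01, 1.0)) for _ in range(8)] for _ in range(100)]
training_set = build_training_set(toy_jets, n_points=20, rng=rng)
print(len(training_set), len(training_set[0][1]))
```

In the setup described in the text the toy sizes would be replaced by 25,000 jets per class and several hundred thousand parameter points.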

While there are multiple possibilities for learning the probability distribution of $\beta_{M}$ given $\{a,b,c,...\}$ , such as generative adversarial networks Goodfellow et al. (2014) and variational autoencoders Kingma and Welling (2013); Rezende et al. (2014), the method that we found works well for the product observables is illustrated in Fig. 1. The network takes 5 (Higgs) or 8 ( $Z^{\prime}$ ) inputs and outputs 25,000 numbers, which represent a dataset that is the same size as the training data, but with the specified parameter values $\{a,b,c,...\}$ . From these 25,000 values, the probability distribution of $\beta$ is formed for signal and background and the one-dimensional likelihood ratio is constructed for optimizing the classifier performance. Variations on this setup are possible, such as (significantly) reducing the number of points needed to specify the probability distributions, but this approach was found to be robust to perturbations in initialization and network architecture. For this paper, it was found that the network did not work well with fewer than 25k example jets per parameter point. For each network, 250k (450k) parameter points were used for training in the $Z^{\prime}$ and ungroomed Higgs (groomed Higgs) case. In the groomed Higgs case only, a single network was trained for signal and background with a 1/0 switch added to the input. Separate networks were trained for signal and background in the $Z^{\prime}$ and ungroomed Higgs cases. To reduce the effects of numerical instability on the training of these networks, we train on samples after taking the natural logarithm of the 25k measured values of the product observables.

Aside from the use (or not) of the switch input, both the $H\rightarrow b\overline{b}$ and $Z^{\prime}$ tasks use simple fully-connected neural networks with two hidden layers. The input layer is followed by a dense layer with either 250 or 500 nodes, then another dense layer with 100 or 250 nodes, followed by an output layer with 25,000 nodes using a linear activation. The number of nodes in the hidden layers was larger for the $Z^{\prime}$ case with grooming than for the Higgs case or the ungroomed $Z^{\prime}$ case.

We use leaky rectified linear units (Leaky ReLU) as the activations for the hidden layers. The networks were compiled with a mean squared error loss function (on the penultimate layer shown in Fig. 1, not on $p(\beta_{M})$ directly), using Adam optimization Kingma and Ba (2014). The regression networks were each trained for $\sim 10,000$ epochs. All deep learning tasks were carried out with the Keras Chollet (2015) deep learning libraries, using the TensorFlow Abadi et al. (2015) backend.
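To make the layer structure concrete, here is a shape-level sketch of the regression network written in plain Python rather than Keras, which is what the text uses. The random weights, the leaky-ReLU slope of 0.1, and the shrunken demo sizes are assumptions of this sketch; only the layer widths passed to `n_parameters` come from the description above.

```python
import random

def leaky_relu(x, slope=0.1):
    """Leaky ReLU; the slope value is an assumption, the text does not quote one."""
    return x if x > 0.0 else slope * x

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (untrained, illustration only)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: activation(W x + b)."""
    outputs = []
    for w_row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs

def forward(exponents, layers):
    """Hidden layers use leaky ReLU, the output layer is linear."""
    hidden = exponents
    for index, (weights, biases) in enumerate(layers):
        is_output = index == len(layers) - 1
        hidden = dense(hidden, weights, biases, None if is_output else leaky_relu)
    return hidden

def mean_squared_error(predicted, target):
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def n_parameters(sizes):
    """Weights plus biases for a stack of dense layers with the given widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

rng = random.Random(0)
demo_sizes = [8, 25, 10, 50]  # shrunken stand-in for 8 -> 250 -> 100 -> 25,000
layers = [init_layer(a, b, rng) for a, b in zip(demo_sizes, demo_sizes[1:])]
demo_output = forward([0.90, -2.87, 0.18, -1.78, -0.72, 1.79, 2.48, -0.44], layers)
full_size_parameters = n_parameters([8, 250, 100, 25000])
print(len(demo_output), full_size_parameters)
```

Almost all of the parameters of the full-size network sit in the final layer, because each of the 25,000 outputs has its own set of weights.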

Given the set of $25,000$ values of the $\beta_{M}$ observable for a given set of parameters, it is straightforward to use these networks in an optimality scan. For this purpose, we use SciPy’s Jones et al. (2001–) basin-hopping Wales and Doye (1997) global minima finder using the non-linear, derivative-free COBYLA (Constrained Optimization BY Linear Approximation) Powell (1994) minimizer to scan over local minima. In the optimization, the networks are used to predict background and signal distributions for a given set of parameters. The 1-dimensional binned likelihood distributions of the observable, constructed from the network outputs, were then used to calculate the area under the ROC curve (AUC) to estimate the discrimination power, where (1-AUC) was explicitly chosen as the metric for the basin-hopping minimization. (Footnote 2: In principle, one can estimate the AUC without binning, but it was found that there was not a significant sensitivity to the choice of binning.) Appendix A illustrates that the regression networks can be used to accurately model the dependence of the AUC as a function of the parameters. The observable selected using this procedure will be denoted $\beta_{3,H\rightarrow b\bar{b}}^{\mathrm{ML}}$ in the next sections.
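One way to turn two sets of observable values into the (1-AUC) objective is sketched below. The bin count, the function names, and the Gaussian toy samples are assumptions of this sketch, and it is not claimed to reproduce the paper's exact binning; in the procedure described above the two inputs would be the 25,000 predicted values from the signal and background networks.

```python
import random

def histogram(values, low, high, n_bins):
    """Counts per equal-width bin; the upper edge is included in the last bin."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for value in values:
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

def binned_auc(signal, background, n_bins=50):
    """AUC of the one-dimensional binned likelihood-ratio classifier."""
    low = min(min(signal), min(background))
    high = max(max(signal), max(background))
    if high == low:
        return 0.5  # all values identical: no discrimination
    sig = [c / len(signal) for c in histogram(signal, low, high, n_bins)]
    bkg = [c / len(background) for c in histogram(background, low, high, n_bins)]
    tiny = 1e-12  # keeps the ratio finite for empty background bins
    # Visit bins from the most signal-like to the most background-like.
    order = sorted(range(n_bins), key=lambda i: (sig[i] + tiny) / (bkg[i] + tiny), reverse=True)
    auc, true_positive_rate = 0.0, 0.0
    for i in order:
        # Trapezoid under the ROC segment contributed by this bin.
        auc += bkg[i] * (true_positive_rate + 0.5 * sig[i])
        true_positive_rate += sig[i]
    return auc

rng = random.Random(2)
toy_signal = [rng.gauss(1.0, 1.0) for _ in range(5000)]
toy_background = [rng.gauss(0.0, 1.0) for _ in range(5000)]
toy_auc = binned_auc(toy_signal, toy_background)
objective = 1.0 - toy_auc  # the quantity handed to the global minimizer
print(toy_auc, objective)
```

Because the bins are ordered by likelihood ratio, the result does not depend on whether signal sits at large or small values of the observable, so the optimizer is free to flip the sign of all exponents.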

We also note that the space of possible inputs is degenerate since a monotonic function of an observable has the same discrimination power as the original observable. However, due to the finite binning required to calculate the AUC’s from the likelihood distributions, and statistical fluctuations in a given data sample, the observables do not have precisely the same power as monotonic functions of themselves. The issue of degeneracies is not explicitly dealt with in the minimization procedure, but if the networks are adequately trained over the input space, it is sufficient to locate any one ‘global’ minimum among local minima of similar depth, using basin-hopping or any other global minimizer.
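The degeneracy under monotonic transformations can be checked numerically with an unbinned, rank-based AUC. The log-normal toy samples and the function name are assumptions of this sketch.

```python
import math
import random

def rank_auc(signal, background):
    """Probability that a random signal value exceeds a random background value.

    Ties count one half. This only depends on the ordering of the values.
    """
    wins = 0.0
    for s in signal:
        for b in background:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(signal) * len(background))

rng = random.Random(1)
signal = [rng.lognormvariate(0.5, 1.0) for _ in range(200)]
background = [rng.lognormvariate(0.0, 1.0) for _ in range(200)]

auc = rank_auc(signal, background)
# Increasing transformation: the ordering, and hence the AUC, is preserved.
auc_log = rank_auc([math.log(x) for x in signal], [math.log(x) for x in background])
# Decreasing transformation (here a power of -2): the ordering is reversed,
# so the AUC becomes 1 - AUC and the cut direction simply flips.
auc_power = rank_auc([x ** -2.0 for x in signal], [x ** -2.0 for x in background])
print(auc, auc_log, auc_power)
```

Raising a product observable to a fixed power multiplies every exponent by that power, so exponent sets that differ by an overall factor describe the same classifier up to the direction of the cut.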

A second approach to optimizing $\{a,b,c,...\}$ directly uses Eq. 2. The product form can be used directly as a tunable function for predicting signal/background with tunable parameters $\{a,b,c,...\}$ . This is a more direct way of identifying the optimal solution without explicitly modeling the probability distributions. Optimizing a generic function is possible with methods like stochastic gradient descent, but the product observable is amenable to a significant simplification. (Footnote 3: We thank Eric Metodiev for this insightful observation.) In particular, two classifiers that are monotonic transformations of each other result in the same classification performance. By taking the logarithm of Eq. 2, one can transform the problem into linear regression where the inputs are $\log(\tau)$ and the coefficients are the exponents. (Footnote 4: Linear regression was proven to be sufficient for all IRC safe observables in Ref. Komiske et al. (2018b); however, our results need not be IRC safe.) This approach uses the mean squared error loss to identify $\{a,b,c,...\}$ . The observable selected using this procedure will be denoted $\hat{\beta}_{3,H\rightarrow b\bar{b}}^{\mathrm{ML}}$ in the next sections.
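The log-linear idea, $\log\beta_{M}=a\log\tau_{1}^{(0.5)}+b\log\tau_{1}^{(1)}+\dots$, can be sketched as an ordinary least-squares fit of the class label on the $\log(\tau)$ inputs. The intercept term, the closed-form normal-equation solver, and the two-feature toy data are assumptions of this sketch; the text only specifies a mean squared error loss.

```python
import random

def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

def fit_exponents(log_taus, labels):
    """Least-squares fit of labels on [1, log tau_1, ...].

    Returns (intercept, exponents). The intercept is an assumption of this
    sketch; it shifts the output but does not change the ordering of jets.
    """
    design = [[1.0] + list(row) for row in log_taus]
    n = len(design[0])
    gram = [[sum(r[i] * r[j] for r in design) for j in range(n)] for i in range(n)]
    moment = [sum(r[i] * y for r, y in zip(design, labels)) for i in range(n)]
    coefficients = solve_linear_system(gram, moment)
    return coefficients[0], coefficients[1:]

rng = random.Random(3)
# Toy data: two log(tau) features whose means differ between the classes.
toy_signal = [[rng.gauss(-2.0, 0.5), rng.gauss(-3.0, 0.5)] for _ in range(500)]
toy_background = [[rng.gauss(-1.5, 0.5), rng.gauss(-2.0, 0.5)] for _ in range(500)]
features = toy_signal + toy_background
labels = [1.0] * len(toy_signal) + [0.0] * len(toy_background)
intercept, exponents = fit_exponents(features, labels)
print(intercept, exponents)
```

The fitted coefficients play the role of $\{a,b,c,...\}$ , and exponentiating the fitted linear combination recovers a product observable of the form of Eq. 2 up to an overall constant.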

In the limit of infinite data and an arbitrarily flexible neural network, both the ensemble learning and linear regression approaches should achieve the same performance. The latter is significantly easier to train, but the more complex ensemble approach may provide additional benefits: by providing access to the probability distributions, it allows one to optimize any performance metric directly. This includes batch-level losses like the AUC, false positive rate at a fixed true-positive rate, etc. The mean squared error loss should be sufficient to optimize all of these metrics, but may be prevented from reaching the desired optimum due to limited training statistics. In practice, we do not find this to be the case with the setup presented here, but the structure may be useful for related tasks in the future.

IV Results

In this section, we present the new observables obtained for the different classification tasks for the ungroomed $Z^{\prime}$ samples (the groomed case is in Appendix C). For closure, we first demonstrate that this new procedure produces an observable for ungroomed $H\rightarrow b\overline{b}$ discrimination with the same performance as the $\beta_{3}$ observable proposed in Ref. Datta and Larkoski (2018) (the groomed case in Appendix B). Then we extend the procedure to higher $M$ -body phase space by applying it to $Z^{\prime}$ discrimination for three values of $m_{Z^{\prime}}$ , and propose new observables for those classification tasks.

IV.1 Ungroomed $H\rightarrow b\overline{b}$ vs. $g\rightarrow b\overline{b}$ discrimination

Utilizing the result that discrimination power for ungroomed $H\rightarrow b\overline{b}$ vs. $g\rightarrow b\overline{b}$ discrimination saturates at 3-body phase space, we use the procedures proposed in the previous section to find the optimal product observable. The final values for the parameters $\{a,...,e\}$ obtained through the optimization are presented in Table 1, along with those obtained in the previous study. Interestingly, the exponents with the ensemble method are nearly the same for $a$ , $b$ , $d$ , and $e$ , but slightly different for $c$ . For the regression method, the exponents are nearly the same as the ensemble method up to a constant factor (approximately $-2$ ) for $c$ , $d$ , and $e$ , but not for $a$ and $b$ . These results indicate the presence of multiple observables with comparable performance.
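The statement about a common factor can be checked directly from the Table 1 entries. The sketch below only divides the ensemble exponents by the regression exponents; the variable names are chosen here for illustration.

```python
# Exponents from Table 1 (ungroomed H -> b bbar).
ensemble = {"a": 1.87, "b": -0.02, "c": -0.14, "d": 0.66, "e": -0.98}
regression = {"a": -0.11, "b": -0.58, "c": 0.09, "d": -0.25, "e": 0.51}

# Ratio of ensemble to regression exponent for each parameter.
ratios = {name: ensemble[name] / regression[name] for name in ensemble}

for name in sorted(ratios):
    print(name, round(ratios[name], 2))
```

For $c$ , $d$ , and $e$ the ratios are negative and of order $-2$ , whereas for $a$ and $b$ they are far from that value. The agreement is only approximate, in line with the interpretation that several distinct exponent sets give comparable performance.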

In Fig. 2a, we plot the distributions of the new observable computed for signal and background, along with the prediction from the ensemble neural network. We note that the network provides a good match to the true distribution, where the latter is also calculated on 10 times more jets. Further, in Fig. 2b we plot the distributions of the observable obtained via the ML regression method. We then compare the ROC curves for the new observables to $D_{2}^{(2)}$ Larkoski et al. (2014b), $N_{2}^{(2)}$ Moult et al. (2016) observables, and $\tau_{21}^{(2)}$ in Fig. 2c.

In addition, we compare the new observables to $\beta_{3}$ in Fig. 2d to demonstrate that the three observables have essentially the same discrimination power, as expected. This allows us to proceed to applying the procedure to higher-dimensional problems.

IV.2 Ungroomed $Z^{\prime}$ vs. $\mathrm{QCD}$

We first train neural network classifiers on the $M$ -body $N$ -subjettiness bases, to identify the point of saturation of discrimination power for each value of $m_{Z^{\prime}}$ . (Footnote 5: A single neural network architecture, consisting of seven fully connected (five hidden) layers, was utilized for all of the classification tasks. The first four Dense layers consisted of 1000, 1000, 750 and 500 nodes respectively, and were assigned a Dropout Srivastava et al. (2014) regularization of 0.2, to prevent over-fitting on training data. The next two Dense layers consisted of 250 nodes with Dropout regularization 0.1, and 100 nodes without Dropout. The input layer and all hidden layers utilized the ReLU activation function Nair and Hinton (2010), while the output layer, consisting of a single node, used a sigmoid activation. The network was compiled with the binary cross-entropy loss minimization function, using the Adam optimization Kingma and Ba (2014). Models were trained with Keras’ default EarlyStopping callback, with appropriate patience thresholds, to further negate possible over-fitting.) The results are presented in Fig. 3, showing that saturation occurs with the 4-body phase space for each case.

We then construct the $\beta_{4,Z^{\prime}}^{ML}$ and $\hat{\beta}_{4,Z^{\prime}}^{ML}$ product observables from the elements of the 8-dimensional 4-body basis and run the procedure described in Sec. III to obtain new observables optimized for $Z^{\prime}$ discrimination at three different values of $m_{Z^{\prime}}$ .

We present the distributions of the new observables for $Z^{\prime}$ discrimination in Fig. 4 and then compare their discrimination power to standard observables and DNNs trained on the spanning $N$ -subjettiness bases in Fig. 5. The corresponding values of $\{a,b,c,...,h\}$ and the AUCs are in Tables 2, 3 and 4, respectively. The comparison of the true and predicted distributions in Fig. 4 illustrates the excellent quality of the regression network. The ROC curves in Fig. 5 show that the learned $\beta^{\text{ML}}$ and $\hat{\beta}^{\text{ML}}$ outperform the state-of-the-art single physics-motivated observables (top row), though the product observables do not fully saturate the performance of the DNN trained on the full $4$ -body phase space (bottom row). This suggests that a more flexible form (other than a simple product) is required to build a simple observable to capture more of the classification information. The product exponents obtained from the ensemble and regression methods are not simple scalings of each other, though the fact that both have a similar performance suggests that one is a monotonic transformation of the other.

The optimized $\beta^{\text{ML}}$ and $\hat{\beta}^{\text{ML}}$ observables are not identical for the different values of $m_{Z^{\prime}}$ (tables 2 and 3), but it would be interesting to study to what extent the trends are physical or are due to the existence of multiple observables with similar performance. We leave this study to future work. However, a first indication that the observables contain similar physical information is studied in Appendix D, where the optimized product for one mass is applied to another mass. The ROC curves are similar for all three product observables when applied to the same $m_{Z^{\prime}}$ .

V Conclusions

This paper has extended the growing literature of machine-learning assisted jet substructure-based tagging in two ways. First, we have developed a procedure to automatically identify the optimal product observable, using the $N$ -subjettiness features as an example. This is an important innovation because observables with relatively simple analytic forms are robust complements to complex neural network classifiers and, prior to this work, there was no efficient way to identify the best coefficients in the product. Second, we have used this automated framework to identify the optimal product observables for searching for boosted resonances like the $Z$ boson, but with beyond the standard model masses. Jet substructure has proven to be a powerful toolset for such searches, but until now, there have been few studies of the mass dependence of the optimal observables.

Future extensions of the methods introduced in this paper may be able to simplify the regression procedure, as well as study the connections between different classifiers with similar performance (including the ones connected by monotonic functions). The power of the method may also be extended by considering other parametric forms besides products. Classification problems demanding a higher $M$ -body phase space are a natural extension of the examples presented here.

As machine learning techniques are used more widely to guide the optimal selection of classifiers, there will be a growing need to simplify and interpret the guidance from the machines. We have prepared an automated approach to construct optimal observables with simple, analytic forms, which can be used for further theoretical and experimental studies. This technique will form the basis of multiple extensions in the future to improve classification performance and increase the robustness of searches and measurements at the LHC and beyond.

Acknowledgements

This work was supported by the U.S. Department of Energy, Office of Science under contract DE-AC02-05CH11231. We would like to thank Gregor Kasieczka, Patrick Komiske, Eric Metodiev, and Jesse Thaler for detailed comments on the analysis and manuscript.

Appendix A Crosscheck for performance of the Regression networks

Here we briefly demonstrate that the regression DNNs do learn to approximate the mapping from the input parameters of the product observables to their densities, i.e., a mapping from $\mathbb{R}^{8}\to\mathbb{R}^{25,000}$ . We specifically choose the ungroomed 90 GeV case and take the values of $\{a,...,h\}$ for the optimal observable, as listed in Table 2.

We then select one of the parameters and vary it between $-7$ and 7 with a step size of 0.1 while keeping the other parameters fixed. This allows us to study how well the networks interpolate AUCs over a range of values around the optimum we locate. In addition, by going beyond the training range of $[-5,5]$ , we demonstrate that the networks can extrapolate the mapping and still yield the AUC with a good level of accuracy. The results of this study are shown in Fig. 6 and indicate that the regression networks accurately track the trajectories of the AUC in these one-dimensional slices of the parameter space.
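The one-dimensional scan described above can be sketched as follows. This is only an illustration under stated assumptions: it computes the AUC directly on toy log-inputs instead of using the regression networks, and the helper names `auc` and `scan_one_exponent`, the toy samples, and the starting exponents are hypothetical. Only the scan range $[-7,7]$ and the step size of 0.1 are taken from the text.

```python
import numpy as np

def auc(signal_scores, background_scores):
    """Probability that a random signal score exceeds a random background score."""
    s = np.asarray(signal_scores, dtype=float)[:, None]
    b = np.asarray(background_scores, dtype=float)[None, :]
    return float(np.mean(s > b) + 0.5 * np.mean(s == b))

def scan_one_exponent(log_sig, log_bkg, exponents, index, values):
    """AUC as exponent `index` takes each of `values`, all others held fixed."""
    aucs = []
    for value in values:
        p = np.array(exponents, dtype=float)
        p[index] = value
        # The log of the product is a monotonic score, so it gives the same AUC.
        aucs.append(auc(log_sig @ p, log_bkg @ p))
    return np.array(aucs)

# Toy log-inputs (NOT real jet data); input 0 is made discriminating.
rng = np.random.default_rng(1)
log_sig = np.log(rng.uniform(0.05, 1.0, size=(300, 8)))
log_bkg = np.log(rng.uniform(0.05, 1.0, size=(300, 8)))
log_bkg[:, 0] -= 1.0

# Hypothetical starting point; the paper starts from the optimum in Table 2.
base_exponents = np.zeros(8)
base_exponents[1] = 1.0

# 141 points from -7 to 7 inclusive, i.e. a step size of 0.1.
values = np.linspace(-7.0, 7.0, 141)
scan = scan_one_exponent(log_sig, log_bkg, base_exponents, 0, values)
```

Comparing such a directly computed curve with the network prediction is the kind of cross-check the scan is meant to provide.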

Appendix B Groomed $H\rightarrow b\overline{b}$ vs. $g\rightarrow b\overline{b}$ discrimination

Utilizing the result that the discrimination power for mMDT groomed $H\rightarrow b\overline{b}$ vs. $g\rightarrow b\overline{b}$ discrimination saturates at 3-body phase space Datta and Larkoski (2018), we use the procedure proposed in Sec. III to find the optimal product observable. The final values for the parameters $\{a,...,e\}$ obtained through the optimization are presented in Table 5, along with those obtained in the previous study. Interestingly, the exponents for $\beta_{3,H\rightarrow b\bar{b}}^{\mathrm{ML(g)}}$ are nearly the same for $c$ , $d$ , and $e$ , but are quite different for $a$ and $b$ . The factors $d$ and $e$ are also similar for $\hat{\beta}_{3,H\rightarrow b\bar{b}}^{\mathrm{ML(g)}}$ up to a multiplicative factor.
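As a side note on multiplicative factors in the exponents, a short identity is useful. The notation is assumed here for illustration: $x_{i}$ stands for the $N$ -subjettiness inputs and $p_{i}$ for the exponents $\{a,...,e\}$ . Rescaling all exponents by a common positive factor $\lambda$ gives

```latex
\beta(\lambda p) \;=\; \prod_{i} x_{i}^{\lambda p_{i}}
\;=\; \Big(\prod_{i} x_{i}^{p_{i}}\Big)^{\lambda}
\;=\; \beta(p)^{\lambda}, \qquad \lambda > 0 .
```

Since $t\mapsto t^{\lambda}$ is strictly increasing for $t>0$ and $\lambda>0$ , every cut on $\beta(p)$ corresponds to a cut on $\beta(\lambda p)$ , so the two observables have identical ROC curves and AUCs. For $\lambda<0$ the ordering is reversed and the AUC becomes $1-\mathrm{AUC}$ . This holds only when every exponent shares the same factor; a factor common to a subset of the exponents generally defines a different observable.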

In Fig. 7a, we plot the distributions of the new observable computed for signal and background, along with the prediction from the neural network. We note that the network provides a good match to the true distribution, where the latter is also calculated on 10 times more jets. We then compare the ROC curves for the new observable to those for the $D_{2}^{(2)}$ Larkoski et al. (2014b), $N_{2}^{(2)}$ Moult et al. (2016), and $\tau_{21}^{(2)}$ observables in Fig. 7c.

In addition, we compare the new observable to $\beta_{3}^{(g)}$ in Fig. 7d to demonstrate that both observables have essentially the same discrimination power, as expected. This agreement allows us to proceed to applying the procedure to higher-dimensional problems. Further, we plot the ROC curve for the 4-body product observable from the linear regression method, noting that it provides the best performance of the observables that have been explored for this problem. (Footnote 6: Explicitly, the optimal parameter values for $\hat{\beta}_{4,H\rightarrow b\bar{b}}^{\mathrm{ML(g)}}$ are $\{a,...,h\}=\{-2.09,1.46,-0.31,-0.49,0.35,0.03,-0.18,0.23\}$ , which lead to an AUC of 0.778 in Fig. 7c.)

Appendix C Groomed $Z^{\prime}$ vs. $\mathrm{QCD}$

In this section we carry out the same set of studies for mMDT groomed $Z^{\prime}$ discrimination as for the ungroomed cases from Sec. IV.2. As in the ungroomed case, Fig. 8 indicates that the saturation of discrimination power occurs at 4-body phase space.

The results for the final observables for the three $m_{Z^{\prime}}$ points are presented in tables 6 and 7, and the observable distributions are plotted in Fig. 10. The performance of the new observables is compared to standard ones and $M$ -body DNNs in Fig. 10, and the corresponding AUCs are shown in Table 8 for different mass points. The conclusions from this section are qualitatively the same as from Sec. IV.2, with a slightly lower AUC from both the product observable and the physics-motivated observables. Importantly, the product observables for the groomed case appear to saturate the bounds from the $M$ -body phase space better than in the ungroomed case.

Appendix D Mass dependence of $\beta_{M}^{\text{ML}}$

Here, we briefly study the performance of the new observables presented in Sec. IV.2. They are tested on a different combination of signal and background samples from the ones they were optimized on; for example, we calculate the new observable for $m_{Z^{\prime}}=130$ GeV on signal samples for $m_{Z^{\prime}}=90$ GeV, and background, that pass the mass window on which the 90 GeV observable was optimized. The results for this study are presented in Fig. 11, and indicate that while these observables are optimized on samples from a specific mass point, they can be applied to other classification tasks and still provide better discrimination performance than standard observables. This also suggests that the different parameter sets in tables 2 and 3 may represent observables with very similar physical information even though the $N$ -subjettiness variables are not invariant under transverse boosts.

Appendix E Saturating the discrimination power of $\hat{\beta}_{M}^{\text{ML}}$

In this section we briefly study the flexibility of the product form ansatz using the $\hat{\beta}_{M}^{\text{ML}}$ observables obtained via the linear regression procedure. For concreteness, we look at the $m_{Z^{\prime}}=90$ GeV case, and plot ROC curves for the product observables up to $M=8$ in Fig. 12.
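To make the growth of the product with $M$ concrete, the sketch below enumerates the inputs for a given $M$ . It assumes the $N$ -subjettiness basis convention of Datta and Larkoski (2017), with three angular exponents for each of $\tau_{1},\dots,\tau_{M-2}$ and two for $\tau_{M-1}$ ; this is consistent with the five exponents $\{a,...,e\}$ at $M=3$ and the eight exponents $\{a,...,h\}$ at $M=4$ used in this paper. The function name `nsubjettiness_basis` is hypothetical.

```python
def nsubjettiness_basis(m_body):
    """List (N, angular exponent) labels of the assumed M-body basis."""
    if m_body < 2:
        raise ValueError("M-body phase space requires M >= 2")
    basis = []
    # tau_1 ... tau_{M-2} each enter with angular exponents 0.5, 1 and 2.
    for n in range(1, m_body - 1):
        basis.extend([(n, 0.5), (n, 1.0), (n, 2.0)])
    # tau_{M-1} enters with angular exponents 1 and 2 only.
    basis.extend([(m_body - 1, 1.0), (m_body - 1, 2.0)])
    return basis

# Number of exponents in the product observable for each M considered here.
n_exponents = {m: len(nsubjettiness_basis(m)) for m in range(3, 9)}
```

Under this assumption the number of exponents grows as $3M-4$ , so the $M=8$ product has 20 free parameters.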

We observe that the discrimination power gradually increases up to the inclusion of 7- or 8-body phase space variables. We compare these curves to the ROC curve of the 4-body DNN classifier, which marks the point of saturation. A DNN can adjust thresholds on the $M$ -body inputs such that there is effectively only redundant discriminating information in higher $M$ -body bases, as is also expected from the physics study in Ref. Datta and Larkoski (2017). The product observables, by contrast, still benefit from including $N$ -subjettiness variables from beyond the point of saturation.

Depending on the classification task, the product observables may even come very close to matching the performance of a saturated ML classifier (Fig. 10). Ultimately, however, they cannot capture all available information, because the product form ansatz lacks further flexibility. These observations will of course vary based on the objects being studied. We leave further physics studies of the product form or other equivalent ansatz to future work.

Bibliography

1. Larkoski et al. (2017): A. J. Larkoski, I. Moult, and B. Nachman (2017), eprint 1709.04464.
2. Asquith et al. (2018): L. Asquith et al. (2018), eprint 1803.06991.
3. Pearkes et al. (2017): J. Pearkes, W. Fedorko, A. Lister, and C. Gay (2017), eprint 1704.02124.
4. Komiske et al. (2018a): P. T. Komiske, E. M. Metodiev, and J. Thaler (2018a), eprint 1810.05165.
5. Komiske et al. (2018b): P. T. Komiske, E. M. Metodiev, and J. Thaler, JHEP 04, 013 (2018b), eprint 1712.07124.
6. Erdmann et al. (2018): M. Erdmann, E. Geiser, Y. Rath, and M. Rieger (2018), eprint 1812.09722.
7. Datta and Larkoski (2017): K. Datta and A. Larkoski, JHEP 06, 073 (2017), eprint 1704.08249.
8. Datta and Larkoski (2018): K. Datta and A. J. Larkoski, JHEP 03, 086 (2018), eprint 1710.01305.
