Reducing the dependence of the neural network function to systematic uncertainties in the input space
Stefan Wunsch, Simon Jörger, Roger Wolf, Günter Quast

TL;DR
This paper introduces a novel training method for neural networks that reduces their sensitivity to systematic uncertainties in input data by penalizing output variations, improving robustness in scientific data analysis.
Contribution
The paper presents a new approach of incorporating penalties on output variation into the loss function to mitigate dependence on systematic uncertainties in neural networks.
Findings
Effective in reducing sensitivity to systematic uncertainties
Applicable to complex scientific data analysis scenarios
Demonstrated with high-energy physics example
Abstract
Applications of neural networks to data analyses in natural sciences are complicated by the fact that many inputs are subject to systematic uncertainties. To control the dependence of the neural network function to variations of the input space within these systematic uncertainties, several methods have been proposed. In this work, we propose a new approach of training the neural network by introducing penalties on the variation of the neural network output directly in the loss function. This is achieved at the cost of only a small number of additional hyperparameters. It can also be pursued by treating all systematic variations in the form of statistical weights. The proposed method is demonstrated with a simple example, based on pseudo-experiments, and by a more complex example from high-energy particle physics.
Stefan Wunsch (1, 2; corresponding author), Simon Jörger (1), Roger Wolf (1), Günter Quast (1)
1 Karlsruhe Institute of Technology, Institute of Experimental Particle Physics, Karlsruhe, Germany
2 CERN, Geneva, Switzerland
1 Introduction
Neural network (NN) techniques are in wide and increasing use to solve classification and regression tasks in the analysis of high-energy particle physics data. Examples of their use in physics object identification, e.g. at the LHC experiments ATLAS and CMS, are the classification of particle jets induced by heavy-flavor quarks Aad:2015ydr; Sirunyan:2017ezt and the identification of $\tau$ leptons Aad:2015unr; Sirunyan:2018pgf. Examples of data analyses that make use of NNs not only for object identification, but also to distinguish between signal- and background-like samples, are the latest analyses of Higgs boson events in association with third-generation fermions at the LHC Aaboud:2018urx; Aaboud:2018zhk; Sirunyan:2018hoz; Sirunyan:2018kst; CMS-PAS-HIG-18-032. These classification tasks usually aim at the distinction of a signal from one or more background processes. They are characterized by a relatively small number of input parameters to the NN, of one or two orders of magnitude, which may exhibit non-trivial correlations among each other.
Each physics measurement is subject to systematic uncertainties, which have to be propagated from the input space $\{x_i\}$ to the NN output $f(\{x_i\})$. This usually happens in terms of variations of a given input parameter $x_i$ within its uncertainties $\Delta_i$. We abbreviate the set of $\{x_i\}$ by $\mathbf{x}$ and the set of modified input parameters $\{x_i+\Delta_i\}$ by $\mathbf{x}+\Delta$. These variations may be implemented in the form of variations of the actual values of $x_i$, or such that a sample, with a given value of $x_i$, enters the analysis with a different statistical weight, also referred to as reweighting throughout this text. Unlike varying the values of $x_i$, reweighting does not rely on a reprocessing of the dataset and therefore generally implies significantly smaller computational costs.
The possibility of implementing prior information about systematic uncertainties already in the NN training is motivated by two considerations. Firstly, a distinction between classes that is powerful in principle can be considerably compromised by systematic uncertainties. Integrating prior knowledge of uncertainties in the NN training helps to guide the NN to focus on features in the input space that are less prone to such a performance degradation. This may even result in a gain for the analysis performance, as observed in Ref. shimmin2017decorrelated. Secondly, the dependence of a systematic variation of a given feature on other parameters in the input space might be only poorly known, or even unknown, and the user might want to generally decorrelate the NN output from this uncertainty to ensure a reliable response of the NN to the given task. Both points raise interest in training the NN with the boundary condition that the dependence of the NN output $f$ on the variation $\Delta$ should be minimal.
One way of achieving this decorrelation of the NN output $f$ from the variation $\Delta$, which has been proposed in the past and to which we will refer in more detail throughout this paper, makes use of a secondary NN that is trained in addition to the primary NN in an iterative procedure, resulting in an adversarial architecture Goodfellow:2014upx for robust binary classification louppe2017learning. This secondary NN has the task of drawing information on the systematic variation from the output of the primary NN. The output of the secondary NN is then included in the loss function of the primary NN as part of a minimax optimization problem. The resulting setup becomes insensitive to the systematic variation of the inputs. This method requires a relatively complex iterative training procedure; it introduces a large and to some extent arbitrary number of new hyperparameters implied by the choice of the architecture of the secondary NN, and requires the resampling of the input parameters $x_i$ within their uncertainties $\Delta_i$.
Another approach to decorrelate the NN output $f$ from the variation $\Delta$ is to include the knowledge about systematic uncertainties in a systematics-aware objective function, as proposed in Refs. deCastro:2018mgh and Charnock:2018ogm. An approach related to boosted decision trees is implemented by splitting the tree nodes using the signal significance including systematic uncertainties as objective, resulting in a classifier that successfully reduces the impact of systematic uncertainties on the result Xia:2018kgd. A similar approach for NNs has been studied in Ref. Elwood:2018qsr. A comparison of systematics-aware learning techniques in high-energy particle physics has been carried out in Ref. estrade:hal-01715155. In addition to the adversarial approach discussed above, this study includes a comparison to data perturbation and augmentation, and tangent propagation NIPS1991_536.
In our approach we implement a penalty on the differences between the NN output $f(\mathbf{x})$ obtained from the nominal value of $\mathbf{x}$ and its variations $f(\mathbf{x}+\Delta)$ directly in the loss function. For this purpose we use histograms of $f(\mathbf{x})$ and $f(\mathbf{x}+\Delta)$ filled during each training batch. The number of histogram bins, $n_k$, and the batch size are hyperparameters of the training. To guarantee a differentiable loss function for the optimization of the trainable parameters of the NN, the histogram bins $k$ are blurred by a filter function $\mathcal{G}_k(f(\mathbf{x}_b))$ applied to each sample $b$ of the training batch, affected by the uncertainty variations, where $b$ corresponds to a single sample represented by a point $\mathbf{x}_b$ associated to each respective training dataset in the input space. We use Gaussian functions $\mathcal{G}_k$, normalized to as filters, where the mean and standard deviation are given by the center and half-width of histogram bin $k$. The count estimate can then be written as $\mathcal{N}_k(\mathbf{x}) = \sum_b \mathcal{G}_k(f(\mathbf{x}_b))$, and the loss function consists of the two parts
$$L = L' + \lambda\,\Lambda(\mathbf{x}, \Delta),$$
where $L'$ corresponds to the loss function of the primary task, for example the cross-entropy function for a classification task, and $\Lambda$ to the term that penalizes differences in the NN function between $\mathbf{x}$ and $\mathbf{x}+\Delta$. The factor $\lambda$ controls the influence of the penalty and adds another hyperparameter to the training. The count estimate $\mathcal{N}_k(\mathbf{x}+\Delta)$ can be derived from $\mathbf{x}$ in terms of reweighting, such that no reprocessing of the dataset during the training procedure is required.
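The penalty on blurred histograms described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `soft_counts`, `penalty`, and `total_loss` are hypothetical, the NN output is assumed to lie in [0, 1], each Gaussian filter is assumed to peak at 1, the systematic variation is assumed to enter as per-sample weights (reweighting), and the penalty is assumed to be the mean over bins of the squared relative difference between the nominal and the varied count estimates.

```python
import math


def soft_counts(outputs, weights, n_bins):
    """Gaussian-blurred histogram of NN outputs, assumed to lie in [0, 1]."""
    half_width = 0.5 / n_bins
    counts = []
    for k in range(n_bins):
        center = (k + 0.5) / n_bins
        # Gaussian filter with peak value 1 (assumption): mean = bin center,
        # standard deviation = bin half-width.
        counts.append(sum(
            w * math.exp(-0.5 * ((f - center) / half_width) ** 2)
            for f, w in zip(outputs, weights)
        ))
    return counts


def penalty(outputs, nominal_weights, varied_weights, n_bins=10, eps=1e-12):
    """Assumed penalty: mean squared relative difference of the soft counts."""
    nominal = soft_counts(outputs, nominal_weights, n_bins)
    varied = soft_counts(outputs, varied_weights, n_bins)
    # eps guards against division by zero in empty bins.
    return sum(
        ((n - v) / (n + eps)) ** 2 for n, v in zip(nominal, varied)
    ) / n_bins


def total_loss(primary_loss, lam, penalty_value):
    """Primary loss plus the penalty, scaled by the hyperparameter lam."""
    return primary_loss + lam * penalty_value
```

In an actual training the same arithmetic would be expressed with the tensor operations of an automatic-differentiation framework, so that gradients of the penalty with respect to the trainable parameters are available; the pure-Python version only shows the bookkeeping.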
In this approach each additional uncorrelated uncertainty simply adds a further penalty term to the sum, $\sum_j \lambda_j \Lambda_j$, for uncorrelated uncertainties $j$. Two fully (anti-)correlated uncertainties should be represented by a common variation for both uncertainties at the same time. While the correlations across uncertainties may not always be exactly known, this knowledge is not strictly required by the method, as long as the loss function converges to its minimum and solves the defined task. The parameters $\lambda_j$ correspond to further hyperparameters, whose values relative to each other define different tasks of the NN training. We would like to emphasize that the use of a histogram of the NN output for the nominal inputs (and for the varied inputs, respectively) in the loss function might lead to a suboptimal performance with respect to the direct use of the NN output. Also, we do not claim the resulting discriminator to be optimal for the final measurement.
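For several uncorrelated uncertainties, the composition of the loss can be sketched as follows. The helper `combined_loss` is hypothetical, and the penalty values and scale factors in the usage line are made-up numbers chosen only to show the arithmetic.

```python
def combined_loss(primary_loss, penalties, scale_factors):
    """Primary loss plus one scaled penalty term per uncorrelated uncertainty."""
    if len(penalties) != len(scale_factors):
        raise ValueError("need exactly one scale factor per penalty term")
    return primary_loss + sum(
        lam * pen for lam, pen in zip(scale_factors, penalties)
    )


# Two uncertainties with different relative importance (illustrative values).
example = combined_loss(0.5, penalties=[0.01, 0.04], scale_factors=[20.0, 5.0])
```

Only the ratios between the scale factors and their size relative to the primary loss matter for what the training optimizes, which is why each factor acts as a separate hyperparameter.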
In Section 2 we demonstrate the method on a simple example based on pseudo-experiments. A more complex analysis task typical for high-energy particle physics is studied in Section 3. We summarize our findings in Section 4.
2 Application to a simple example based on pseudo-experiments
To illustrate our approach, we refer to a simple example based on pseudo-experiments that has also been used in Ref. louppe2017learning. It consists of two variables, $x_1$ and $x_2$, which are the input to separate two classes, in the following labeled as signal and background. The input space is visualized in Fig. 1. A systematic uncertainty for the background class is introduced by two variations of $x_2$ by . We consider only the discrete variations that quantify the difference between $f(\mathbf{x})$ and $f(\mathbf{x}+\Delta)$, which is sufficient to define the part of the loss function to be minimized during the training process. We have checked that a Gaussian sampling with a standard deviation of , as applied in louppe2017learning, would lead to the same result in a more complex setup.
The NN used to solve the classification task consists of two hidden layers with 200 nodes each, with rectified linear units as activation functions glorot2011deep and a sigmoid activation function for the output layer. The trainable parameters are initialized using the Glorot algorithm glorot2010understanding. The optimization is performed using the Adam algorithm kingma2014adam with a batch size of . Our choice for the loss function of the primary task, $L'$, is the cross-entropy function. For the penalty term, $\Lambda$, we use 10 equidistant bins in the range of the NN output. We have not observed any significant performance differences when varying the number of histogram bins within reasonable boundaries. Finally, we set $\lambda$ to 20. The training on events is stopped if the loss, evaluated on an independent validation dataset of the same size, has not decreased for five epochs in sequence. In addition, we use events for testing and to produce the figures that illustrate the result. The impact of the systematic variations on the NN output is shown in Fig. 2 for the case of a classifier trained with a loss function given only by $L'$ ($\lambda=0$) and a classifier based on a loss function including the additional penalty term $\Lambda$ ($\lambda=20$).
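The stopping criterion of the training can be illustrated with a short sketch. It assumes that "has not decreased" means no improvement over the best validation loss seen so far, which the text does not state explicitly; the function name `stopping_epoch` is hypothetical.

```python
def stopping_epoch(validation_losses, patience=5):
    """Return the 0-based epoch at which training stops, or None if it never does.

    Training stops once the validation loss has failed to improve on its
    best value so far for `patience` epochs in sequence (assumed reading).
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None
```

With `patience=5` this mirrors the five-epoch rule quoted above; deep-learning frameworks typically offer an equivalent callback, so a hand-written loop like this one is only needed for illustration.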
As can be seen from Fig. 2, the approach successfully mitigates the dependence of the NN output on the variation of $x_2$ and therefore results in a classifier that is more robust in the presence of this systematic uncertainty. This is achieved at the expense of obliterating at least part, if not all, of the separating information of $x_2$. Fig. 3 visualizes the NN output as a function of the input space spanned by $x_1$ and $x_2$. The additional penalty term, $\Lambda$, leads to the intended alignment of the surface of the NN output with the variation of $x_2$, resulting in similar values of the NN output for all realizations of the systematic variation. We find our approach to have an effect similar to the setup described in louppe2017learning.
3 Application to a more complex analysis task typical for high-energy particle physics
In the following, we apply the proposed method to a more complex task typical for high-energy particle physics. We use a dataset that has been released for the Higgs boson machine learning challenge described in Ref. adambourdarios:hal-01208587. This challenge uses a simplified synthetic dataset from simulated collisions of high-energy proton beams with underlying hypothesized signal and background processes at the CERN LHC. The original target of the challenge was to separate events containing the decay of a Higgs boson into two tau leptons (signal) from all other events (background), to serve as a benchmark for the success of different machine learning algorithms. The consideration of uncertainties, as required for a complete analysis of the data, was not part of it. The dataset contains 30 input parameters, whose exact physical meanings are given in Ref. adambourdarios:hal-01208587. We split the dataset and use one third for training and validation of the NN and two thirds for deriving the following results.
For our example, we use all parameters as input for the NN training. In addition, we introduce a systematic uncertainty, resembling the fact that the momentum and energy of a particle are the results of external measurements with a finite resolution. For our study we assume an uncertainty of Aaboud:2018pen on the transverse momentum $p_{\mathrm{T}}^{\tau}$ of the reconstructed hadronic $\tau$ decay, measured in GeV and labeled as PRI_tau_pt in Ref. adambourdarios:hal-01208587. The distributions of the nominal and varied input parameters are visualized in Fig. 4 (upper row). To allow for migrations in and out of the selected input space due to the systematic variation, we restrict the originally available dataset by raising the lower $p_{\mathrm{T}}^{\tau}$ requirement from 20 to 25 GeV. For the background the distribution of $p_{\mathrm{T}}^{\tau}$ is steeply falling. Thus the variation is dominated by migration effects at the lower boundary, resulting in an overall normalization uncertainty. The signal shows a maximum around , leading to a more apparent additional variation of the shape of the distribution, as shown in Fig. 4 (upper right). The dataset used for these results contains 814.9 (163750) weighted (unweighted) signal events and 162705.0 (238778) weighted (unweighted) background events, using an additional scaling of the weighted number of signal events by a factor of two.
Instead of resampling the signal and background datasets with the varied values of the transverse momentum $p_{\mathrm{T}}^{\tau}$, we introduce the systematic variation in the form of statistical weights. In this way we give a higher (lower) statistical weight to subsamples with low (high) values of $p_{\mathrm{T}}^{\tau}$ with respect to the nominal sample. These weights are determined from the distributions shown in Fig. 4 (upper row) for the background and signal sample, respectively. By construction all correlations across features of the input space are conserved by the reweighting, so that reweighting also leads to shape variations of correlated observables, e.g. the reconstructed missing transverse momentum or the estimate of the invariant di-$\tau$ mass described in Ref. adambourdarios:hal-01208587, as shown in Fig. 4 (lower row). We would like to emphasize that this reweighting technique is in fact the only way to apply a systematic variation of $p_{\mathrm{T}}^{\tau}$ that respects the correlations to all other features of the input space on the given dataset. In a realistic analysis the reweighting technique is not meant to replace the resampling, but rather to complement it. A resampling could and should be applied where correlations across input features may not be desired. To give an example, $p_{\mathrm{T}}^{\tau}$ is mostly determined from track information. Therefore an uncertainty in the missing transverse momentum due to uniformity uncertainties in the calibration of the hadronic calorimeter should not impact $p_{\mathrm{T}}^{\tau}$ with a correlation of 100%. As in the case of the simple example of Section 2, we use only the two discrete shapes corresponding to the shifts in $p_{\mathrm{T}}^{\tau}$, which are a sufficient input for the minimization of the loss function during the training process. Samples of intermediate realizations of these shifts have been checked to lead to the same result despite the more complex setup.
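One way to turn a shift of a single feature into statistical weights is to take the per-bin ratio of the shifted to the nominal histogram and assign that ratio to every event according to its nominal value. The sketch below illustrates this idea only; the function names, the bin edges, the size of the shift, and the choice to leave events outside the binning at weight 1 are assumptions, not details taken from the text.

```python
import bisect


def bin_index(value, edges):
    """Return the bin of `value`, or None if it lies outside the edges."""
    k = bisect.bisect_right(edges, value) - 1
    return k if 0 <= k < len(edges) - 1 else None


def histogram(values, edges, weights=None):
    """Weighted histogram; events outside the edges are dropped."""
    counts = [0.0] * (len(edges) - 1)
    if weights is None:
        weights = [1.0] * len(values)
    for value, weight in zip(values, weights):
        k = bin_index(value, edges)
        if k is not None:
            counts[k] += weight
    return counts


def variation_weights(values, relative_shift, edges):
    """Per-event weights that reshape the nominal histogram into the shifted one."""
    nominal = histogram(values, edges)
    shifted = histogram([v * (1.0 + relative_shift) for v in values], edges)
    ratios = [s / n if n > 0 else 1.0 for s, n in zip(shifted, nominal)]
    weights = []
    for value in values:
        k = bin_index(value, edges)
        # Events outside the binning keep unit weight (assumption).
        weights.append(ratios[k] if k is not None else 1.0)
    return weights
```

Because each event keeps all of its other features and only changes its weight, the correlations between the shifted feature and the rest of the input space are carried along automatically, which is the property the paragraph above relies on.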
The NN has the same architecture as described in Section 2. For the implementation of the penalty term $\Lambda$ we chose 20 equidistant bins in the range of of the NN output for , and . The batch size is set to . The optimization of the trainable parameters is performed on of the training dataset and stopped if the loss, evaluated on the remaining part of the training dataset, has not decreased for 10 epochs in sequence. The results are shown on an independent test dataset. We would like to emphasize that the chosen value of $\lambda$ is a free choice that has been made for illustrative purposes only. In a realistic application the optimal choice of $\lambda$ should be studied on a case-by-case basis.
In Fig. 5 the NN outputs obtained from the training without and with the penalty term $\Lambda$ are shown. As in the case of the simple example given in Section 2, though less pronounced, the training based on a loss function including $\Lambda$ leads to a mitigated dependence of the NN output on the systematic variation of the transverse momentum $p_{\mathrm{T}}^{\tau}$. An important difference between both examples is that the uncertainty of the simple example given in Section 2 is exclusively shape-altering. In contrast, the uncertainty variation in this more complex example includes a significant component acting on the normalization of the NN output, especially for the background distribution. A pure normalization uncertainty that does not lead to noticeable differences in the input space that can be related to its systematic variation cannot be mitigated. In consequence a dominant overall normalization uncertainty, visible especially for the background distribution, is not significantly reduced by the use of $\Lambda$.
In Fig. 6 the distributions of the transverse momentum $p_{\mathrm{T}}^{\tau}$ for signal and background are shown for the full sample and for two signal-enriched subsamples. The latter are obtained by a restriction of the NN output trained without and with the penalty term $\Lambda$, respectively, to a value larger than . On the full sample a generally harder spectrum for the signal is observed with a maximum around 45 GeV, in contrast to a steadily falling and softer spectrum for the background. In the signal-enriched subsample based on the NN output trained without $\Lambda$ the distribution for the background is biased towards the same distribution as for signal. In the signal-enriched subsample based on the NN output trained with $\Lambda$ this bias is alleviated and the distributions for signal and background are qualitatively unchanged with respect to the full sample.
At the LHC experiments the presence of the Higgs boson signal has been inferred from hypothesis tests based on a likelihood ratio between the hypothesis including the Higgs boson signal and the null hypothesis without Higgs boson signal atlas2011procedure. Systematic uncertainties have been incorporated into the likelihoods in the form of nuisance parameters, which might be correlated, e.g., across processes. Best estimates and constraints on these nuisance parameters have been obtained by nuisance parameter optimization. The presence of the signal has been quantified, e.g., by means of its statistical significance in terms of Gaussian standard deviations (s.d.), in the limit of large numbers. To serve our discussion we emulate this discovery scenario in a simplified way, constructing binned likelihoods for the signal and null hypotheses based on the histograms shown in Fig. 5. In addition to the statistical uncertainties of the pseudo-data we incorporate the uncertainty indicated by the bands in Fig. 5 as process- and bin-correlated variations in the likelihoods, bound to a single nuisance parameter, following the prescriptions of atlas2011procedure. The fit of a Higgs boson signal hypothesis with a single signal strength parameter of interest to the pseudo-data, including the signal as expected by theory, leads to a constraint of the uncertainty in this nuisance parameter to 3% of its initial value, both for the NN output trained without and for the NN output trained with the penalty term as input distributions to the fit. This constraint is dominated by the power of the pseudo-data to determine the normalization related to the nuisance parameter, especially in the first bins of the background-dominated pseudo-data sample distribution, e.g., with more than 65 thousand counts in the first bin. When splitting the uncertainty into two independent nuisance parameters, one to govern the pure normalization uncertainty and one to govern the pure shape-altering uncertainty, we find the initial normalization uncertainty to be () for the background (signal) sample.
We anticipate that the implementation with two independent nuisance parameters is not fully correct, but keeping this caveat in mind the study still serves the test we are interested in. After the fit of the Higgs signal hypothesis to the pseudo-data we observe, on the uncertainty in the normalization nuisance parameter, the same constraint as before on the uncertainty in the single nuisance parameter. We observe an correlation between and . The constraint on the uncertainty in the shape nuisance parameter is 0.8 (0.4) for the NN output trained without (with) the penalty term as input distribution to the fit, with a correlation of 55% (5%) to the parameter of interest. We observe similar results when performing a fit of the null hypothesis. The reduction of the correlation of the shape nuisance parameter with the parameter of interest, when using the NN output trained with the penalty term instead of the one trained without it, gives a quantitative measure in this case of the decorrelation of the shape-altering part of the uncertainty from the parameter of interest.
In Fig. 7 the significance of the analyzed signal in the pseudo-data, based on the fit to the null hypothesis, is shown as a function of the hyperparameter $\lambda$, where $\lambda=0$ corresponds to the NN output trained without the penalty term as input to the fit. Using this NN output as input to the fit leads to a significance of 6.7 s.d., corresponding to a combined systematic and statistical relative uncertainty in the parameter of interest of . This significance drops to a value of 5.2 s.d., corresponding to , for . Such a drop is expected, since the transverse momentum $p_{\mathrm{T}}^{\tau}$ plays an important role in the separation of signal and background, not only as a single feature, but also via its correlations to other features in the input space Wunsch:2018oxb. The scan of $\lambda$ in this way visualizes to what extent the separation-relevant information related to $p_{\mathrm{T}}^{\tau}$ in the input space that is vulnerable to the variation of $p_{\mathrm{T}}^{\tau}$ is masked during the training process for increasing values of $\lambda$. The information loss seems small for values of with a significant drop around and a plateau around , which is the value we have chosen for our study. At this point most of the separation-relevant information related to $p_{\mathrm{T}}^{\tau}$ that is vulnerable to the variation of $p_{\mathrm{T}}^{\tau}$ seems to be masked out from the training, such that the NN output turns mostly blind to $p_{\mathrm{T}}^{\tau}$. Implicitly this can also be inferred from Fig. 6, where the distribution of $p_{\mathrm{T}}^{\tau}$ is qualitatively the same for the signal-enriched and the inclusive samples.
In turn the uncertainty on the significance due to the systematic variation drops, roughly proportional to the loss in significance, from (for ) to (for ). We estimate the contribution of the systematic variation in to , with (for ), dropping to (for ). At the same time, and with a larger slope, the absolute contribution of the statistical uncertainty to increases from (for ) to (for ), resulting in the overall decrease of the significance for increasing values of $\lambda$, for the given example. The loss in statistical power stems from the worse separation of signal and background for increasing values of $\lambda$, as is also visible from Fig. 5.
Increasing $\lambda$ to larger and larger values leads to another drop of the significance, which converges to the value for a single counting experiment that does not distinguish between signal and background, in the limit of $\lambda\to\infty$. This can be understood in terms of the penalty term $\Lambda$ completely dominating the loss function, such that the loss function of the primary task, $L'$, will more and more lose influence in the training task. As a consequence the NN will primarily be optimized on the suppression of the variation of the NN output rather than the separation of signal and background.
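The convergence towards a single counting experiment can be made concrete with the standard asymptotic formula for the expected discovery significance of a binned Poisson likelihood ratio, $Z = \sqrt{2\sum_k [(s_k+b_k)\ln(1+s_k/b_k) - s_k]}$. The sketch below is a deliberate simplification of the fit described above: it ignores all nuisance parameters, so it does not reproduce the quoted significances, and the function names and yields are made up for illustration.

```python
import math


def binned_significance(signal, background):
    """Expected discovery significance of a binned counting experiment.

    Uses the asymptotic likelihood-ratio formula without systematic
    uncertainties (a simplification of the fit discussed in the text).
    """
    q = 0.0
    for s, b in zip(signal, background):
        if b <= 0:
            raise ValueError("background expectation must be positive")
        q += 2.0 * ((s + b) * math.log(1.0 + s / b) - s)
    # Guard against tiny negative values from floating-point rounding.
    return math.sqrt(max(q, 0.0))


def counting_significance(signal, background):
    """Significance after merging all bins into a single counting experiment."""
    return binned_significance([sum(signal)], [sum(background)])
```

A discriminator that concentrates the signal in a few bins with little background yields a larger value from `binned_significance` than from `counting_significance` for the same total yields; a NN that no longer separates signal from background spreads both classes identically over the bins, and the two values coincide.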
We would like to point out at the end of this discussion that it is usual practice in a measurement scenario to accept an increase of the statistical uncertainty, which can in principle be controlled by an increase of the dataset, for the benefit of a reduced sensitivity of the measurement to systematic variations of its input parameters, which might be difficult to control. We anticipate though that in the given scenario $\lambda=0$ remains the choice that maximizes the significance of the analysis, despite its larger sensitivity to the systematic variation in this case. Our choice of $\lambda$ should be viewed as a free, while still sensible, choice to showcase the reduction of the influence of the systematic variation on the NN output.
4 Summary
We have presented a new approach to reduce the dependence of the NN output on variations of features of the NN input space due to systematic uncertainties in the measured input parameters. We achieve this reduction by including the variation of the NN output with respect to its value for the nominal inputs in the loss function used for training. Compared to a previously published method using an adversarial technique, the complexity of the presented method is reduced to one additional term in the loss function, with fewer hyperparameters and no further trainable parameters. Systematic variations can be incorporated in the form of statistical weights, which requires no reprocessing and further reduces the complexity of the training. Additional uncertainties just add to the sum of penalty terms in the loss function. In turn the method requires batch sizes large enough to populate the blurred histogram of the NN output used for the evaluation of the variation with respect to the nominal inputs in the loss function.
We have demonstrated the new approach with a simple example directly comparable to a solution of the same task exploiting the adversarial technique, and a more complex analysis task typical for high-energy particle physics experiments. In all cases the dependence of the NN output on the variation of a chosen input parameter is successfully mitigated. In application to a high-energy particle physics measurement this leads to a result less prone to systematic uncertainties, which is of increasing interest in the presence of growing datasets, where statistical uncertainties play a subdominant role in the measurement.
References
- (1) The ATLAS collaboration: Performance of $b$-Jet Identification in the ATLAS Experiment. JINST 11 (04) (2016) P04008
- (2) The CMS collaboration: Identification of heavy-flavour jets with the CMS detector in pp collisions at 13 TeV. JINST 13 (05) (2018) P05011
- (3) The ATLAS collaboration: Reconstruction of hadronic decay products of tau leptons with the ATLAS experiment. Eur. Phys. J. C 76 (5) (2016) 295
- (4) The CMS collaboration: Performance of reconstruction and identification of $\tau$ leptons decaying to hadrons and $\nu_{\tau}$ in pp collisions at $\sqrt{s}=13$ TeV. JINST 13 (10) (2018) P10005
- (5) The ATLAS collaboration: Observation of Higgs boson production in association with a top quark pair at the LHC with the ATLAS detector. Phys. Lett. B 784 (2018) 173–191
- (6) The ATLAS collaboration: Observation of $H\rightarrow b\bar{b}$ decays and $VH$ production with the ATLAS detector. Phys. Lett. B 786 (2018) 59–86
- (7) The CMS collaboration: Observation of $\mathrm{t\overline{t}}$H production. Phys. Rev. Lett. 120 (23) (2018) 231801
- (8) The CMS collaboration: Observation of Higgs boson decay to bottom quarks. Phys. Rev. Lett. 121 (12) (2018) 121801
