L2LFlows: Generating High-Fidelity 3D Calorimeter Images

Sascha Diefenbacher, Engin Eren, Frank Gaede, Gregor Kasieczka, Claudius Krause, Imahn Shekhzadeh, and David Shih

arXiv:2302.11594·physics.ins-det·October 23, 2023

TL;DR

L2LFlows employs layered normalizing flows conditioned on previous layers to generate high-fidelity 3D calorimeter images, significantly improving over existing generative models in simulating photon showers.

Contribution

The paper introduces Layer-to-Layer-Flows, a novel high-dimensional normalizing flow architecture conditioned on multiple layers for improved calorimeter image generation.

Findings

1. L2LFlows outperforms BIB-AE in image fidelity.

2. The model effectively captures layer-to-layer correlations.

3. High-dimensional normalizing flows are feasible for detailed detector simulations.

Abstract

We explore the use of normalizing flows to emulate Monte Carlo detector simulations of photon showers in a high-granularity electromagnetic calorimeter prototype for the International Large Detector (ILD). Our proposed method -- which we refer to as "Layer-to-Layer-Flows" (L $2$ LFlows) -- is an evolution of the CaloFlow architecture adapted to a higher-dimensional setting (30 layers of $10 \times 10$ voxels each). The main innovation of L $2$ LFlows consists of introducing $30$ separate normalizing flows, one for each layer of the calorimeter, where each flow is conditioned on the previous five layers in order to learn the layer-to-layer correlations. We compare our results to the BIB-AE, a state-of-the-art generative network trained on the same dataset and find our model has a significantly improved fidelity.

Tables

Table 1: For the conditioning on the previous $5$ ECal layers, i.e. $n_{\text{cond}}=5$, this table shows the context features each NF gets and their shape before being fed into an embedding network. Here, $N$ denotes the batch size used during training or sampling.

NF $i$	Context features	Context shape
$0$	$E_{0}$ , $E_{inc}$	[ $N$ , $2$ ]
$1$	$ℐ_{0}$ , $E_{1}$ , $E_{inc}$	[ $N$ , $102$ ]
$2$	$ℐ_{0}$ , $ℐ_{1}$ , $E_{2}$ , $E_{inc}$	[ $N$ , $202$ ]
$3$	$ℐ_{0}$ , $ℐ_{1}$ , $ℐ_{2}$ , $E_{3}$ , $E_{inc}$	[ $N$ , $302$ ]
$4$	$ℐ_{0}$ , $ℐ_{1}$ , $ℐ_{2}$ , $ℐ_{3}$ , $E_{4}$ , $E_{inc}$	[ $N$ , $402$ ]
$\geq 5$	$ℐ_{i - 5}$ , $ℐ_{i - 4}$ , $ℐ_{i - 3}$ , $ℐ_{i - 2}$ , $ℐ_{i - 1}$ , $E_{i}$ , $E_{inc}$	[ $N$ , $502$ ]

Table 2: Classifier results for different number of showers, where the left column shows the number of showers per simulator used for the classifier tests (a $60\%:20\%:20\%$ split is made to obtain training, validation and test showers of the classifiers). The middle and right columns show the mean and standard deviation of the AUC of $10$ independent runs for Geant4 vs L $2$ LFlows and Geant4 vs BIB-AE classifiers. Since the mean AUC of the BIB-AE in $10$ independent runs is already very close to $1$ for $95$ k showers, more showers are only used for the Geant4 vs L $2$ LFlows classifiers.

# Showers per simulator	AUC Geant4 vs L $2$ LFlows	AUC Geant4 vs BIB-AE
$95$ k	$0.8518 \pm 0.0042$	$0.9947 \pm 0.0025$
$190$ k	$0.8768 \pm 0.0029$	$-$
$380$ k	$0.8962 \pm 0.0024$	$-$
$760$ k	$0.9402 \pm 0.0011$	$-$

Table 3: For $25$ runs, the mean and the standard deviation of the sampling time per shower as well as the obtained speedup in comparison with Geant4 are shown for different batch sizes and hardware during sampling for Geant4, the BIB-AE and L $2$ LFlows. The GPU is an NVIDIA® A100® with $40$ GB VRAM. For the CPU, an Intel® Xeon® E5-2640 v4 was chosen, and the value for Geant4 is taken from Ref. [12], where the simulated showers have a shape of $30\times 30\times 30$.

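Simulator	Hardware	Batch size	Time per shower, $10$–$100$ GeV [ms]	Speedup
Geant4 ($30 \times 30 \times 30$)	CPU	/	$4081.53 \pm 169.92$	/
L $2$ LFlows ($30 \times 10 \times 10$)	CPU	$1$	$19617.24 \pm 894.08$	$\times 0.2$
L $2$ LFlows ($30 \times 10 \times 10$)	CPU	$10$	$3130.25 \pm 104.74$	$\times 1.3$
L $2$ LFlows ($30 \times 10 \times 10$)	CPU	$100$	$1395.52 \pm 26.55$	$\times 2.9$
L $2$ LFlows ($30 \times 10 \times 10$)	CPU	$1000$	$1338.13 \pm 24.03$	$\times 3.1$
BIB-AE ($30 \times 10 \times 10$)	CPU	$1$	$102.25 \pm 0.64$	$\times 40$
BIB-AE ($30 \times 10 \times 10$)	CPU	$10$	$37.81 \pm 0.13$	$\times \mathbf{110}$
BIB-AE ($30 \times 10 \times 10$)	CPU	$100$	$48.51 \pm 0.01$	$\times 84$
BIB-AE ($30 \times 10 \times 10$)	CPU	$1000$	$48.19 \pm 0.01$	$\times 85$
L $2$ LFlows ($30 \times 10 \times 10$)	GPU	$1$	$22560.34 \pm 263.00$	$\times 0.2$
L $2$ LFlows ($30 \times 10 \times 10$)	GPU	$10$	$2103.58 \pm 18.36$	$\times 1.9$
L $2$ LFlows ($30 \times 10 \times 10$)	GPU	$100$	$213.38 \pm 0.23$	$\times 19$
L $2$ LFlows ($30 \times 10 \times 10$)	GPU	$1000$	$23.14 \pm 0.16$	$\times 180$
L $2$ LFlows ($30 \times 10 \times 10$)	GPU	$2000$	$13.70 \pm 0.03$	$\times 300$
L $2$ LFlows ($30 \times 10 \times 10$)	GPU	$8000$	$9.61 \pm 0.01$	$\times 420$
L $2$ LFlows ($30 \times 10 \times 10$)	GPU	$128000$	$8.62 \pm 0.02$	$\times 470$
BIB-AE ($30 \times 10 \times 10$)	GPU	$1$	$74.22 \pm 3.18$	$\times 55$
BIB-AE ($30 \times 10 \times 10$)	GPU	$10$	$6.85 \pm 0.25$	$\times 600$
BIB-AE ($30 \times 10 \times 10$)	GPU	$100$	$0.91 \pm 0.02$	$\times 4500$
BIB-AE ($30 \times 10 \times 10$)	GPU	$1000$	$0.249 \pm 0.002$	$\times \mathbf{16000}$
BIB-AE ($30 \times 10 \times 10$)	GPU	$2000$	$0.248 \pm 0.001$	$\times \mathbf{16000}$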
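The speedup column of Table 3 appears to be the Geant4 time per shower divided by the mean sampling time of each row, rounded for display. The following minimal sketch reproduces a few entries under that assumption; the helper name `speedup` and the selection of rows are illustrative only.

```python
# Geant4 reference time per shower in ms (first row of Table 3).
GEANT4_MS = 4081.53


def speedup(time_ms: float) -> float:
    """Speedup factor relative to the Geant4 reference time (assumed plain ratio)."""
    return GEANT4_MS / time_ms


# A few (simulator, hardware, batch size, mean time in ms) entries from Table 3.
rows = [
    ("L2LFlows", "CPU", 1000, 1338.13),
    ("BIB-AE", "CPU", 10, 37.81),
    ("L2LFlows", "GPU", 128000, 8.62),
    ("BIB-AE", "GPU", 1000, 0.249),
]

for name, hardware, batch, time_ms in rows:
    print(f"{name:9s} {hardware} batch={batch:<6d} speedup ~ x{speedup(time_ms):.1f}")
```

The unrounded ratios agree with the tabulated values once rounded to one or two significant figures.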

Table 4: Most important hyperparameters for the training of L $2$ LFlows. For the causal flows, the hyperparameters are identical for every NF $i$ learning layer $i$. The noise hyperparameter refers to the elementwise noise that is added during training.

Hyperparameter	energy distribution flow	causal flows
Learning rate	$6 \cdot 10^{-5}$	$6 \cdot 10^{-4}$
Optimizer	ADAM	ADAM
Batch size	$256$	$1024$
# Epochs	$200$	$200$
# MADE blocks	$4$	$4$
# Hidden layers	$1$	$1$
# Hidden nodes	$64$	$128$
# RQS bins	$8$	$8$
Min. bin width/height	$10^{-6}$	$10^{-6}$
Min. derivative	$10^{-6}$	$10^{-6}$
Cutoff value $\alpha$	$10^{-6}$	$10^{-6}$
Permutations	“random”	“reverse”
Dtype training	float32	float32
Dtype generation	float64	float64
Noise	Gaussian ($\mu = 1$ keV, $\sigma = 0.2$ keV)	Uniform ($a = 0$ keV, $b = 1$ keV)
Base distribution	$30$-dim. multivar. Gaussian ($\mu = 0$, $\Sigma = \mathrm{Id}$)	$100$-dim. multivar. Gaussian ($\mu = 0$, $\Sigma = \mathrm{Id}$)

Equations

$p_{X}(x) = p_{Z}(f^{-1}(x)) \cdot \det \frac{\partial f^{-1}(x)}{\partial x},$

$p(E_{0}, \dots, E_{29} \mid E_{\text{inc}}),$

$p(\mathcal{I}_{0}, \dots, \mathcal{I}_{29} \mid E_{0}, \dots, E_{29}, E_{\text{inc}})$

$p_{i}(\mathcal{I}_{i} \mid \mathcal{I}_{0}, \dots, \mathcal{I}_{i-1}, E_{0}, \dots, E_{29}, E_{\text{inc}}), \quad i = 0, \dots, 29$

$p_{i}(\mathcal{I}_{i} \mid \mathcal{I}_{i-n_{\text{cond}}}, \dots, \mathcal{I}_{i-1}, E_{0}, \dots, E_{29}, E_{\text{inc}})$

$p_{i}(\mathcal{I}_{i} \mid G_{i}(\mathcal{I}_{i-n_{\text{cond}}}, \dots, \mathcal{I}_{i-1}, E_{0}, \dots, E_{29}, E_{\text{inc}}))$

$p_{i}(\mathcal{I}_{i} \mid G_{i}(\mathcal{I}_{i-n_{\text{cond}}}, \dots, \mathcal{I}_{i-1}, E_{i}, E_{\text{inc}})).$

$\text{L2LFlows}_{i,j}^{\text{relative}}$

$\text{BIB-AE}_{i,j}^{\text{relative}}$

$E_{i}^{\text{proc}} := \ell_{\alpha}(E_{i}) \quad \forall i \in \{0, \dots, 29\},$

$\ell_{\alpha}(x) = \ln \frac{\alpha + (1 - 2\alpha) x}{1 - \alpha - (1 - 2\alpha) x},$

$E_{\text{inc}}^{\text{proc}} := 2 \cdot \log_{10}(E_{\text{inc}}) - 3 \in [-1, 1],$

$\mathcal{I}_{j}^{\text{logit}} := \ell_{\alpha}\left(\frac{\mathcal{I}_{j}}{\max}\right) \quad \forall j,$

$E_{i} := [i \sum (\mathcal{I}_{i}^{\text{cut}} + \text{noise})],$

$\mathcal{I}_{i}^{\text{cut}} := (\mathcal{I}_{i} \geq \text{threshold}),$

$E_{i}^{\text{proc}} := \log_{10}(E_{i} + \epsilon) + 1,$

$\mathcal{I}_{i}^{\text{pp}} := E_{i} \cdot \mathcal{I}_{i, \geq t} / S(\mathcal{I}_{i, \geq t}),$

$E_{i} \cdot t / S(\mathcal{I}_{i, \geq t}) = \text{desired threshold}.$
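Two of the preprocessing maps in this list, the logit $\ell_{\alpha}$ and the rescaling of the incident energy, are simple enough to check numerically. The sketch below is a plain-Python illustration, not the authors' implementation; the function names are made up for this example, and $\alpha=10^{-6}$ is the cutoff value quoted in Table 4.

```python
import math

ALPHA = 1e-6  # cutoff value alpha from Table 4


def logit_alpha(x: float, alpha: float = ALPHA) -> float:
    """l_alpha(x) = ln[(alpha + (1 - 2 alpha) x) / (1 - alpha - (1 - 2 alpha) x)], x in [0, 1]."""
    u = alpha + (1.0 - 2.0 * alpha) * x  # squeezes [0, 1] into [alpha, 1 - alpha]
    return math.log(u / (1.0 - u))       # 1 - u equals the denominator above


def preprocess_incident_energy(e_inc_gev: float) -> float:
    """E_inc^proc = 2 * log10(E_inc) - 3, mapping [10, 100] GeV onto [-1, 1]."""
    return 2.0 * math.log10(e_inc_gev) - 3.0


print(preprocess_incident_energy(10.0), preprocess_incident_energy(100.0))
print(logit_alpha(0.0), logit_alpha(0.5), logit_alpha(1.0))
```

The shift by $\alpha$ keeps the logit finite at the endpoints $x=0$ and $x=1$.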

Code & Models

Repositories

https://gitlab.com/imahn/l2lflows
PyTorch · Official

Full text

L $2$ LFlows: Generating High-Fidelity $3$ D Calorimeter Images

Sascha Diefenbacher

Engin Eren

Frank Gaede

Gregor Kasieczka

Claudius Krause (corresponding author)

Imahn Shekhzadeh

David Shih

Abstract

We explore the use of normalizing flows to emulate Monte Carlo detector simulations of photon showers in a high-granularity electromagnetic calorimeter prototype for the International Large Detector (ILD). Our proposed method, which we refer to as “Layer-to-Layer Flows” (L $2$ LFlows), is an evolution of the CaloFlow architecture adapted to a higher-dimensional setting ($30$ layers of $10\times 10$ voxels each). The main innovation of L $2$ LFlows consists of introducing $30$ separate normalizing flows, one for each layer of the calorimeter, where each flow is conditioned on the previous five layers in order to learn the layer-to-layer correlations. We compare our results to the BIB-AE, a state-of-the-art generative network trained on the same dataset and find our model has a significantly improved fidelity.

1 Introduction

In order to study Nature at the fundamental level and rigorously test the Standard Model (SM) of particle physics, current and future collider experiments need accurate and plentiful simulations of the detector response. The most precise simulation toolkit in high-energy physics is Geant4 [1, 2, 3]; however, its precision comes at an enormous computational cost. The bulk of this cost is borne by the simulation of individual particle showers in the calorimeter, so much so that it is currently a major bottleneck at the LHC and is forecast to overwhelm the available computational resources without further R&D [4, 5]. This has motivated, in recent years, a growing interest in using deep generative models as fast and accurate emulators of Geant4 simulations [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. For this purpose, a variety of generative architectures, including generative adversarial networks (GANs) [24], variational autoencoders (VAEs) [25], normalizing flows (NFs) [26], and score-based generative models [27, 28, 29] have been considered (for a recent review, see Ref. [30]).

NFs in particular have shown promising fidelity when applied to the simulation of comparatively low-dimensional ($d\sim 500$) calorimeter datasets [14, 15, 22]. NFs are diffeomorphisms between the data space $\mathbb{R}^{d}$ and a latent space with a tractable distribution such as a Gaussian in $\mathbb{R}^{d}$. They are trained by minimizing the negative log-likelihood (NLL), which gives them a more meaningful loss function than the commonly used GANs or VAEs, and tends to result in higher-quality generated samples for calorimeter simulations. However, the drawback of NFs is that they are very memory-intensive, requiring many parameters to encode a sufficiently expressive invertible transformation between the data space and a latent space of the same dimensionality. This has made it challenging to generalize the CaloFlow approach of [14, 15, 22] to higher-dimensional calorimeter datasets. Higher-dimensional calorimeter read-outs are desirable because they improve the accuracy of particle reconstruction and aid in separating overlapping showers. This motivates future detector concepts such as the International Large Detector (ILD) [31, 32], which is one of the proposed detectors at the International Linear Collider (ILC), or the CMS HGCAL at the HL-LHC [33].

This work explores the steps needed to adapt NFs to a higher-dimensional calorimeter dataset, resulting in the new L $2$ LFlows architecture. (Our code is publicly available at https://gitlab.com/Imahn/l2lflows and [34]; our training data is available at [35].) Although the methods we devise are fairly general and should have many potential future applications (including to datasets 2 and 3 of the CaloChallenge [36], which are higher-dimensional than the dataset considered in this work), we will focus on using photon showers in an electromagnetic calorimeter (ECal) prototype for ILD as a testbed, for concreteness. These showers are simulated using Geant4, and projected to a regular grid of $30\times 30\times 30$ voxels (there are $30$ layers in total, each layer having $30$ voxels in the $x$- and $y$-direction respectively). Unlike previous works based on this dataset [12, 16], here we further reduce the transverse dimensionality to $10\times 10$ (by retaining only the central voxels in each layer), resulting in a $30\times 10\times 10$ dataset shape. This is appropriate, since the particle impinges on the center of the $30\times 30\times 30$ cube. The reduction in dimensionality is done to shorten the computation times needed for this first proof-of-concept demonstration of NFs for higher-dimensional calorimeter simulation. As we discuss further in Sec. 5, we expect the generalization to the full $30\times 30\times 30$ -dimensional dataset to be straightforward.

As in the CaloFlow [14, 15, 22] approach, we choose a two-step strategy for the architecture, where we generate the total energy depositions and the shower shapes in each layer separately. The first step — total energy depositions per layer — is extremely lightweight and essentially unchanged from original CaloFlow. We will refer to this first step as energy distribution flow (in [14, 15, 22] it was called “Flow I”). The second step — describing shower shapes in each layer — is where we have innovated beyond the original CaloFlow algorithm. There, a second NF (called “Flow II”) was trained to generate the full shower across all layers, but in our experiments this did not generalize well to the higher-dimensional setting in terms of memory consumption. So instead, here we choose to train $30$ separate NFs, where each NF generates the shower in one specific calorimeter layer, but is conditioned on the voxel energies in the five previous layers. We refer to this step as causal flows and this is the key innovation that allows us to generalize to higher-dimensional datasets. By splitting it into $30$ separate NFs, with conditioning from layer to layer, we keep both the memory requirements and the fidelity of the generated showers commensurate with original CaloFlow.

We will see that this new approach yields superior performance along several performance metrics compared to the state-of-the-art Bounded Information Bottleneck Autoencoder (BIB-AE) [12, 16, 18] architecture. In addition, this approach generalizes naturally to more irregularly-shaped detector voxelizations and allows for parallel training on multiple GPUs.

The structure of this paper is as follows: In Sec. 2, the dataset is introduced in more detail; Sec. 3 describes our architecture; Sec. 4 shows our results; and finally Sec. 5 concludes and gives an outlook for future work.

2 Dataset

The ILD [31, 32] is one of two proposed detector concepts for the ILC. As a modern detector concept, the ILD is specifically optimized for particle flow algorithms (PFAs) [37, 38], which aim at the correct reconstruction of every individual particle created in the event. One key requirement for PFAs is a precise and highly granular set of hadronic and electromagnetic calorimeters.

The ILD ECal that is used as the basis of this study is a sampling calorimeter with $30$ alternating layers with passive tungsten absorbers and active silicon sensors. The first $20$ absorber layers have a thickness of $2.1$ mm with the subsequent $10$ layers being twice as thick. Each silicon layer features individual cells with a size of $5\times 5$ mm$^{2}$.

We utilize a dataset containing $950$ k photon showers simulated in the detailed and realistic detector model of ILD, implemented in DD4hep [39] in the iLCSoft [40] framework. This dataset was used in past work on generative calorimeter simulation [12], where it is described in more detail. For this work, the $950$ k showers are split into $760$ k training, $95$ k validation and $95$ k test showers. These showers originate from photons with energies uniformly distributed between $10$ and $100$ GeV and were simulated using Geant4 version $10.4$ (with the QGSP_BERT physics list). All photons hit the calorimeter at the same position and perpendicularly to the calorimeter layers. The coordinate system used in this work defines the $z$ -axis to be parallel to the trajectory of the photons, with the $xy$ -plane being parallel to the calorimeter layers. For the classifier tests in Sec. 4.2, further $665$ k independent test showers are available.

In addition, for some comparison plots, we also use independent test sets containing $4$ k showers with discrete energies between $20$ and $100$ GeV in $10$ GeV steps. These discrete incident energies will be used to study the linearity and width amongst other quantities.

While previous generative projects on this dataset [12, 16] used a data shape of $30\times 30\times 30$ , this work focuses on the core of the showers located in a $10\times 10$ cell region in the $xy$ -plane around the impact point to reduce the computation times and the memory footprint of the generative models. The cores of the showers still contain $92\%$ of the shower energy. This results in a data shape of $30\times 10\times 10$ where the first dimension indicates the depth along the propagation direction of the shower.

3 L $2$ LFlows

Our approach uses NFs [26] to learn the probability density of showers in the calorimeter conditioned on the incident energy, $p(\text{shower}|\text{incident energy})$ . NFs efficiently learn a change-of-variables transformation

$$p_{X}(x)=p_{Z}(f^{-1}(x))\cdot\det\frac{\partial f^{-1}(x)}{\partial x}, \tag{3.1}$$

where $p_{X}$ and $p_{Z}$ are probability density functions in data and latent space, respectively, $f:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is a diffeomorphism between the two spaces, $x=f(z)$ and $\partial f^{-1}/\partial x$ denotes the Jacobian matrix of $f^{-1}$ . Since Eq. (3.1) gives us access to the negative log-likelihood (NLL) of data points, NFs can be trained by minimizing the NLL directly.
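To make the change-of-variables formula concrete, the following sketch evaluates it for a one-dimensional affine map $x=f(z)=a z+b$ with a standard normal latent distribution. This toy flow is an assumption chosen for illustration; it is not the MAF/RQS architecture used in this work, and the function names are hypothetical.

```python
import math


def log_standard_normal(z: float) -> float:
    """Log-density of the latent distribution p_Z = N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def log_prob_affine_flow(x: float, scale: float, shift: float) -> float:
    """log p_X(x) for x = f(z) = scale * z + shift."""
    z = (x - shift) / scale           # f^{-1}(x)
    log_det = -math.log(abs(scale))   # log |d f^{-1} / dx|; equals log det for scale > 0
    return log_standard_normal(z) + log_det


def nll(samples, scale: float, shift: float) -> float:
    """Mean negative log-likelihood, the quantity minimized during training."""
    return -sum(log_prob_affine_flow(x, scale, shift) for x in samples) / len(samples)


data = [0.8, 1.3, 2.9, -0.6, 1.1]
print(nll(data, scale=1.0, shift=0.0), nll(data, scale=1.5, shift=1.0))
```

For this toy flow, $p_{X}$ is simply the Gaussian $\mathcal{N}(b,a^{2})$, which gives an independent check of the formula.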

In our case, $X$ will be the voxelized energy deposits of individual calorimeter showers and $Z$ a multivariate Gaussian distribution. In order to compute the Jacobian determinant efficiently, we use autoregressive transformations realized as masked autoregressive flows (MAFs) [41], built with rational quadratic splines (RQS) [42, 43] and Masked Autoencoder for Distribution Estimation (MADE) blocks [44]. The NFs are implemented with the help of the nflows package [45] in PyTorch [46].

In the following subsections, we will describe the two parts of L $2$ LFlows, the energy distribution flow and causal flows, which generate the layer energies and shower shapes respectively.

3.1 energy distribution flow

The task of the energy distribution flow is to learn the total energy depositions per ECal layer (note that we start counting the ECal layers from $0$ instead of $1$), which is described by the following conditional PDF:

$$p(E_{0},\dots,E_{29}\mid E_{\text{inc}}), \tag{3.2}$$

where $E_{\text{inc}}$ denotes the incident particle energy, and $E_{i}$ is the energy deposited by the shower in layer $i$ , obtained by summing over all voxels in the given layer $i$ .

Within a sampling calorimeter, it is necessary to apply an energy threshold to account for the fact that calorimeters have inherent electronic noise, and thus depositions that are too small become unreliable. We, therefore, apply a cutoff to the individual voxel energies with a threshold of $10^{-4}$ GeV before calculating the layer energies $E_{i}$ . This threshold corresponds to half the energy loss of a minimum-ionizing particle in the ILD ECal [12].
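As an illustration of this step, the sketch below applies the $10^{-4}$ GeV cutoff to the voxel energies and then sums each layer. It assumes a shower stored as a list of layers, each a flat list of voxel energies in GeV; this layout and the function name are choices made for the example only.

```python
THRESHOLD_GEV = 1e-4  # voxel cutoff quoted in the text


def layer_energies(shower, threshold: float = THRESHOLD_GEV):
    """Return E_i for every layer: the sum of voxel energies at or above the cutoff.

    `shower` is a list of layers; each layer is a flat list of voxel energies in GeV.
    """
    return [sum(v for v in layer if v >= threshold) for layer in shower]


toy_shower = [
    [5e-5, 2e-4, 1e-3],   # first voxel is below the cutoff and is dropped
    [0.0, 0.0, 3e-3],
]
print(layer_energies(toy_shower))
```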

The energy distribution flow is lightweight and, as such, does not present a bottleneck on computation times. Therefore, the model closely follows the original CaloFlow approach. The most noteworthy change is the modified preprocessing. Since the ILD ECal is a sampling calorimeter, only a fraction of the energy of a particle is recorded. Thus, it was not necessary to enforce a strict energy upper limit $\sum_{i}E_{i}\leq E_{\text{inc}}$ through preprocessing, as was the case in the original CaloFlow. Instead, we choose a simpler preprocessing, outlined in App. B.

The energy distribution flow architecture is shown in Fig. 1 and details of the model and its training can be found in App. A. By construction of MAFs, generation proceeds recursively over the dimensions of the data. In total, the energy distribution flow has about $200$ k parameters and was trained for $200$ epochs on a single NVIDIA® V100® with $32$ GB VRAM, which took less than $8$ hours. We subsequently use the validation NLL to select the best checkpoint among the $200$ epochs.

3.2 causal flows

Next, we turn to the second step of the generation process: generating shower shapes conditioned on the total incident energy and the total deposited energies in each layer. Our overarching goal here, as in the original CaloFlow, is to learn

$$p(\mathcal{I}_{0},\dots,\mathcal{I}_{29}\mid E_{0},\dots,E_{29},E_{\text{inc}}) \tag{3.3}$$

where the ECal voxel energy depositions of layer $i$ are denoted by $\mathcal{I}_{i}\in{\mathbb{R}}^{100}$ . Unlike in Sec. 3.1, no cutoff is applied to the voxel energy depositions used in the causal flows training. This prevents potential sharp edges in the voxel data, which would be caused by the cutoff, from interfering with the training of the causal flows. (For the energy distribution flow, this issue was already circumvented, as each layer energy is the aggregate of multiple voxels, lessening any potential edges.) The voxel energy depositions are preprocessed similarly to the layer energies used in the energy distribution flow. The precise nature of the preprocessing is outlined in App. B.

In the original CaloFlow, a single NF was trained on all the calorimeter voxels of every layer together, to directly learn (3.3). Since the number of parameters of a single NF scales quadratically with the dimensionality $d$ of the samples, the single-NF approach of original CaloFlow applied to the ILD dataset (which has $d=3000$ ) would lead to a prohibitive number of parameters ( $>1$ B). One can attempt to reduce the number of parameters by decreasing the number of MADE blocks as well as RQS bins, but this leads to a significantly reduced fidelity.
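The effect of the quadratic scaling can be gauged with a deliberately crude proxy: take the cost of one flow to be $d^{2}$ and compare a single flow with $d=3000$ to $30$ flows with $d=100$. The sketch below does only this arithmetic; it ignores conditioning inputs, embedding networks and all architectural details, so the ratio is indicative at best.

```python
def relative_cost(dims):
    """Sum of d**2 over flows: a crude proxy for quadratic parameter scaling."""
    return sum(d * d for d in dims)


single_flow = relative_cost([3000])        # one NF over all 3000 voxels
per_layer_flows = relative_cost([100] * 30)  # thirty NFs over 100 voxels each

print(single_flow / per_layer_flows)  # 30.0
```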

To reduce the number of parameters without sacrificing quality, our key idea here is to instead train one NF per ECal layer. Since the evolution of a shower in layer $i$ depends on what happened in the previous layers, NF $i$ has to be conditioned on the voxel energy depositions of the previous layers. In other words, we endeavor to train $30$ separate NFs to learn the distributions:

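$$p_{i}(\mathcal{I}_{i}\mid\mathcal{I}_{0},\dots,\mathcal{I}_{i-1},E_{0},\dots,E_{29},E_{\text{inc}}),\quad i=0,\dots,29 \tag{3.4}$$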

If each distribution $p_{i}$ could be learned perfectly, then they could be multiplied together to reconstruct the full joint distribution (3.3). This would be in effect its own kind of autoregressive model. However, in later layers, there are a lot of conditioning features, and we observed that attempting to model the full conditional likelihood (3.4) resulted in suboptimal performance.

Instead, we found it beneficial to approximate the full conditional distribution (3.4) with:

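$$p_{i}(\mathcal{I}_{i}\mid\mathcal{I}_{i-n_{\text{cond}}},\dots,\mathcal{I}_{i-1},E_{0},\dots,E_{29},E_{\text{inc}}) \tag{3.5}$$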

i.e. to truncate the conditional at $n_{\text{cond}}$ previous layers. Due to the computational cost, a complete scan over $n_{\text{cond}}$ was not possible, but some small trials convinced us that $n_{\text{cond}}=5$ gave a reasonably good balance between number of parameters and performance.

In an effort to further reduce the number of parameters in the NFs, the models are not directly conditioned on the full previous layers. Instead, these layers are passed through an embedding network $G_{i}$ :

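$$p_{i}(\mathcal{I}_{i}\mid G_{i}(\mathcal{I}_{i-n_{\text{cond}}},\dots,\mathcal{I}_{i-1},E_{0},\dots,E_{29},E_{\text{inc}})) \tag{3.6}$$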

In an ablation study, we found the performance difference between conditioning on only $E_{i}$ and $E_{\text{inc}}$ and conditioning on all $E_{0}$, $\dots$, $E_{29}$ and $E_{\text{inc}}$ to be small. For simplicity, we therefore condition only on $E_{i}$ and $E_{\text{inc}}$, and our PDF from Eq. (3.6) simplifies to

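$$p_{i}(\mathcal{I}_{i}\mid G_{i}(\mathcal{I}_{i-n_{\text{cond}}},\dots,\mathcal{I}_{i-1},E_{i},E_{\text{inc}})). \tag{3.7}$$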

The embedding network $G_{i}$ takes in the context features ${\mathcal{I}}_{i-n_{\text{cond}}}$, $\dots$, ${\mathcal{I}}_{i-1}$, $E_{i}$, $E_{\text{inc}}$ and learns a representation of them that minimizes the NLL loss of the NFs. It is trained jointly with the NF, and there are different kinds of architectures one can consider, e.g. a fully-connected network, a recurrent neural network, etc. We use a fully-connected embedding network (having two hidden layers with $256$ and $128$ nodes, respectively), with an output that is $64$ -dimensional. With this embedding network, we observed no loss in performance, with a reduction of $1.8$ M in the number of parameters when comparing Eq. (3.7) with Eq. (3.5). Table 1 shows the context features for each NF. For NFs $0$ to $4$, there are fewer than $5$ preceding ECal layers, thus they have fewer than $502$ context features. (NF $0$, which learns the distribution of the voxel energies of layer $0$, does not use an embedding network, since it is only conditioned on $E_{0}$ and $E_{\text{inc}}$.)
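The context shapes in Table 1 follow from counting $100$ voxels for each available preceding layer, up to $n_{\text{cond}}=5$, plus the two scalars $E_{i}$ and $E_{\text{inc}}$. A short sketch of this bookkeeping, with a hypothetical helper name and string labels used purely for illustration:

```python
N_COND = 5              # number of preceding layers used for conditioning
VOXELS_PER_LAYER = 100  # 10 x 10 voxels per ECal layer


def context_features(i: int, n_cond: int = N_COND):
    """Return (feature names, number of context features) for NF i."""
    first = max(0, i - n_cond)  # earliest preceding layer that is used
    names = [f"I_{k}" for k in range(first, i)] + [f"E_{i}", "E_inc"]
    size = VOXELS_PER_LAYER * (i - first) + 2
    return names, size


for i in (0, 1, 4, 5, 29):
    names, size = context_features(i)
    print(i, size, names)
```

The sizes $2$, $102$, $\dots$, $502$ match the second dimension of the context shapes in Table 1.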

Because of this conditioning scheme, generation happens recursively; however, training can happen in parallel on multiple GPUs, since all required context features are derived from the training data. For example, to generate the voxel energies in layer $2$, those of layers $0$ and $1$ must be generated first, and then NF $2$ is conditioned on the voxel energies of the previous layers as well as $E_{2}$ and $E_{\text{inc}}$. In this way, the whole calorimeter can be traversed. The architecture of the causal flows is visualized in Fig. 2. A more detailed description of the model can again be found in App. A.
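The recursive generation can be sketched as a loop over layers in which each step sees at most the five most recently generated layers. In the sketch below, `sample_layer` is a hypothetical stand-in for a trained NF $i$ and simply draws random numbers; only the control flow is meant to be representative.

```python
import random

N_LAYERS, N_VOXELS, N_COND = 30, 100, 5


def sample_layer(i, previous_layers, e_i, e_inc, rng):
    """Hypothetical stand-in for NF i: returns 100 voxel values.

    A trained flow would use the context (previous_layers, e_i, e_inc);
    this stub ignores it and draws uniform random numbers.
    """
    return [rng.random() for _ in range(N_VOXELS)]


def generate_shower(layer_energies, e_inc, seed=0):
    """Traverse the calorimeter layer by layer, conditioning on earlier layers."""
    rng = random.Random(seed)
    shower = []
    for i in range(N_LAYERS):
        context = shower[max(0, i - N_COND):i]  # up to five preceding layers
        shower.append(sample_layer(i, context, layer_energies[i], e_inc, rng))
    return shower


shower = generate_shower([1.0] * N_LAYERS, e_inc=50.0)
print(len(shower), len(shower[0]))
```

During training no such loop is needed, because the preceding layers are read directly from the training showers.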

During generation, it turns out that conditioning on $E_{i}$ is not sufficient to guarantee that the per-layer energies of the sampled showers equal $E_{i}$. Hence, some postprocessing is necessary: rescaling to the $E_{i}$ of the energy distribution flow and thresholding of low-energy voxels. We detail our method in App. B and illustrate its effect in Fig. 13.
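One possible way to combine rescaling and thresholding is sketched below. It is an interpretation, not the procedure of App. B: voxels are dropped iteratively until every surviving voxel, after rescaling the layer to sum to $E_{i}$, lies at or above the desired threshold. The function name and the fixed-point iteration are assumptions made for this example.

```python
def postprocess_layer(voxels, e_layer, desired_threshold=1e-4):
    """Rescale a generated layer to sum to e_layer and zero out voxels that
    would fall below desired_threshold after rescaling (illustrative only)."""
    kept = [v > 0.0 for v in voxels]
    while True:
        total = sum(v for v, k in zip(voxels, kept) if k)
        if total <= 0.0:
            return [0.0] * len(voxels)  # nothing survives the threshold
        scale = e_layer / total
        new_kept = [k and v * scale >= desired_threshold for v, k in zip(voxels, kept)]
        if new_kept == kept:            # fixed point: all survivors pass the cut
            return [v * scale if k else 0.0 for v, k in zip(voxels, kept)]
        kept = new_kept                 # fewer voxels kept, so the loop terminates


print(postprocess_layer([1.0, 1.0, 1e-6], e_layer=0.01))
```

Removing voxels only increases the scale factor, so voxels that pass the cut in one iteration continue to pass it in the next.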

The causal flows have in total $44.8$ M parameters, and they were trained on a single NVIDIA® V100® with $32$ GB VRAM for about $55$ GPU-hours. As for the energy distribution flow, we use the validation NLL to select the best checkpoint among the $200$ epochs.

4 Results

We now evaluate the performance of the L $2$ LFlows approach. We benchmark it against a state-of-the-art shower generation model based on the BIB-AE framework, adapted from Ref. [18] and modified to operate on the photon showers with shape $30\times 10\times 10$ by retraining it. The BIB-AE consists of an encoder and a decoder pair, which is trained using a set of adversarial critics. The BIB-AE generation process employs an additional post-processing step and a Kernel-Density-Estimation–based latent sampling, as described in Ref. [18]. The BIB-AE model and PostProcessor model have a combined total of $9.3$ M parameters, while the critics used to train them have an additional $3.7$ M parameters.

4.1 Distributions

Figure 3 shows a single test shower of Geant4 as well as a generated shower from the BIB-AE and L $2$ LFlows each. All single showers have an incident energy $E_{\text{inc}}\approx 50$ GeV. We see that the individual shower from L $2$ LFlows looks reasonable, with a broadly realistic morphology of voxels and energy depositions.

Figure 4 shows the overlay of $95$ k showers, i.e. the mean of the voxel energies of $95$ k showers. In order to create two-dimensional plots, the voxel energies are summed over the $z$ -, $x$ - or $y$ -axis. For Geant4, the $95$ k test showers are used. To highlight potential differences for the BIB-AE and L $2$ LFlows, we show the absolute relative deviation to Geant4 for both generative networks per voxel:

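$$\text{L2LFlows}^{\text{relative}}_{i,j}:=\frac{|\text{L2LFlows}_{i,j}-\text{Geant4}_{i,j}|}{\text{Geant4}_{i,j}},\qquad\text{BIB-AE}^{\text{relative}}_{i,j}:=\frac{|\text{BIB-AE}_{i,j}-\text{Geant4}_{i,j}|}{\text{Geant4}_{i,j}}, \tag{4.1}$$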

where $i$ and $j$ denote voxel positions. We observe that in general the generative models capture the overlay quite well, with L $2$ LFlows having smaller deviations from Geant4 than the BIB-AE.
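A per-voxel absolute relative deviation of this kind can be computed as in the following sketch, which assumes the standard form $|\text{generated}-\text{reference}|/\text{reference}$ and two overlays stored as nested lists; voxels with a vanishing reference value are marked as undefined. The function name is hypothetical.

```python
def absolute_relative_deviation(generated, reference):
    """Per-voxel |generated - reference| / reference for two 2-D overlays."""
    return [
        [abs(g - r) / r if r != 0.0 else float("nan") for g, r in zip(g_row, r_row)]
        for g_row, r_row in zip(generated, reference)
    ]


geant4_overlay = [[1.0, 2.0], [4.0, 0.5]]
model_overlay = [[1.1, 1.8], [4.0, 0.6]]
print(absolute_relative_deviation(model_overlay, geant4_overlay))
```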

To compare the performance of the generative models in more detail, we start by looking at the showers on the voxel level. Figure 5 shows the distributions of voxel energies as well as the sparsity, i.e. the number of non-zero voxels per shower. One characteristic that repeats itself in several histograms is that the BIB-AE is not capable of capturing the full Geant4 distribution, which can e.g. be seen in the sparsity plot. L $2$ LFlows is much better in this regard. Further, the energy deposited around the energy of a minimum ionizing particle (MIP) in the voxel distribution is better modeled by L $2$ LFlows in comparison to the BIB-AE, which slightly overshoots it. While L $2$ LFlows does not learn the Geant4 distribution perfectly, it learns the distributions much better than the BIB-AE.

For $E_{\text{inc}}\in\{20,80\}$ GeV, Fig. 6 shows the energy profiles in $x$ -, $y$ - and $z$ -direction. As can be seen, the larger the incident energy $E_{\text{inc}}$ , the more the maximum in the energy profiles shifts to later layers, which both the BIB-AE and L $2$ LFlows are able to learn. Deviations for both simulators mainly exist in a few initial and final layers.

The distributions in Fig. 7 show the total energy depositions ( $E_{\text{depos}}:=\sum_{i}E_{i}$ ), both for continuous incident energies uniformly distributed in $[10,100]$ GeV (left) and for discrete incident energies $E_{\text{inc}}\in\{20,50,80\}$ GeV (right). In both of these distributions we observe that L $2$ LFlows is much closer to the Geant4 distribution than the BIB-AE.

Figure 8 shows the linearity (and its relative deviation to Geant4) as well as the width (again with its relative deviation). These quantities do not correspond to the actual calorimeter linearity or resolution, as the increased thickness of the last $10$ ECal layers is not calibrated for; they are, however, still a vital means for determining the performance of the generative approaches. The linearity $\mu_{90}$ is defined as the mean deposited energy over the ECal for discrete $E_{\text{inc}}$ of the $90\%$ subset of the samples that has the smallest range. The width $\rho_{90}$ is defined as $\rho_{90}:=\mu_{90}/\sigma_{90}$, where $\sigma_{90}$ is the standard deviation of the same $90\%$ subset of the energy deposition samples. For the linearity, the relative deviation of the BIB-AE is at most about $1\%$, while for L $2$ LFlows the deviation is everywhere below $0.75\%$. For the width (one might be tempted to call $\rho_{90}$ the “resolution”, but because of the different thicknesses of the tungsten absorber layers, cf. Sec. 2, this is not the case [12]), the relative deviation for L $2$ LFlows is everywhere below $5\%$, whereas for the BIB-AE, the maximum deviation is about $15\%$.
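The smallest-range $90\%$ subset can be found by sorting the samples and sliding a window of fixed size over them. The sketch below computes $\mu_{90}$ and $\sigma_{90}$ this way; rounding the subset size to the nearest integer and using the population standard deviation are assumptions of this example, and the function names are hypothetical.

```python
import math


def smallest_range_subset(values, fraction=0.9):
    """Contiguous window of the sorted values that covers `fraction` of the
    samples and has the smallest max-minus-min range."""
    xs = sorted(values)
    n = len(xs)
    k = max(1, round(fraction * n))  # subset size, rounded to the nearest integer
    best = min(range(n - k + 1), key=lambda s: xs[s + k - 1] - xs[s])
    return xs[best:best + k]


def mu_sigma_90(values):
    """Mean and (population) standard deviation of the smallest-range 90% subset."""
    subset = smallest_range_subset(values)
    mu = sum(subset) / len(subset)
    var = sum((x - mu) ** 2 for x in subset) / len(subset)
    return mu, math.sqrt(var)


print(mu_sigma_90([1.0] * 9 + [100.0]))  # the outlier at 100 is excluded
```

Restricting to this subset makes both quantities insensitive to rare outliers in the tails of the deposited-energy distribution.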

It is also interesting to examine the ratio of $E_{\text{depos}}$ over $E_{\text{inc}}$ plotted as a function of $E_{\text{inc}}$ . The upper row of Fig. 9 shows that the functional form of the ratio is not constant for Geant4. While a perfect calorimeter would yield a constant ratio for Geant4, in practice, because of leakage and the increased thickness of the last ten absorber layers, the curve falls off over the range. The fact that the ratio of the deposited over the incident energy is only $\mathcal{O}(1\%)$ is expected, as the ILD ECal is a sampling calorimeter. As becomes apparent from Fig. 9, L $2$ LFlows learns the functional form much better than the BIB-AE. In particular, the BIB-AE has problems at the edges. At the left edge, i.e. for $E_{\text{inc}}\approx 10$ GeV, ratios of $2\%$ and more are too populated compared to Geant4, yet ratios of around $1.5\%$ and less are too thinly populated. At the right edge, i.e. for $E_{\text{inc}}\approx 100$ GeV, the functional form falls off too quickly. Further, in the middle row of Fig. 9, we show the sparsity plotted against $E_{\text{depos}}$ . The BIB-AE learns a distribution that is thinner compared to the one from Geant4, and its core has too many occurrences. For L $2$ LFlows, the agreement to the Geant4 distribution is much better, and differences are barely visible by eye. Finally, the last row of Fig. 9 shows the $2$ D correlations for the center of gravity in $z$ -direction versus the total deposited energy. It can be seen that the BIB-AE is yet again not capturing the full distribution, as its $2$ D plot is more compact compared to Geant4. In contrast, L $2$ LFlows exhibits a superb performance.

In addition, Fig. 10 shows correlation matrices for pairwise Pearson correlation coefficients between several high-level observables for Geant4 and the difference of Geant4 to the BIB-AE and L $2$ LFlows. The observables are, in order of appearance, the first and second moments along the $x$, $y$, and $z$ directions, the visible energy sum, the incident photon energy, the number of hits, and the energy fractions in the three thirds of the calorimeter along the $z$-direction. More details can be found in Ref. [12]. It can be seen that both generative models correctly describe a large number of the investigated pairwise correlations. Both models do, however, struggle with specific correlations involving the second moments in the $x$- and $y$-direction.
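A comparison of this kind can be sketched as follows, assuming the high-level observables have already been computed and stacked into arrays of shape (number of showers, number of observables). The function name `correlation_difference` and the toy data are hypothetical.

```python
import numpy as np

def correlation_difference(obs_reference, obs_generated):
    """Pearson correlation matrix of the reference and its difference to a generator.

    Both inputs have shape (n_showers, n_observables).
    """
    corr_ref = np.corrcoef(obs_reference, rowvar=False)
    corr_gen = np.corrcoef(obs_generated, rowvar=False)
    return corr_ref, corr_ref - corr_gen

rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 3))
ref[:, 2] = ref[:, 0] + 0.1 * ref[:, 2]  # make observable 2 track observable 0
gen = rng.normal(size=(1000, 3))         # toy generator without that correlation
corr_ref, diff = correlation_difference(ref, gen)
```

Entries of `diff` close to zero indicate pairs of observables whose correlation the generator reproduces; large entries flag the pairs it misses.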

Finally, Fig. 11 shows the total energy depositions per layer for four selected layers. In layers $2$ , $8$ , $14$ and $26$ , L $2$ LFlows is at least comparable to the BIB-AE, if not better.

Judging from the histograms and plots shown so far, L $2$ LFlows seems to outperform the BIB-AE in almost every physics quantity; however, it does slightly worse in capturing pairwise correlations.

In order to judge the performance more comprehensively in the full multivariate phase space, various metrics have been suggested in [14, 47, 48]. For this comparison, we turn to classifier-based tests described in [14, 48] in the following subsection, and leave the exploration of other metrics suggested in [47] as a future research direction.

4.2 Classifier Tests

As in [14, 15, 22], we now turn to a classifier-based metric to evaluate the quality of the generated showers in the full $3000$-dimensional phase space. In total, two binary fully-connected classifiers are trained, one on Geant4 vs BIB-AE generated showers, the other on Geant4 vs L $2$ LFlows generated showers. Both classifiers have the same architecture and make use of the same hyperparameters; details can be found in App. C. The idea of the classifier metric is that if the classifier is optimal, then by the Neyman-Pearson lemma it directly computes the likelihood ratio $p_{\text{generated}}(x)/p_{\text{reference}}(x)$ in the full phase space. A perfect generative model should have $p_{\text{generated}}=p_{\text{reference}}$ and optimal classifier scores that are identically 0.5 [footnote 9]. For an imperfect generative model, the optimal classifier should be the most powerful detector of any deviations from $p_{\text{generated}}=p_{\text{reference}}$.

Footnote 9: Indeed, we find an AUC of 0.5 when training on Geant4 vs Geant4 samples.
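The relation between classifier score, likelihood ratio, and AUC can be made concrete with a small sketch. It assumes balanced classes, so that a score $s(x)$ for the "generated" class corresponds to a likelihood-ratio estimate $s/(1-s)$; the function names are hypothetical and the pairwise AUC is only practical for small samples.

```python
import numpy as np

def likelihood_ratio(p_generated):
    """Estimate p_generated(x) / p_reference(x) from a classifier probability,
    assuming the two classes were balanced during training."""
    p = np.clip(np.asarray(p_generated, dtype=float), 1e-12, 1.0 - 1e-12)
    return p / (1.0 - p)

def auc(scores_reference, scores_generated):
    """Probability that a generated shower scores higher than a reference
    shower, with ties counted as one half."""
    ref = np.asarray(scores_reference, dtype=float)[:, None]
    gen = np.asarray(scores_generated, dtype=float)[None, :]
    return float(np.mean((gen > ref) + 0.5 * (gen == ref)))

auc_perfect_model = auc([0.5, 0.5], [0.5, 0.5])  # scores carry no information
auc_separable = auc([0.1, 0.2], [0.8, 0.9])      # fully separable samples
```

A score of 0.5 everywhere gives a likelihood ratio of 1 and an AUC of 0.5, while fully separable scores give an AUC of 1.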

Of course, it is never possible, given finite samples and finite model capacity, to learn the truly optimal classifier. Therefore, the classifier metric we evaluate here is at best an approximate measure of model quality. At most, we could expect the classifier AUC score we obtain here to be a lower bound on the true AUC score that would be given by the optimal classifier. However, given identical model architectures and training set sizes, we expect the relative comparison of binary classifier scores between Geant4 vs BIB-AE and Geant4 vs L $2$ LFlows to still be meaningful and informative.

The results of $10$ classifier trainings are shown in Tab. 2. As can be observed, the BIB-AE–generated showers allow for almost perfect classification, which reflects itself in an AUC close to $1$ . The L $2$ LFlows-generated showers, on the other hand, are much better able to fool such a classifier. However, we note that there is still some separation power to Geant4-generated showers, as the mean AUC of the classifiers is far away from $0.5$ .

In Tab. 2, we also go beyond previous works and study the dependence of the classifier metric on the training sample size. (We only studied this dependence for L $2$ LFlows, since the AUC for the BIB-AE is already very close to $1$ when trained on only $95$k showers.) As becomes apparent, the mean AUC of L $2$ LFlows worsens with more showers, which is unsurprising, as with more statistics, the classifier can find more differences between the Geant4- and L $2$ LFlows-generated showers. At an even larger number of showers used for classifier training, we would expect the finite size of the generator training set to become an issue, too [49, 50, 51]. Nevertheless, we observe that for a given number of showers, the BIB-AE showers are more separable from Geant4 than the L $2$ LFlows showers, indicating a better performance of L $2$ LFlows. Also, even though the AUC scores for Geant4 vs. L $2$ LFlows are worsening with more training data (and may be asymptoting to $1$; there is insufficient training data to say for sure), the fact that they are not immediately close to $1$ (as is the case for Geant4 vs. BIB-AE) is a further indication that the L $2$ LFlows showers are of higher quality.

To further test the relative quality of L $2$ LFlows vs. BIB-AE, we use the new Multi-Model Classifier Metric proposed in [48]. Instead of training separate binary classifiers between each generated model and the reference data, which can be constrained by limited amounts of the latter, we instead train a classifier (potentially multi-class) between the different generative models. This learns the probability that a shower came from each model. Then we evaluate this classifier on Geant4, BIB-AE and L $2$ LFlows showers, and see which model the classifier prefers. Note that while there are some limitations to the use of classifiers as an absolute metric discussed earlier, we expect the interpretation in a relative sense (as is done here) to be more straightforward.

For this test, we use $760$k showers sampled from the BIB-AE and L $2$ LFlows each. Just as for the Geant4 vs BIB-AE/L $2$ LFlows classifier, we make a $60\%:20\%:20\%$ split to obtain training, validation and test showers of the classifier. The AUC of the classifier on the test dataset (where we evaluate the checkpoint with highest validation accuracy) is $1.0000$, implying that a fully-connected classifier has no trouble distinguishing between BIB-AE and L $2$ LFlows showers. The architecture and hyperparameters of this BIB-AE vs L $2$ LFlows classifier are identical to those of the Geant4 vs BIB-AE/L $2$ LFlows classifiers, with the exception of the number of training epochs; see App. C for details.

For evaluation, we consider the test sets of the classifier containing $152$k showers for the BIB-AE and L $2$ LFlows each and use $152$k Geant4 test showers to compare the classifier outputs. The means of the output probabilities $p(\text{L2LFlows}\,|\,x)$ for $x$ coming from BIB-AE, L $2$ LFlows, and Geant4 are $0.03\%$, $99.91\%$, and $98.84\%$ respectively. This indicates that Geant4 and L $2$ LFlows-generated showers are much closer to each other than Geant4 and BIB-AE-generated showers are. To further visualize this result, we plot the predictions of the classifier on the test showers in Fig. 12. This also shows us that Geant4 showers are on average more likely to be identified as coming from L $2$ LFlows than BIB-AE by the classifier. All of this strengthens our conclusion that L $2$ LFlows captures the underlying shower distribution of Geant4 much better than the BIB-AE.
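The quantity quoted above, the mean output probability for one class over a set of showers, could be computed along the following lines. The sketch assumes a two-node classifier output given as logits, with column 0 standing for the BIB-AE and column 1 for L2LFlows; both the ordering and the function name are assumptions.

```python
import numpy as np

def mean_prob_l2lflows(logits):
    """Mean softmax probability of the second class for logits of shape (n, 2).

    Column 0 is taken to be BIB-AE and column 1 L2LFlows (an assumed ordering).
    """
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return float(probs[:, 1].mean())

undecided = mean_prob_l2lflows([[0.0, 0.0], [0.0, 0.0]])
prefers_l2l = mean_prob_l2lflows([[-50.0, 50.0]])
```

Evaluating the same function on the classifier outputs for the three shower sets gives one number per set, which is how the three percentages in the text are to be read.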

4.3 Shower Generation Timings

Table 3 shows the mean sampling time per shower for Geant4, the BIB-AE and L $2$ LFlows. For L $2$ LFlows, the sampling times of the energy distribution flow are not accounted for, as they are negligibly small compared to the mean shower generation times of the causal flows. For Geant4, the same number as in Ref. [12] is taken, as cropping the dataset from $30\times 30\times 30$ to $30\times 10\times 10$ was done once the Geant4 showers were simulated in the full ECal prototype [footnote 10]. We note that, in contrast to Geant4, shower generation times for L $2$ LFlows and the BIB-AE do not depend on the incident energy.

Footnote 10: Simulating the showers in $30\times 10\times 10$ would be unphysical, since this would not take into account backscattering, for example. Also, we do not expect a large difference between generation timings for showers simulated in a $30\times 30\times 30$ cube or a $30\times 10\times 10$ cuboid, since we focused on the core of the showers, where most energy depositions happen.

Since the generation times of a MAF scale with the dimensionality $d$ of the input samples, one can expect the sampling times for L $2$ LFlows to worsen by a factor of $9$ when going from the $30\times 10\times 10$ to the full $30\times 30\times 30$ data, while the Geant4 run time would stay the same [footnote 11]. The main bottleneck is not our autoregressive treatment of the ECal layers, but more the MAFs with which we model every single ECal layer.

Footnote 11: For the BIB-AE, the mean sampling times on the full dataset can be found in Ref. [12].

On the cropped dataset, L $2$ LFlows is up to a factor of $200$ slower than the BIB-AE (with a batch size of $1$ on the CPU), and in comparison to Geant4, L $2$ LFlows is only a factor of $3$ faster on the CPU (with a batch size of $1000$). On the GPU, L $2$ LFlows is about $470$ times faster than Geant4 (with a batch size of $128000$), whereas the BIB-AE can obtain a speedup of about $16000$ (with a batch size of $2000$). Reference [15] also observed mean sampling times that were much slower than their GAN baseline network from Ref. [6, 7], and to combat this, a MAF-IAF setup using probability density distillation, inspired by Ref. [52], was used. The obtained speedup was a factor of $\mathcal{O}(d)$, with a negligible loss in shower quality. Here, IAF refers to the inverse autoregressive flow [53], an alternative architecture for autoregressive flows that we detail in App. A. Applying the same MAF-IAF concept to this work is an interesting future research direction; if it works, a speedup of $\mathcal{O}(100)$ can be expected. This implies that L $2$ LFlows has the potential to outperform the BIB-AE not just in the fidelity, but also in the speed with which the generated showers are obtained.

5 Conclusions and Outlook

This work built on Ref. [14] and demonstrated for the first time that NFs can be used to generate high-fidelity showers in a highly-granular sampling calorimeter. Showers were generated in a two-step approach, where the energy distribution flow first learned the energy depositions per ECal layer. Then, $30$ NFs (one per layer), which we dubbed causal flows, were used to learn the voxel distributions, while being conditioned on the total deposited energy in that layer, the incident energy of the photons, and the voxel energies of the previous $5$ layers. The use of fully-connected embedding networks, which distill the conditioning features, cf. Tab. 1, further reduces the number of parameters with no loss in performance. It was found that for all considered distributions in Sec. 4.1, L $2$ LFlows either outperforms the BIB-AE or is as good as it, with the exception of the pairwise correlations, where the BIB-AE performs slightly better.

Further, L $2$ LFlows has a much better AUC than the state-of-the-art network on the dataset, the BIB-AE, in the classifier tests. The classifiers used in this work took as input both Geant4-simulated and neural network-generated showers as well as the incident energies of the photons: the BIB-AE yields an AUC of $0.9947\pm 0.0025$, whereas L $2$ LFlows leads to an AUC of $0.8518\pm 0.0042$. We also trained a classifier directly on BIB-AE and L $2$ LFlows showers. As shown, when taking Geant4 showers as input, such a classifier is much more likely to label them as L $2$ LFlows showers than as BIB-AE showers, further indicating the superior quality of the proposed approach. It was further shown that L $2$ LFlows outperforms the BIB-AE in almost every considered physics distribution.

The two-step approach, which was first introduced in Ref. [14], can also be applied to other generative networks. For example, adding an energy distribution flow to the BIB-AE approach may also improve the fidelity of the generated showers there.

One bottleneck of the developed approach, however, is the required sampling time per shower. The underlying problem is that a MAF is slow during generation, as it calculates the output dimensions of a sample sequentially, one at a time. If the MAF-IAF approach from Ref. [15] also succeeds in training the much faster IAF for L $2$ LFlows, then a potential speedup factor of $\mathcal{O}(100)$ could be obtained. This would result in an NF architecture that could be used to sample faster than the BIB-AE.

Although the dataset has a uniform number of voxels in each layer, L $2$ LFlows could straightforwardly generalize to non-uniform cases. In addition, splitting the learning of the shower shape into several NFs also has the advantage that the training can be parallelized on several GPUs. However, one pays a price that, even when employing an IAF setup, $30$ individual NF evaluations are required. This limits the speedup of the proposed IAF setup to $\mathcal{O}(100)$, as opposed to the $\mathcal{O}(3000)$ speedup that could be achieved by a single-flow approach.
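The two orders of magnitude quoted here follow from counting sequential network passes. The sketch below is a simplified accounting under the assumption that a MAF needs one pass per output dimension and an IAF one pass per flow, ignoring the number of MADE blocks per flow and any overhead; the function name is hypothetical.

```python
def sampling_passes(n_flows, dims_per_flow):
    """Sequential passes needed to sample one shower with MAFs and with IAFs.

    Assumes one pass per output dimension for a MAF and one pass per flow for
    an IAF; the flows themselves are evaluated one after another.
    """
    maf_passes = n_flows * dims_per_flow
    iaf_passes = n_flows
    return maf_passes, iaf_passes

# 30 causal flows with 100 voxels each vs. one hypothetical 3000-dimensional flow
maf_layered, iaf_layered = sampling_passes(n_flows=30, dims_per_flow=100)
maf_single, iaf_single = sampling_passes(n_flows=1, dims_per_flow=3000)
speedup_layered = maf_layered / iaf_layered
speedup_single = maf_single / iaf_single
```

Both setups start from $3000$ sequential passes with MAFs, but the layered setup still needs $30$ passes with IAFs, whereas a single flow would need only one.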

Further, we believe our NF architecture to have applications beyond the use in high-energy physics. L $2$ LFlows could in principle also be studied for image or video generation. For example, Ref. [54] uses an NF to generate high-fidelity images, yet for large images, a batch size of $1$ was used during training. To mitigate these memory constraints, it might be possible to use not only a single NF, but several of them, where each NF sees only a subset of the pixels, yet is conditioned on the previous pixels. The NFs could learn the full image in a top-bottom approach, where the first NF learns the first set of pixels, the second NF learns the second set while being conditioned on the first set, and so on. Image or video generation would then happen sequentially. To the best of our knowledge, such an approach has not been considered in the literature yet, and since each NF can be trained separately, a higher batch size can be chosen. This proposed approach is very similar to autoregressive models such as PixelRNN [55], PixelCNN [56], PixelCNN++ [57] or PixelSNAIL [58], but instead of generating the image pixel by pixel, chunks of pixels would be generated at once.

As a proof of concept that NFs also scale to higher-dimensional datasets, this work cut down the $30\times 30\times 30$ projection to a $30\times 10\times 10$ projection. An extension of this work to the full projection is believed to be straightforward, as every NF would then have to learn a $900$ - instead of a $100$ -dimensional PDF, which should be feasible to tackle computationally with L $2$ LFlows. In addition, this work can be extended by not just studying photon showers in the ILD ECal, but also pion showers in an HCal prototype for the ILD, which was done for the BIB-AE in Ref. [18], where the HCal was projected to a cuboid of size $48\times 25\times 25$ . With L $2$ LFlows, this would result in $48$ NFs, where each NF has to learn a $625$ -dimensional distribution.

It is also important to perform angular conditioning studies in the future, as the photons in the dataset used in this work were shot perpendicularly into the ECal. And just as Ref. [18] considered the output of state-of-the-art reconstruction algorithms on the neural network-generated showers, it would be interesting to do the same once L $2$ LFlows has been extended to the full $30\times 30\times 30$ dataset of the ILD ECal prototype. Last but not least, L $2$ LFlows can be studied for the three different datasets from the CaloChallenge [36].

Acknowledgments

The authors would like to thank Katja Krüger for valuable feedback on the draft of this paper. SD is funded by the Deutsche Forschungsgemeinschaft under Germany’s Excellence Strategy – EXC 2121 Quantum Universe – 390833306. The work of CK and DS was supported by DOE grant DOE-SC0010008. This research was supported in part through the Maxwell computational resources operated at Deutsches Elektronen-Synchrotron DESY, Hamburg, Germany. CK would like to thank the Baden-Württemberg-Stiftung for support through the program Internationale Spitzenforschung, project Uncertainties — Teaching AI its Limits (BWST_IF2020-010).

Appendix A Model Details

L $2$ LFlows uses MAFs [41] as generative models. A MAF is a bijective function, connecting the data distribution on one side with a Gaussian latent distribution on the other side. The transformation is given by a set of rational quadratic splines [42, 43] for each dimension. The parameters of the splines (like location of bin edges or derivative values at the knots) are given by an autoregressive neural network, ensuring that the parameters of the transformation of $x_{i}$ only depend on $x_{<i}$. In practice, such an autoregressive structure can be realized by masking the network weights with zeros in appropriate places, as done in the MADE block [44]. A single pass through the network will then give the full set of autoregressive parameters needed for the entire transformation. The parameters for the inverse transformation are, however, harder to obtain. Passing random input through the MADE block will only give the correct parameters of the transformation of the first coordinate, since these do not depend on any other $x_{i}$. Using these parameters to correctly invert $x_{1}$, we can get the correct parameters to invert $x_{2}$ after a second pass through the MADE block. In total we therefore need $d$ passes through the MADE block to fully construct the inverse transformation. The MAF now uses the fast pass to compute the log-likelihood, allowing for a fast training at the price of slow sampling. The IAF would allow for faster sampling, but only at the expense of much slower (or even impossible because of memory constraints) training. More details about MAFs and IAFs and their use for calorimeter simulations can be found in [15].
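The asymmetry between the two directions can be seen in a toy autoregressive flow. To keep the example short, it swaps the rational quadratic splines for a simple affine transformation and the MADE block for a strictly lower-triangular linear map; all names are hypothetical and the sketch is not the architecture used in this work.

```python
import numpy as np

def conditioner(x, w_shift, w_scale):
    """Autoregressive parameters: entry i depends only on x[:i], enforced by
    masking the weights to be strictly lower triangular."""
    shift = np.tril(w_shift, k=-1) @ x
    log_scale = np.tanh(np.tril(w_scale, k=-1) @ x)
    return shift, log_scale

def forward(x, w_shift, w_scale):
    """Data -> latent in a single conditioner pass (used for the likelihood)."""
    shift, log_scale = conditioner(x, w_shift, w_scale)
    z = (x - shift) * np.exp(-log_scale)
    return z, -np.sum(log_scale)  # latent point and log|det Jacobian|

def inverse(z, w_shift, w_scale):
    """Latent -> data: one conditioner pass per dimension (used for sampling)."""
    d = z.shape[0]
    x = np.zeros(d)
    n_passes = 0
    for i in range(d):
        shift, log_scale = conditioner(x, w_shift, w_scale)  # x[:i] is final here
        n_passes += 1
        x[i] = z[i] * np.exp(log_scale[i]) + shift[i]
    return x, n_passes

rng = np.random.default_rng(1)
d = 5
w_shift, w_scale = rng.normal(size=(d, d)), rng.normal(size=(d, d))
x = rng.normal(size=d)
z, log_det = forward(x, w_shift, w_scale)
x_rec, n_passes = inverse(z, w_shift, w_scale)
```

The forward direction calls the conditioner once, while the inverse calls it $d$ times; an IAF swaps which of the two directions is the cheap one.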

The details of the energy distribution flow are summarized in the left column of Tab. 4. The MADE blocks make use of fully-connected layers, and their total number per MADE block is given by the input layer, the number of hidden layers and the output layer. The hidden layers inside the MADE blocks make use of the ReLU activation function. The minimum bin width and height refer to the minimum values of an RQS bin, and the minimum derivative to the minimum derivative value at the knots of a bin; and the permutation hyperparameter refers to the permutation of the input features before they are passed to the MADE block. Since the transformations parameterized by each MADE block are autoregressive in nature, the permutation layers help to increase the expressivity of the normalizing flow. Here we differentiate between “random”, which denotes a randomly determined permutation, and “reverse”, where the permutation inverts the ordering of the data dimensions. During generation, double precision parameters are used, as we found that using only float precision parameters leads to numerical instabilities, which arose when the RQS solved a quadratic equation for the inverse [43, 42]. During training, however, there is almost no advantage of using double precision, as we checked in an ablation study. Since up to $50\%$ of the memory can be saved that way, we use single float precision during training.

The hyperparameters of the causal flows can be found in the right column of Tab. 4. Just as for the energy distribution flow, an ablation study showed barely any advantage of double precision parameters for training, hence we use only float precision (generation still happens with double precision parameters).

Appendix B Pre- and Postprocessing

B.1 Preprocessing energy distribution flow

The training input of the energy distribution flow consists of the energies per layer $E_{i}$. These energies are preprocessed before they are passed to the energy distribution flow. In the first step, the layer energies $E_{i}$ are smeared using an additive Gaussian noise term, with mean $\mu=1$ keV and standard deviation $\sigma=0.2$ keV [footnote 12]. This helps the energy distribution flow learn the marginal distributions of $E_{i}$ and results in a noticeable performance increase compared to training without the added noise.

Footnote 12: We clip all negative values to $0$.

The smeared energies per layer $E_{i}$ are then further processed using

[Equation (B.1)]

where the smeared $E_{i}$ is given in units of GeV and $\ell_{\alpha}$ describes a modified logit function [41]:

[Equation (B.2)]

with the scalable hyperparameter $\alpha$, chosen to be $\alpha=10^{-6}$. Note that the logit transformation is only defined for $x\in[0,1]$; however, since the energies per calorimeter layer in the ILD ECal do not exceed $1$ GeV in the photon shower dataset, this requirement is inherently fulfilled.
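For orientation, a modified logit of the kind commonly used in flow preprocessing first maps $x\in[0,1]$ to $u=\alpha+(1-2\alpha)x$ and then applies $\log\left(u/(1-u)\right)$. The sketch below assumes this form, which may differ in detail from the definition used in this work; only the value $\alpha=10^{-6}$ is taken from the text.

```python
import numpy as np

ALPHA = 1e-6  # value quoted in the text

def modified_logit(x, alpha=ALPHA):
    """logit(alpha + (1 - 2*alpha) * x), finite for every x in [0, 1] (assumed form)."""
    u = alpha + (1.0 - 2.0 * alpha) * np.asarray(x, dtype=float)
    return np.log(u) - np.log1p(-u)

def inverse_modified_logit(y, alpha=ALPHA):
    """Undo `modified_logit`: sigmoid followed by the inverse affine map."""
    u = 1.0 / (1.0 + np.exp(-np.asarray(y, dtype=float)))
    return (u - alpha) / (1.0 - 2.0 * alpha)

x = np.array([0.0, 0.25, 0.5, 1.0])
y = modified_logit(x)
x_back = inverse_modified_logit(y)
```

The offset $\alpha$ keeps the transformed values finite at the boundaries $x=0$ and $x=1$, where a plain logit would diverge.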

The energy distribution flow is conditioned on the energy of the incident particle $E_{\text{inc}}$ , which is processed according to

[Equation (B.3)]

where $E_{\text{inc}}$ is given in units of GeV.

B.2 Preprocessing causal flows

The inputs to the individual NFs of the causal flows consist of the energy depositions in the voxels of a layer, $\mathcal{I}_{j}$ , with $j\in[0,29]$ , and $\mathcal{I}_{j}\in\mathbb{R}^{100}$ . Before being passed to the NFs, the inputs are preprocessed. The first step consists of adding noise (sampled uniformly from $[0,1]^{100}$ keV) to each $\mathcal{I}_{j}$ . The resulting values are then normalized and logit-transformed in accordance with

[Equation (B.4)]

where max denotes the maximum voxel energy taken over all training data and $\ell_{\alpha}$ is the logit transformation defined in Eq. (B.2).

As described in Sec. 3.2, NF $i$ is conditioned on the voxel energies of the previous $5$ layers, as well as on the energy in their respective layer $E_{i}$ and the energy of the incident particle $E_{\text{inc}}$ . The preprocessing of $E_{\text{inc}}$ is identical to what is given by Eq. (B.3). During training, the layer energies $E_{i}$ are derived from the voxel energies in the $i$ -th layer $\mathcal{I}_{i}$ using

[Equation (B.5)]

where $\mathcal{I}_{i}^{\text{cut}}$ are the voxel energies after the threshold cut, defined by

[Equation (B.6)]

and the noise in Eq. (B.5) refers to the Gaussian noise added during training, cf. Tab. 4.

The $E_{i}$ are then further processed according to

[Equation (B.7)]

with $\epsilon=10^{-6}$ . During generation, the $E_{i}$ come from the energy distribution flow and they are also processed according to Eq. (B.7).

B.3 Postprocessing causal flows

The postprocessing used in Ref. [14] ensures via a renormalization of the $\mathcal{I}_{i}$ that the energies per layer of the returned showers are approximately those that the energy distribution flow dictates. Usually, one is interested in showers that have an energy threshold applied, since the inherent electronic noise of the detector implies that too small energies cannot be converted into a signal that can be read out. This thresholding reduces the deposited energy per layer below the energy dictated by the energy distribution flow. However, the study of Ref. [14] is based on the dataset from Ref. [7], where the energies both from the passive absorber layers and the active detector layers are assumed to be available, and applying an energy threshold on the generated showers barely made a difference in the energies per layer.

This work uses a realistic sampling calorimeter, where the energies from the passive absorber layers are unavailable. For this reason, the voxel depositions are much smaller, and the cut makes a non-negligible difference. Hence, a new postprocessing was needed for this work. In a nutshell, the new postprocessing checks how many of the dimmest voxels need to be set to zero such that the renormalized remaining voxels are all above the threshold.

We define $S(\mathcal{I}_{i}):=\sum_{i}\mathcal{I}_{i}$ , the sum over voxels $\mathcal{I}_{i}$ in layer $i$ and $\mathcal{I}_{i,\geq t}$ as the voxels $\mathcal{I}_{i}$ thresholded by $t$ , i.e. all voxels less than $t$ in layer $i$ are set to zero. The postprocessed voxels are then given by

[Equation (B.8)]

where $t$ is set by requiring

[Equation (B.9)]

As a result, all generated voxel energies of layer $i$ sum to $E_{i}$ after the threshold cut is applied. Figure 13 illustrates the effect of the postprocessing. There, we see the distribution of energies in layers $2$, $8$, $14$, and $26$ as given by different algorithms: in light green, we see the distributions as given by the energy distribution flow; in orange, the distributions of raw, generated showers without a threshold cut (which only scatter around the $E_{i}$ of the energy distribution flow, even though the showers were conditioned on it); in red, the distributions of raw showers after the CaloFlow postprocessing (renormalization and subsequent threshold cut); and in blue, the distributions of the same showers using our postprocessing. We clearly see that a simple application of the threshold cut after the renormalization distorts the distribution towards lower energies. This effect is stronger in the outer layers of the calorimeter, where the overall scale of energy depositions is smaller.
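One possible reading of the verbal description of the postprocessing (drop the dimmest voxels until the renormalized remainder lies above the threshold) is sketched below for a single layer. It is an illustration built from the text only, not a transcription of the equations of this appendix; the function names and the toy numbers are hypothetical.

```python
import numpy as np

def postprocess_layer(voxels, e_layer, threshold):
    """Zero the dimmest voxels until the rest, rescaled to sum to `e_layer`,
    are all at or above `threshold`. Returns zeros if no such choice exists."""
    voxels = np.clip(np.asarray(voxels, dtype=float), 0.0, None)
    out = np.zeros_like(voxels)
    order = np.argsort(voxels)  # dimmest voxels first
    for n_drop in range(voxels.size):
        keep = order[n_drop:]
        total = voxels[keep].sum()
        if total <= 0.0:
            break
        scaled = voxels[keep] * (e_layer / total)
        if scaled.min() >= threshold:
            out[keep] = scaled
            break
    return out

def renormalize_then_cut(voxels, e_layer, threshold):
    """Simple alternative: rescale to `e_layer` first, then apply the cut."""
    voxels = np.asarray(voxels, dtype=float)
    scaled = voxels * (e_layer / voxels.sum())
    return np.where(scaled >= threshold, scaled, 0.0)

raw = [0.05, 0.1, 1.0, 2.0]
ours = postprocess_layer(raw, e_layer=3.0, threshold=0.1)
simple = renormalize_then_cut(raw, e_layer=3.0, threshold=0.1)
```

In this toy case `ours` sums to the requested layer energy, while `simple` loses the energy of the voxels removed by the cut, which is the shift towards lower energies described above.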

Appendix C Classifier Tests: Architectures and Details

Both the Geant4 vs BIB-AE as well as the Geant4 vs L $2$ LFlows classifiers are fully-connected neural networks with the same architecture and hyperparameters: They consist of four hidden layers with $4096$, $512$, $64$ and $8$ nodes with the LeakyReLU activation function, using a slope of $0.01$ for input that is smaller than $0$. The output layer has $2$ nodes, and we use the cross entropy loss. Each output node can be interpreted as the likelihood of a given sample belonging to Geant4 or the BIB-AE/L $2$ LFlows. Therefore, the likelihood ratio is available in the binary classification setup. In total, every classifier has $14.4$M parameters. The classifier is trained on input with double precision. The learning rate is set to $10^{-4}$, the batch size to $256$, and both classifiers are trained for $50$ epochs in total. The final model is the checkpoint with the highest validation accuracy.

Unlike Ref. [14], which considers the ECal voxel energies, (log-transformed) deposited energies per ECal layer as well as the (log-transformed) incident energy as input to the classifiers, we believe the use of the deposited energies per layer to be redundant, as they are already encoded in the ECal voxel energies. Thus, we make use of $3001$ input features, where the incident energies are log-transformed as in Eq. (B.3). The voxel energies have half the MIP cutoff applied.
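The quoted parameter count can be cross-checked from the layer widths. The sketch assumes plain dense layers with one bias per output node; the helper name is hypothetical.

```python
def count_dense_parameters(layer_sizes):
    """Weights plus biases of a fully-connected network with the given widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 3001 input features, four hidden layers, and 2 output nodes, as described above
n_params = count_dense_parameters([3001, 4096, 512, 64, 8, 2])
n_params_millions = round(n_params / 1e6, 1)
```

Under this assumption the total comes out at roughly $14.4$ million, dominated by the first layer that connects the $3001$ inputs to $4096$ hidden nodes.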

The only difference between the Geant4 vs BIB-AE/L $2$ LFlows and BIB-AE vs L $2$ LFlows classifier is that for convergence reasons of the validation accuracy, the latter is trained for $100$ epochs.

Bibliography

[1] S. Agostinelli, J. Allison, K. Amako, J. Apostolakis, H. Araujo, P. Arce et al., Geant4 - a simulation toolkit, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 506 (2003) 250.
[2] J. Allison, K. Amako, J. Apostolakis, H. Araujo, P. Arce Dubois, M. Asai et al., Geant4 developments and applications, IEEE Transactions on Nuclear Science 53 (2006) 270.
[3] J. Allison, K. Amako, J. Apostolakis, P. Arce, M. Asai, T. Aso et al., Recent developments in Geant4, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 835 (2016) 186.
[4] ATLAS Collaboration, ATLAS Software and Computing HL-LHC Roadmap, Tech. Rep., CERN, Geneva (2022).
[5] CMS Offline Software and Computing, CMS Phase-2 Computing Model: Update Document, Tech. Rep., CERN, Geneva (2022).
[6] M. Paganini, L. de Oliveira and B. Nachman, Accelerating Science with Generative Adversarial Networks: An Application to 3D Particle Showers in Multilayer Calorimeters, Phys. Rev. Lett. 120 (2018) 042003 [1705.02355].
[7] M. Paganini, L. de Oliveira and B. Nachman, CaloGAN: Simulating 3D high energy particle showers in multilayer electromagnetic calorimeters with generative adversarial networks, Physical Review D 97 (2018) [1712.10321].
[8] L. de Oliveira, M. Paganini and B. Nachman, Controlling Physical Attributes in GAN-Accelerated Simulation of Electromagnetic Calorimeters, J. Phys. Conf. Ser. 1085 (2018) 042017 [1711.08813].
