TL;DR
This paper introduces a new metric based on earth mover's distance to quantify the similarity between collider events, enabling more flexible and data-driven analysis of collider data without predefined observables.
Contribution
It develops a novel metric for collider events using earth mover's distance, connecting it to infrared-safe observables and facilitating model-independent data analysis.
Findings
The metric quantifies the similarity between collider events.
It bounds the differences between events in infrared- and collinear-safe observables.
It enables visualization and analysis of collider data without a prior choice of observables.
Abstract
When are two collider events similar? Despite the simplicity and generality of this question, there is no established notion of the distance between two events. To address this question, we develop a metric for the space of collider events based on the earth mover's distance: the "work" required to rearrange the radiation pattern of one event into another. We expose interesting connections between this metric and the structure of infrared- and collinear-safe observables, providing a novel technique to quantify event modifications due to hadronization, pileup, and detector effects. We showcase how this metrization unlocks powerful new tools for analyzing and visualizing collider data without relying upon a choice of observables. More broadly, this framework paves the way for data-driven collider phenomenology without specialized observables or machine learning models.
The Metric Space of Collider Events
Patrick T. Komiske
Eric M. Metodiev
Jesse Thaler
Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Department of Physics, Harvard University, Cambridge, MA 02138, USA
Preprint: MIT-CTP 5102
High-energy particle collisions produce a tremendous number of intricately correlated particles, especially when energetic quarks and gluons are involved. Behind this apparent complexity, however, the overall flow of energy in an event is a robust memory of its simpler partonic origins Sterman and Weinberg (1977); Georgi and Machacek (1977); Donoghue et al. (1979); Altarelli (1982); Dokshitzer et al. (1991); Tkachov (1997); Sveshnikov and Tkachov (1996); Hofman and Maldacena (2008). Surprisingly, no definition of the similarity between events presently exists that sharply captures this correspondence. In the absence of a metric, efforts typically fall back upon ad hoc methods such as comparing specific observables Cacciari and Salam (2008); Cacciari et al. (2015); Bertolini et al. (2014); Arjona Martínez et al. (2018); Komiske et al. (2017) or matching the pixels of calorimeter images Komiske et al. (2017); Cogan et al. (2015); de Oliveira et al. (2016); Paganini et al. (2018a, b). These approaches suffer from significant pathologies: disparate event topologies can give rise to identical observable values, while pixels lack stability under small perturbations. A theoretically and experimentally robust definition of the “distance” between events would profoundly expand our ability to explore the structure of collider data and unlock entirely new ways to probe events.
In this letter, we advocate for the earth (or energy) mover’s distance (EMD) Peleg et al. (1989); Rubner et al. (1998, 2000); Pele and Werman (2008); Pele and Taskar (2013) as a metric for the space of collider events. We propose a variant of the EMD, inspired by Refs. Pele and Werman (2008); Pele and Taskar (2013), that allows events with different total energies to be sensibly compared. The EMD is the minimum “work” required to rearrange one event $\mathcal{E}$ into the other $\mathcal{E}'$ by movements of energy $f_{ij}$ from particle $i$ in one event to particle $j$ in the other:
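$$\mathrm{EMD}(\mathcal{E}, \mathcal{E}') = \min_{\{f_{ij}\}} \sum_{ij} f_{ij} \frac{\theta_{ij}}{R} + \Big| \sum_i E_i - \sum_j E'_j \Big|, \tag{1}$$
$$f_{ij} \geq 0, \qquad \sum_j f_{ij} \leq E_i, \qquad \sum_i f_{ij} \leq E'_j, \qquad \sum_{ij} f_{ij} = E_{\min},$$
where $i$ and $j$ index particles in events $\mathcal{E}$ and $\mathcal{E}'$, respectively, $E_i$ is the particle energy, $\theta_{ij}$ is an angular distance between particles, and $E_{\min} = \min\big(\sum_i E_i, \sum_j E'_j\big)$ is the smaller of the two total energies. $R$ is a parameter that controls the relative importance of the two terms. While energies and angles are used here for clarity, we will use transverse momenta and rapidity-azimuth distances for our applications relevant for the Large Hadron Collider (LHC).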
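As a quick sanity check of the definition, consider the simplest possible case: two events that each contain a single particle. The worked example below is only a sketch; the symbols $E$ and $E'$ (the two particle energies) and $\theta$ (their angular separation) are introduced here for illustration, and $R$ is the parameter that weighs the energy-movement term against the energy-difference term.

```latex
% Two one-particle events with energies E and E', separated by an angle \theta.
% There is a single flow variable f_{11}, and the constraints force
% f_{11} = E_{\min} = \min(E, E').
\begin{equation}
  \mathrm{EMD}(\mathcal{E}, \mathcal{E}')
    = \min(E, E')\,\frac{\theta}{R} + |E - E'| .
\end{equation}
% Limiting cases: \theta = 0 leaves only the energy mismatch |E - E'|,
% while E = E' leaves only the transport cost E\,\theta / R.
```

The two limits show the role of each term separately: collinear particles differ only through their energies, and equal-energy particles differ only through the work needed to move one onto the other.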
The EMD that we propose in Eq. (1) has dimensions of energy, where the first term quantifies the difference between the two radiation patterns and the second term accounts for the creation or destruction of energy. It is a true metric (satisfying the triangle inequality) as long as $\theta_{ij}$ is a metric and $R \geq \theta_{\max}/2$, where $\theta_{\max}$ is the maximum attainable angular distance between particles. For instance, $R$ must be at least the jet radius for conical jets. Formally, the EMD metrizes the energy flow, as it treats events differing only by soft particles or collinear splittings identically. This hints at a deep connection to infrared and collinear (IRC) safety of observables Kinoshita (1962); Lee and Nauenberg (1964); Brock et al. (1995); Weinberg (2005), which we explore further below.
A metric for comparing events is particularly relevant for probing the substructure of jets Seymour (1991, 1994); Butterworth et al. (2002, 2007, 2008); Abdesselam et al. (2011); Altheimer et al. (2012, 2014); Adams et al. (2015); Larkoski et al. (2017); Asquith et al. (2018), collimated sprays of particles resulting from the fragmentation and hadronization of high-energy quarks and gluons via quantum chromodynamics (QCD). Here, we will consider three classes of jets which have different intrinsic topologies: three-pronged boosted top quark jets, two-pronged boosted boson jets, and single-pronged QCD (quark or gluon) jets. We generate proton-proton collision events at the LHC with Pythia 8.235 Sjöstrand et al. (2015) at TeV including hadronization and multiple particle interactions. Anti-$k_T$ jets Cacciari et al. (2008) with a jet radius of are clustered using FastJet 3.3.1 Cacciari et al. (2012), and up to two jets with GeV and are kept. This selection is representative of an intermediate energy range for jets at the LHC and allows for sensitivity to the effects of both terms in Eq. (1). Jets are longitudinally boosted and rotated to center the jet four-momentum at the origin of the rapidity-azimuth plane as well as to vertically align the principal component of the constituent transverse momentum flow in that plane; this removes the dependence of the EMD on these jet isometries.
We record the final-state hadrons, as well as the partons (before hadronization) and the hard boson/top decay products, that are within a jet radius of the jet four-momentum. We use the Python Optimal Transport Flamary and Courty (2017) library to compute EMDs with the minimal choice of $R$, the jet radius. The energy difference penalty in Eq. (1) is implemented using a fictitious particle at a distance $R$ from all other particles. Fig. 1 shows the optimal energy movement between two example top jets.
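The fictitious-particle construction can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: it does not call the Python Optimal Transport library, and instead uses a small pure-Python successive-shortest-path solver so that it runs without dependencies. The function names `transport_cost` and `emd`, and the `(weight, y, phi)` tuple layout, are hypothetical. The lighter event receives one extra particle that absorbs the energy difference at a cost of one unit per unit energy, which corresponds to a distance $R$ after dividing by $R$.

```python
import math

def transport_cost(supply, demand, cost, eps=1e-12):
    """Minimum cost of a balanced transport problem (sum(supply) == sum(demand)),
    solved by successive shortest augmenting paths. Intended for tiny inputs only."""
    n, m = len(supply), len(demand)
    supply, demand = list(supply), list(demand)
    flow = [[0.0] * m for _ in range(n)]
    total = 0.0
    while max(supply) > eps and max(demand) > eps:
        # Bellman-Ford on the residual graph: nodes 0..n-1 are sources, n..n+m-1 sinks.
        dist = [0.0 if s > eps else math.inf for s in supply] + [math.inf] * m
        prev = [None] * (n + m)
        for _ in range(n + m):
            changed = False
            for i in range(n):
                for j in range(m):
                    if dist[i] + cost[i][j] < dist[n + j] - eps:  # move energy i -> j
                        dist[n + j], prev[n + j], changed = dist[i] + cost[i][j], i, True
                    if flow[i][j] > eps and dist[n + j] - cost[i][j] < dist[i] - eps:  # undo i -> j
                        dist[i], prev[i], changed = dist[n + j] - cost[i][j], n + j, True
            if not changed:
                break
        # Cheapest sink that still has unmet demand, then trace the path back to a source.
        t = min((j for j in range(m) if demand[j] > eps), key=lambda j: dist[n + j])
        path = [n + t]
        while prev[path[-1]] is not None:
            path.append(prev[path[-1]])
        path.reverse()  # alternates source, sink, source, ..., sink
        s = path[0]
        amount = min(supply[s], demand[t])
        for a, b in zip(path, path[1:]):
            if a >= n:  # residual edge sink a -> source b undoes flow[b][a - n]
                amount = min(amount, flow[b][a - n])
        for a, b in zip(path, path[1:]):
            if a < n:
                flow[a][b - n] += amount
            else:
                flow[b][a - n] -= amount
        supply[s] -= amount
        demand[t] -= amount
        total += amount * dist[n + t]
    return total

def emd(event_a, event_b, R=1.0):
    """EMD between two events given as lists of (weight, y, phi) tuples.
    Plain Euclidean distance in (y, phi) is assumed, with no azimuthal wrap-around."""
    wa = [p[0] for p in event_a]
    wb = [p[0] for p in event_b]
    cost = [[math.hypot(a[1] - b[1], a[2] - b[2]) / R for b in event_b] for a in event_a]
    diff = sum(wa) - sum(wb)
    if diff > 0:    # event_b is lighter: its fictitious particle absorbs the excess
        wb.append(diff)
        for row in cost:
            row.append(1.0)  # distance R divided by R
    elif diff < 0:  # event_a is lighter
        wa.append(-diff)
        cost.append([1.0] * len(wb))
    return transport_cost(wa, wb, cost)
```

For example, `emd([(10.0, 0.0, 0.0)], [(6.0, 0.3, 0.4)])` moves 6 units of energy over a distance 0.5 and pays 4 units for the energy mismatch. For realistic jet multiplicities a dedicated optimal transport solver is the sensible choice; the solver above is kept only to make the sketch self-contained.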
We begin by highlighting a remarkable mathematical property of the EMD which provides a quantitative understanding of an observable’s sensitivity to the radiation pattern. Specifically, we relate the EMD to additive IRC-safe observables via the Kantorovich-Rubinstein Kantorovich and Rubinstein (1958) duality theorem. Applying this theorem to our variant of the EMD, we derive the following mathematical bound between two events $\mathcal{E}$ and $\mathcal{E}'$:
$$\frac{1}{R L} \Big| \sum_i E_i f(\hat{n}_i) - \sum_j E'_j f(\hat{n}'_j) \Big| \leq \mathrm{EMD}(\mathcal{E}, \mathcal{E}'), \tag{2}$$
where $i$ and $j$ index the particles of $\mathcal{E}$ and $\mathcal{E}'$, respectively, $\hat{n}$ is the particle angular position, and $f$ is any $L$-Lipschitz function (essentially, with gradient size bounded by $L$) which vanishes at the center of the space (e.g. the jet axis). The implications of Eq. (2) are simple yet profound: the similarity of events according to the EMD metric guarantees the closeness of their observable values in a precise way that depends on $L$. By formulating IRC-safe observables in the language of additive energy-weighted structures Komiske et al. (2018, 2019), Eq. (2) can be applied to provide a robust bound.
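One elementary way to see why a bound of this form holds, without invoking the duality theorem, is to split the observable difference along the optimal energy flow. The derivation below is a sketch, not the authors' proof. It assumes that every particle lies within an angular distance $R$ of the center of the space, so that the $L$-Lipschitz function $f$ obeys $|f| \leq L R$ there. Here $f^*_{ij}$ denotes the optimal flow, $\theta_{ij}$ the angular distance between particle $i$ of the first event and particle $j$ of the second, and the first event is taken to have the larger total energy.

```latex
% With the first event heavier, the optimal flow saturates the second event:
% \sum_i f^*_{ij} = E'_j for every j.
\begin{align}
\sum_i E_i f(\hat n_i) - \sum_j E'_j f(\hat n'_j)
  &= \sum_{ij} f^*_{ij}\,\big[f(\hat n_i) - f(\hat n'_j)\big]
   + \sum_i \Big(E_i - \sum_j f^*_{ij}\Big) f(\hat n_i), \\
\Big|\sum_{ij} f^*_{ij}\,\big[f(\hat n_i) - f(\hat n'_j)\big]\Big|
  &\le L \sum_{ij} f^*_{ij}\,\theta_{ij}, \\
\Big|\sum_i \Big(E_i - \sum_j f^*_{ij}\Big) f(\hat n_i)\Big|
  &\le L R\,\Big|\sum_i E_i - \sum_j E'_j\Big| .
\end{align}
% Adding the last two lines gives R L times the sum of the transport term and the
% energy-difference term, i.e. R L \cdot \mathrm{EMD}(\mathcal{E}, \mathcal{E}').
```

The first inequality uses the Lipschitz property pair by pair; the second uses the fact that the unmoved energy is non-negative particle by particle and sums to the total energy difference.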
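As a concrete example, we demonstrate how the EMD bounds hadronization modifications of jet angularities $\lambda^{(\beta)} = \sum_i E_i \theta_i^\beta$ Larkoski et al. (2014a) (see also Refs. Berger et al. (2003); Almeida et al. (2009); Ellis et al. (2010); Larkoski et al. (2014b)), where $\theta_i$ is the rapidity-azimuth distance to the jet axis. These angularities are evidently of the form in Eq. (2) with $f(\hat{n}) = \theta^\beta$, which for $\beta \geq 1$ is a $\beta$-Lipschitz function over our jet cone, hence:
$$\big| \lambda^{(\beta)}(\mathcal{E}) - \lambda^{(\beta)}(\mathcal{E}') \big| \leq \beta \, \mathrm{EMD}(\mathcal{E}, \mathcal{E}'). \tag{3}$$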
The EMD between two events yields a robust upper bound of the difference in their angularity values. This bound is borne out in Fig. 2, where the angularity differences and EMDs are computed for the same QCD jets before and after hadronization. For this jet $p_T$ range, hadronization modifies events by EMD GeV and correspondingly modifies $\lambda^{(\beta = 1)}$ by no more than this amount. The intuitive picture of parton-hadron duality Dokshitzer et al. (1991), that the energy flow in an event is robust to nonperturbative effects, is quantified by considering the EMD that these nonperturbative effects can induce.
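The Lipschitz constant used for the angularities can be checked in one line. The sketch below assumes that angular distances are measured in units where the jet cone has radius at most one; this normalization is an assumption made here because it is what makes the constant equal to $\beta$. The symbols $\theta_1$ and $\theta_2$ are the distances of two points from the jet axis and $\theta_{12}$ is the distance between them.

```latex
% For 0 \le \theta_1, \theta_2 \le 1 and \beta \ge 1, the mean value theorem gives
\begin{equation}
\big|\theta_1^{\beta} - \theta_2^{\beta}\big|
  \le \beta\,\max(\theta_1, \theta_2)^{\beta - 1}\,|\theta_1 - \theta_2|
  \le \beta\,|\theta_1 - \theta_2|
  \le \beta\,\theta_{12} .
\end{equation}
% The last step is the reverse triangle inequality applied to the jet axis
% and the two points.
```

For $\beta = 1$ the constant is one, so under this normalization a shift in the angularity can be no larger than the EMD between the two events.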
A metric space is also useful for classification without requiring specially designed observables or parametrized machine learning algorithms. One of the simplest examples of a non-parametric classifier is the $k$-nearest neighbor (kNN) algorithm Cover and Hart (1967), whereby a given event’s $k$ closest neighbors in a reference set are used to determine class membership. We build a kNN classifier applied to the problem of discriminating boson jets from QCD jets using a balanced training sample of 100k total jets. The classifier output is the number of boson jets among the $k$ nearest neighbors by EMD. This method should approach the optimal IRC-safe classifier with a sufficiently large dataset. The performance of the resulting EMD kNN classifier is shown in Fig. 3 as a receiver operating characteristic (ROC) curve, with the Area Under the ROC Curve (AUC) also shown. For comparison, we include an Energy Flow Network (EFN) and a Particle Flow Network (PFN) Komiske et al. (2019) as well as a linear classifier trained on Energy Flow Polynomials (EFPs) Komiske et al. (2018). All classifiers are trained on a 100k training sample and evaluated on a 20k test sample, with the neural networks using 20% of the training sample for validation and a batch size of 125 (see Ref. Komiske et al. (2019) for additional details). The kNN approaches the performance of these state-of-the-art classifiers and significantly outperforms a ratio of $N$-subjettiness observables Thaler and Van Tilburg (2011, 2012) designed to identify two-prong substructure. It is expected that the performance of the kNN method would improve with more sophisticated kernel density estimation techniques.
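The classifier output described above needs nothing beyond a table of distances. The following minimal sketch assumes the EMDs between each test jet and each reference jet have already been computed; the function name `knn_scores`, the label convention (1 for signal, 0 for background), and the toy numbers in the usage note are hypothetical.

```python
def knn_scores(dist_to_refs, ref_labels, k):
    """Count signal labels among the k nearest reference events for each query.

    dist_to_refs[q][r] is the distance from query q to reference r;
    ref_labels[r] is 1 for signal and 0 for background.
    """
    scores = []
    for row in dist_to_refs:
        # Indices of the k references closest to this query.
        nearest = sorted(range(len(row)), key=row.__getitem__)[:k]
        scores.append(sum(ref_labels[r] for r in nearest))
    return scores
```

With `ref_labels = [1, 1, 0, 0]`, a query whose distances are `[0.1, 0.2, 5.0, 6.0]` has both of its two nearest neighbors in the signal class. Sweeping a threshold on the returned count from 0 to `k` traces out a ROC curve.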
It is worth noting that while searching through a large reference set of events to find neighbors naively requires every possible pairwise comparison, in a metric space the triangle inequality can provide a great deal of simplification. Specialized data structures known as metric trees Uhlmann (1991); Yianilos (1993); Brin (1995); Bozkaya and Özsoyoglu (1999) have been developed to achieve query times that are approximately logarithmic in the size of the dataset. While we use direct searches throughout this letter, this is not a fundamental limitation and we leave metric tree query optimizations to future work.
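The pruning idea behind such data structures can be shown with a single reference point. The sketch below is not a metric tree; it is a simpler pivot-based search that uses the same triangle-inequality bound, $|d(q, p) - d(p, x)| \leq d(q, x)$, to skip candidates without evaluating the metric on them. The function name `nearest_with_pivot` and its arguments are hypothetical, and `metric` stands in for an EMD evaluation.

```python
def nearest_with_pivot(query, refs, pivot, pivot_dists, metric):
    """Exact nearest neighbor of `query` in `refs` using one pivot.

    pivot_dists[i] must equal metric(pivot, refs[i]) and is computed once,
    independently of the query. Returns (index, distance, metric evaluations).
    """
    d_qp = metric(query, pivot)
    evals = 1
    # Triangle inequality: |d(q, p) - d(p, x)| is a lower bound on d(q, x).
    order = sorted(range(len(refs)), key=lambda i: abs(d_qp - pivot_dists[i]))
    best_i, best_d = None, float("inf")
    for i in order:
        if abs(d_qp - pivot_dists[i]) >= best_d:
            break  # every remaining candidate is at least this far away
        d = metric(query, refs[i])
        evals += 1
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d, evals
```

How much work is saved depends on how tight the lower bound is for the chosen pivot. Metric trees apply the same bound recursively with many pivots, which is where the approximately logarithmic query times come from.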
Once a space has been equipped with a metric, it is natural to ask about the structure of the induced manifold. The most basic aspect of the manifold underlying the data is its dimension, and several notions of its intrinsic dimension exist Camastra (2003). The correlation dimension Grassberger and Procaccia (1983); Kégl (2002), a type of fractal dimension, is suitable for our purposes and is defined using only pairwise distances:
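$$\dim(Q) = Q \frac{\partial}{\partial Q} \ln \sum_{i=1}^{N} \sum_{j=1}^{N} \Theta\big(\mathrm{EMD}(\mathcal{E}_i, \mathcal{E}_j) < Q\big), \tag{4}$$
where $N$ is the total number of events and the summand indicates whether event $\mathcal{E}_i$ is within EMD $Q$ of event $\mathcal{E}_j$.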
The correlation dimension is an intrinsically scale-dependent quantity, which is particularly useful as we anticipate different physical effects to dominate jets at different scales. Shown in Fig. 4 is the intrinsic dimension of our top, boson, and QCD samples over energy scales ranging from 10 GeV to 1000 GeV obtained from Eq. (4) with 25k jets. At high energy scales $Q$, the EMD is governed by the hard decay kinematics, resulting in a relatively simple manifold with low intrinsic dimension. At energy scales approaching the fragmentation and hadronization scales, the structure of the events becomes increasingly complex and the dimension correspondingly increases. It is satisfying that the dimension is relatively low for a wide range of relevant energies, which is critical for a variety of metric-based techniques such as classification and low-dimensional visualization to work effectively with a realistic amount of data.
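A scale-dependent dimension of this kind can be estimated from a matrix of pairwise distances by replacing the logarithmic derivative with a finite difference between two nearby scales. The sketch below is one such estimator under that assumption; the function names and the choice of the two scales `q_lo` and `q_hi` are hypothetical and not taken from the letter.

```python
import math

def pair_count(dist, q):
    """Number of ordered pairs (i, j), including i == j, with dist[i][j] < q."""
    return sum(1 for row in dist for d in row if d < q)

def correlation_dimension(dist, q_lo, q_hi):
    """Finite-difference slope of ln(pair count) versus ln(scale) between q_lo and q_hi."""
    rise = math.log(pair_count(dist, q_hi)) - math.log(pair_count(dist, q_lo))
    return rise / (math.log(q_hi) - math.log(q_lo))
```

As a check of the estimator, points spaced evenly along a line should give a value close to one at scales well above the spacing and well below the length of the line, since the number of neighbors within a given distance then grows linearly with that distance.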
Beyond probing its dimension, the entire space of jets can be visualized using techniques such as t-Distributed Stochastic Neighbor Embedding (t-SNE) van der Maaten and Hinton (2008); van der Maaten (2009); van der Maaten and Hinton (2012); van der Maaten (2014), which finds a low-dimensional embedding of the data that attempts to respect the distances between points. Fig. 5 shows a t-SNE embedding of 5k boson jets with GeV into a two-dimensional manifold using scikit-learn Pedregosa et al. (2011). The narrower $p_T$ range focuses the EMD on the jet substructure and was found to yield sharper visualizations, with other choices also yielding sensible results. The boson jets populate a circular subspace roughly corresponding to the energy sharing of the two prongs. As the boson jet originates from a resonant decay, the two decay quarks (after rotation) are solely described by their energy sharing, which satisfyingly emerges from the manifold of jets. Moreover, the center of the ring, distant from the annulus, tends to contain the most complex jet topologies, resulting in a type of automatic anomaly detection.
Finally, we illustrate the use of EMD for a new kind of visualization strategy that clusters events to better understand observable distributions. To describe a given set of events, such as those in a histogram bin, we find the $K$ events (called medoids) which best describe the set in that the sum of distances of each event to its closest medoid is minimized. This procedure works for any observable and provides an immediate glimpse of the types of event topologies that correspond to a given observable value. We use an iterative approximation of $K$-medoids from the pyclustering Python package Novikov (2018). As an illustration, Fig. 6 shows the jet mass for QCD jets with medoids per bin, providing a snapshot of the different event topologies at different masses.
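The medoid search can be sketched with a simple alternating scheme on a precomputed distance matrix. The code below is an illustration only and is not the pyclustering routine used in the letter: the function name `k_medoids`, the naive initialisation from the first `k` items, and the stopping rule are all assumptions made here.

```python
def k_medoids(dist, k, n_iter=100):
    """Alternating k-medoids on a precomputed distance matrix.

    Starts from the first k items and converges to a local (not necessarily
    global) minimum of the summed distance of each item to its closest medoid.
    Returns the sorted indices of the medoids.
    """
    n = len(dist)
    medoids = list(range(k))
    for _ in range(n_iter):
        # Assignment step: attach every item to its closest current medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
        # Update step: in each cluster, pick the member with the smallest summed distance.
        new = [min(members, key=lambda c: sum(dist[c][x] for x in members))
               for members in clusters.values() if members]
        if sorted(new) == sorted(medoids):
            break
        medoids = new
    return sorted(medoids)
```

Applied to the EMD matrix of the jets in one histogram bin, the returned indices would pick out the representative jets to display for that bin. Because the scheme only finds a local minimum, running it from several starting points and keeping the best result is a common precaution.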
In conclusion, we have equipped the space of events with a metric, thereby allowing a powerful suite of new tools and techniques to be directly applied to collider physics. There are many potential applications of the EMD at colliders beyond those presented here. Pileup mitigation or detector reconstruction could use the EMD to benchmark performance and thus benefit from the quantitative bounds on IRC-safe observable modifications. Further, machine learning models could be trained to optimize the EMD, related to recent efforts in generative modeling Arjovsky et al. (2017); Erdmann et al. (2018, 2019); Chekalina et al. (2018). By counting neighbors, one could also perform density estimation in the space of events Andreassen et al. (2019). While we have focused on jet substructure, analogous studies could be carried out at the event level, which may require working with composite objects such as jets for realistic computation times. It would be interesting to explore an EMD strategy for unfolding by matching detector-level and simulated events. One might consider alternatives to the EMD, such as symmetry-projected metrics Pele and Taskar (2013) or $p$-Wasserstein metrics Wasserstein (1969); Dobrushin (1970) beyond our $p = 1$ case, though our conclusions should hold for any physically sensible metric. Further, using the EMD for model-independent anomaly detection Collins et al. (2018); De Simone and Jacques (2019); Hajer et al. (2018); Heimel et al. (2019); Farina et al. (2018); Cerri et al. (2019); Collins et al. (2019) by finding isolated or clustered event topologies could empower searches for physics beyond the Standard Model at the LHC.
Acknowledgements.
We would like to thank Felice Frankel, Marat Freytsis, Paul Ginsparg, Aram Harrow, Gregor Kasieczka, Andrew Larkoski, Katherine Liu, Benjamin Nachman, Miruna Oprescu, Katherine Quinn, and Jonathan Walsh for helpful discussions. We benefited from the hospitality of the Harvard Center for the Fundamental Laws of Nature, the Fermilab Distinguished Scholars program, and the Aspen Center for Physics. This work was supported by the Office of Nuclear Physics of the U.S. Department of Energy (DOE) under Grant No. DE-SC0011090 and the DOE Office of High Energy Physics under Grant Nos. DE-SC0012567 and DE-SC0019128. JT is supported by the Simons Foundation through a Simons Fellowship in Theoretical Physics. Cloud computing resources were provided through a Microsoft Azure for Research award and through a Google Cloud allotment from the MIT Quest for Intelligence. Optimal transport provided by the 2018 Nissan Think Tank.
- [1] Sterman and Weinberg (1977): George F. Sterman and Steven Weinberg, “Jets from Quantum Chromodynamics,” Phys. Rev. Lett. 39, 1436 (1977).
- [2] Georgi and Machacek (1977): Howard Georgi and Marie Machacek, “A Simple QCD Prediction of Jet Structure in e+ e- Annihilation,” Phys. Rev. Lett. 39, 1237 (1977).
- [3] Donoghue et al. (1979): John F. Donoghue, F. E. Low, and So-Young Pi, “Tensor Analysis of Hadronic Jets in Quantum Chromodynamics,” Phys. Rev. D 20, 2759 (1979).
- [4] Altarelli (1982): Guido Altarelli, “Partons in Quantum Chromodynamics,” Phys. Rept. 81, 1 (1982).
- [5] Dokshitzer et al. (1991): Yuri L. Dokshitzer, Valery A. Khoze, and S. I. Troian, “On the concept of local parton hadron duality,” Jet Studies Workshop at LEP and HERA, Durham, England, December 9-15, 1990, J. Phys. G 17, 1585–1587 (1991).
- [6] Tkachov (1997): Fyodor V. Tkachov, “Measuring multi-jet structure of hadronic energy flow or What is a jet?” Int. J. Mod. Phys. A 12, 5411–5529 (1997), arXiv:hep-ph/9601308 [hep-ph].
- [7] Sveshnikov and Tkachov (1996): N. A. Sveshnikov and F. V. Tkachov, “Jets and quantum field theory,” High-energy physics and quantum field theory. Proceedings, 10th International Workshop, Zvenigorod, Russia, September 20-26, 1995, Phys. Lett. B 382, 403–408 (1996), arXiv:hep-ph/9512370 [hep-ph].
- [8] Hofman and Maldacena (2008): Diego M. Hofman and Juan Maldacena, “Conformal collider physics: Energy and charge correlations,” JHEP 05, 012 (2008), arXiv:0803.1467 [hep-th].
