Exposing the QCD Splitting Function with CMS Open Data
Andrew Larkoski, Simone Marzani, Jesse Thaler, Aashish Tripathee, Wei Xue

TL;DR
This paper demonstrates how to measure the QCD splitting function directly using jet substructure data from CMS Open Data, providing new insights into fundamental QCD properties at the LHC.
Contribution
First measurement of the QCD splitting function using CMS Open Data through jet substructure analysis, establishing a novel method for probing fundamental QCD properties.
Findings
Successful extraction of the QCD splitting function from CMS Open Data
Validation of a jet substructure observable as a probe of QCD properties
First physics analysis based on publicly released CMS data
Abstract
The splitting function is a universal property of quantum chromodynamics (QCD) which describes how energy is shared between partons. Despite its ubiquitous appearance in many QCD calculations, the splitting function cannot be measured directly since it always appears multiplied by a collinear singularity factor. Recently, however, a new jet substructure observable was introduced which asymptotes to the splitting function for sufficiently high jet energies. This provides a way to expose the splitting function through jet substructure measurements at the Large Hadron Collider. In this letter, we use public data released by the CMS experiment to study the 2-prong substructure of jets and test the 1 -> 2 splitting function of QCD. To our knowledge, this is the first ever physics analysis based on the CMS Open Data.
Andrew Larkoski
Physics Department, Reed College, Portland, OR 97202, USA
Simone Marzani
University at Buffalo, The State University of New York, Buffalo, NY 14260-1500, USA
Jesse Thaler
Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Aashish Tripathee
Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Wei Xue
Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Preprint: MIT-CTP 4891
Quantum chromodynamics (QCD), like any weakly coupled gauge theory, exhibits universal behavior in the small angle limit. When two partons become collinear in QCD, the cross section for a scattering process factorizes into a scattering cross section multiplied by a universal splitting probability, with corrections suppressed by the degree of collinearity. Collinear universality is a fundamental property of QCD and appears in many applications, most famously in deriving the DGLAP evolution equations Gribov and Lipatov (1972); Dokshitzer (1977); Altarelli and Parisi (1977) (see also Floratos et al. (1977, 1979); Gonzalez-Arroyo et al. (1979); Gonzalez-Arroyo and Lopez (1980); Curci et al. (1980); Furmanski and Petronzio (1980); Floratos et al. (1981); Hamberg and van Neerven (1992); Vogt et al. (2004); Moch et al. (2004)), and it is at the heart of the factorization theorem in hadron-hadron collisions Collins et al. (1984, 1985). In addition, parton shower generators are based on recursively applying splittings Mazzanti and Odorico (1980); Sjostrand (1982); Marchesini and Webber (1984), fixed-order subtraction schemes utilize the splitting function Ellis et al. (1981); Fabricius et al. (1981); Catani and Seymour (1997), and the jet clustering metric is based on recombination Catani et al. (1991, 1993); Ellis and Soper (1993). Collinear universality can be extended to multi-parton splittings at tree level and beyond Catani and Grazzini (1999, 2000); Bern et al. (1998, 1999); Badger and Glover (2004); Berends and Giele (1988); Mangano and Parke (1991); Campbell and Glover (1998); Del Duca et al. (2000); Birthwright et al. (2005a, b); Bern et al. (1994a, b); Bern and Chalmers (1995); Kosower and Uwer (1999); Catani et al. (2004); Bern et al. (2004); however its all-orders validity Kosower (1999); Feige and Schwartz (2014) is spoiled in the presence of Glauber modes Catani et al. (2012); Forshaw et al. (2012); Rothstein and Stewart (2016); Schwartz et al. (2017). 
More recently, jet substructure techniques Seymour (1991, 1994); Butterworth et al. (2002, 2007, 2008) have been introduced to distinguish decays of heavy particles from splittings in QCD in order to enhance the search for new physics at the Large Hadron Collider (LHC) Abdesselam et al. (2011); Altheimer et al. (2012, 2014); Adams et al. (2015).
Despite its ubiquity, however, the splitting function cannot be directly measured at a collider, since collinear universality is inseparable from the existence of collinear singularities and closely related non-perturbative fragmentation functions. Specifically, when two partons are separated by an angle $\theta$, the splitting probability takes the form
$$\mathrm{d}P_{i \to jk} = \frac{\mathrm{d}\theta}{\theta}\, \mathrm{d}z\, P_{i \to jk}(z), \tag{1}$$
where the $P_{i \to jk}(z)$ are the Altarelli-Parisi QCD splitting functions Altarelli and Parisi (1977), which depend on the momentum fraction $z$ and the parton flavors $i$, $j$, and $k$. Crucially, this expression has a real emission singularity in the $\theta \to 0$ limit, as required to cancel corresponding virtual singularities from loop diagrams. In this sense, there is no way to directly measure the splitting function in data, though there is of course overwhelming indirect evidence that $P_{i \to jk}(z)$ is a universal function from the many successes of QCD in describing high-energy scattering (see e.g. Brandelik et al. (1979); Barber et al. (1979); Berger et al. (1979); Bartel et al. (1980); Abreu et al. (2000); Akrawy et al. (1990); Abbiendi et al. (2001); Heister et al. (2004); Abdallah et al. (2003); Abbiendi et al. (2005); Achard et al. (2004)).
In this letter, we present a semi-direct method to test the splitting function in QCD by studying the 2-prong substructure of jets. Our method is based on soft drop declustering Larkoski et al. (2014) (see also Butterworth et al. (2008); Dasgupta et al. (2013a, b)), which recursively removes soft radiation from a jet until hard 2-prong substructure is found. When applied to ordinary quark- and gluon-initiated jets with no intrinsic substructure, soft drop exposes the collinear core of the jet. As shown in ref. Larkoski et al. (2015), the momentum sharing between the two prongs (denoted $z_g$) is closely related to the momentum fraction $z$ appearing in eq. (1), and the cross section for $z_g$ asymptotes to the QCD splitting function in the high-energy limit. While variants of $z_g$ have appeared in many jet substructure studies (notably the parameter in refs. Butterworth et al. (2008); Aad et al. (2015)), to the best of our knowledge, no published $z_g$ distribution has ever been presented using actual collider data, though there are preliminary results from CMS CMS (2016), STAR Kauder, and ALICE Lapidus. Here, we present the first analysis of $z_g$ using LHC data, taking advantage for the first time of public data released by the CMS experiment CMS (a).
The CMS Open Data is derived from 7 TeV center-of-mass proton-proton collisions recorded in 2010 and released to the public on the CERN Open Data Portal in November 2014 CER. The data is provided in AOD (Analysis Object Data) format, which is a CMS-specific data scheme based on the ROOT framework Brun and Rademakers (1997). Crucially for the purposes of studying jet substructure, the AOD format contains all of the particle flow candidates (PFCs) CMS (2009, b) used for jet finding within CMS Khachatryan et al. (2017), and we can apply jet substructure techniques directly on the PFCs themselves. The AOD files have an associated conditions database that includes jet energy correction (JEC) factors and recommended jet quality cuts, though no specific calibration tools for jet substructure studies. The main limitation of the 2010 CMS Open Data release is that it is not accompanied by detector-simulated Monte Carlo samples, though this issue has been partially addressed in the 2011 CMS Open Data release CMS (c). Even without a detector simulation, we can improve the robustness of our analysis by using a charged-particle subset of PFCs with better angular resolution. Overall, this study highlights the fantastic performance of CMS’s particle flow algorithm and the exciting physics opportunities made possible by this public data release.
Our analysis is based on 31.8 pb$^{-1}$ De Gruttola (2010); CMS (1900) of data collected using the Jet Primary Dataset CMS (a), which contains events selected by single-jet triggers, di-jet triggers, as well as some quad-jet and $H_T$ triggers. We use the HLT_Jet30U/50U/70U/100U/140U triggers for this analysis, which gives us near 100% efficiency to select single jets with transverse momentum $p_T > 85$ GeV. All jets in our analysis are clustered using the anti-$k_t$ jet clustering algorithm Cacciari et al. (2008) with radius parameter $R = 0.5$; we validated that the anti-$k_t$ jets reported by CMS in the AOD format agree with those found by directly clustering the PFCs with FastJet 3.1.3 Cacciari et al. (2012). To gain a more transparent understanding of the CMS data, we converted the AOD file format into our own text-based MIT Open Data (MOD) file format. Information about the MOD format as well as a broader suite of jet substructure analyses will be presented in a companion paper Tripathee et al. (2017). The substructure results shown here use the RecursiveTools 1.0.0 package from FastJet contrib 1.019 fjc.
To validate initial jet reconstruction, Fig. 1 shows the $p_T$ spectrum of the hardest jet in the event, with a pseudorapidity cut of $|\eta| < 2.4$ and transverse momentum cut of $p_T > 85$ GeV. This spectrum is obtained after applying the “loose” jet quality criteria provided by CMS as well as rescaling the jet $p_T$ by the provided JEC factors. For comparison, we show the same spectrum obtained from three parton shower generators with their default settings: Pythia 8.219 Sjostrand et al. (2007), Herwig 7.0.3 Bellm et al. (2016), and Sherpa 2.2.1 Gleisberg et al. (2009). The qualitative agreement between all four samples is excellent. Note that this spectrum is obtained after combining five different CMS triggers with prescale factors that changed over the course of the 2010 run. No kinks are observed at the transitions between the various triggers, giving us confidence that we can derive jet spectra using the trigger and prescale values provided in the AOD files.
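One way to picture the trigger combination described above is to assign each jet $p_T$ range to a single trigger and weight every accepted event by that trigger's prescale. The sketch below is illustrative only: the $p_T$ edges in `TRIGGER_RANGES`, the event dictionary layout, and the function names are hypothetical and are not taken from the analysis.

```python
from bisect import bisect_right

# HYPOTHETICAL assignment of pT ranges (lower edge in GeV) to triggers.
# The actual ranges used in the analysis are not specified in the text.
TRIGGER_RANGES = [
    (85.0, "HLT_Jet30U"),
    (115.0, "HLT_Jet50U"),
    (150.0, "HLT_Jet70U"),
    (200.0, "HLT_Jet100U"),
    (250.0, "HLT_Jet140U"),
]


def assigned_trigger(jet_pt):
    """Return the trigger responsible for this pT, or None below the first edge."""
    edges = [low for low, _ in TRIGGER_RANGES]
    idx = bisect_right(edges, jet_pt) - 1
    return TRIGGER_RANGES[idx][1] if idx >= 0 else None


def weighted_pt_histogram(events, bin_edges):
    """Fill a pT histogram, weighting each event by its trigger prescale.

    events: iterable of dicts with keys
        'pt'        hardest-jet pT in GeV,
        'fired'     set of trigger names that fired,
        'prescales' dict mapping trigger name -> prescale for that event.
    """
    counts = [0.0] * (len(bin_edges) - 1)
    for event in events:
        trigger = assigned_trigger(event["pt"])
        # Keep the event only if the trigger assigned to its pT range fired.
        if trigger is None or trigger not in event["fired"]:
            continue
        b = bisect_right(bin_edges, event["pt"]) - 1
        if 0 <= b < len(counts):
            counts[b] += event["prescales"][trigger]
    return counts
```

Because the prescale is read per event, a prescale that changes during the run is handled automatically; a smooth spectrum across the range edges is then a useful cross-check of the stored trigger information.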
We now turn to an analysis of the 2-prong substructure of the hardest jet, imposing a further cut of $p_T > 150$ GeV in order to avoid the large prescale factors present in the HLT_Jet30U/50U triggers. To partially account for the finite energy resolution and efficiency of the CMS detector, we only consider PFCs within the hardest jet above $p_T^{\min} = 1.0$ GeV. Moreover, because charged particles have better angular resolution than neutral ones, our analysis is based only on charged particles with associated tracks; we refer the reader to ref. Tripathee et al. (2017) for substructure analyses with both charged and neutral PFCs. The charged PFCs are reclustered with the Cambridge/Aachen (C/A) algorithm Wobisch and Wengler (1998); Dokshitzer et al. (1997) to form an angular-ordered clustering tree. We then apply the soft drop declustering procedure Larkoski et al. (2014), illustrated in Fig. 2, which recursively declusters the C/A tree, removing the softer branch until 2-prong substructure is found which satisfies
$$z > z_{\rm cut}\, \theta^{\beta}, \qquad z \equiv \frac{\min(p_{T1}, p_{T2})}{p_{T1} + p_{T2}}, \qquad \theta \equiv \frac{R_{12}}{R}. \tag{2}$$
Here, $p_{T1}$ and $p_{T2}$ are the transverse momenta of the two branches of the C/A tree, and $R_{12}$ is their relative rapidity-azimuth distance. Throughout our analysis, we take the momentum fraction cut $z_{\rm cut}$ and angular exponent $\beta$ to be
$$z_{\rm cut} = 0.1, \qquad \beta = 0, \tag{3}$$
such that soft drop acts like the modified mass drop tagger (mMDT) Dasgupta et al. (2013a) with $\mu = 1$. The values of $z$ and $\theta$ obtained after soft drop are referred to as $z_g$ and $\theta_g$, where the subscript $g$ is a reminder that these values were obtained after jet grooming. These two observables encode information about the two non-trivial kinematic variables in the unpolarized QCD splitting function from eq. (1). Note that $z_g$ is a ratio of $p_T$ scales, so it is not affected by the JEC factor applied to the jet as a whole. Similarly, as a dimensionless quantity, $\theta_g$ is relatively insensitive to the absolute energy scale of the PFCs, and is only mildly affected by the $p_T^{\min}$ restriction.
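The declustering loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the RecursiveTools implementation used for the results: the `Node` class, the hand-built binary tree standing in for a C/A clustering history, and the function names are all hypothetical, and the default parameters are simply the mMDT-like choice with vanishing angular exponent.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    """Hypothetical node of a C/A clustering tree (leaves have no children)."""
    pt: float
    y: float    # rapidity
    phi: float  # azimuth
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def delta_r(a: Node, b: Node) -> float:
    """Rapidity-azimuth distance, with the azimuthal difference wrapped to [0, pi]."""
    dphi = abs(a.phi - b.phi)
    if dphi > math.pi:
        dphi = 2.0 * math.pi - dphi
    return math.hypot(a.y - b.y, dphi)


def soft_drop(jet: Node, z_cut: float = 0.1, beta: float = 0.0,
              R: float = 0.5) -> Optional[Tuple[float, float]]:
    """Return (z_g, theta_g) for the first branching that passes soft drop.

    Walks down the tree from the root; whenever the branching fails
    z > z_cut * theta**beta, the softer branch is discarded and the
    procedure continues on the harder branch. Returns None if a single
    constituent is reached without finding 2-prong substructure.
    """
    node = jet
    while node.left is not None and node.right is not None:
        a, b = node.left, node.right
        z = min(a.pt, b.pt) / (a.pt + b.pt)
        theta = delta_r(a, b) / R
        if z > z_cut * theta ** beta:
            return z, theta
        node = a if a.pt >= b.pt else b
    return None
```

For example, a jet whose widest branching is a 5 GeV emission off a 120 GeV core fails the condition at the first step, so the pair returned is that of the first sufficiently symmetric branching inside the core.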
The key observable used in jet substructure analyses at ATLAS and CMS is the jet invariant mass Aad et al. (2012); Chatrchyan et al. (2013); Aaltonen et al. (2012). The track-only jet mass spectrum before and after soft drop is shown in Fig. 3 and compared to predictions from Pythia. There is reasonable qualitative agreement between the CMS Open Data and Pythia for ; below one expects deviations from the finite detector resolution of CMS and the fact that the PFCs do not include full hadron mass information. We emphasize that no additional corrections have been applied to the CMS Open Data, apart from the JEC factor needed to impose the $p_T$ criteria and the PFC restriction needed to account for finite energy resolution and efficiency. Similarly, we are showing particle-level predictions from Pythia using the default tune with no detector simulation (but the same restriction to charged hadrons with $p_T^{\min} = 1.0$ GeV). Because we do not have access to detector-simulated Monte Carlo samples, and because there is insufficient information in the AOD format to estimate systematic uncertainties, the error bars shown only include statistical uncertainties.
To see the 2-prong structure revealed by soft drop, Fig. 4 shows the double-differential track spectrum seen in the CMS Open Data. The peak towards small values of $z_g$ and $\theta_g$ reflects the double-logarithmic structure in eq. (1), since soft gluon emission from a hard quark or gluon is approximated by
$$\mathrm{d}P_{i \to ig} \simeq \frac{2 \alpha_s C_i}{\pi}\, \frac{\mathrm{d}\theta}{\theta}\, \frac{\mathrm{d}z}{z}, \tag{4}$$
where $\alpha_s$ is the strong coupling constant and $C_i$ is the Casimir factor ($C_F = 4/3$ for quarks, $C_A = 3$ for gluons). The $z_g$ distribution is cut off by $z_{\rm cut}$, which regulates the soft singularity of QCD. In principle, the $\theta_g$ distribution could extend all the way to zero, but it is cut off both by the angular resolution of the CMS detector and by non-perturbative QCD effects which are relevant for . In addition, the perturbative $\theta \to 0$ singularity in eq. (1) is regulated by a single-logarithmic form factor Larkoski et al. (2014), which we now exploit to perform analytic calculations of the $z_g$ distribution.
In perturbative QCD, $z_g$ with $\beta = 0$ is a collinear-unsafe observable and therefore not calculable order by order in an expansion in the strong coupling constant $\alpha_s$. In particular, $z_g$ is ambiguous for a jet containing a single parton, and therefore real emission singularities associated with 2 partons (where $z_g$ is well defined) cannot cancel against virtual emission singularities associated with 1 parton (where $z_g$ is ill defined). That said, we can follow the strategy outlined in refs. Larkoski and Thaler (2013); Larkoski et al. (2015) and express the normalized probability distribution $p(z_g)$ as
$$p(z_g) = \int \mathrm{d}\theta_g\, p(\theta_g)\, p(z_g \,|\, \theta_g), \tag{5}$$
where $p(\theta_g)$ is the probability distribution for $\theta_g$, and $p(z_g \,|\, \theta_g)$ is the conditional probability distribution for $z_g$ given a fixed value of $\theta_g$. While $z_g$ is collinear unsafe, the conditional probability distribution is calculable as a perturbative expansion, since any finite value of $\theta_g$ will remove the 1 parton region of phase space. By resumming the $p(\theta_g)$ distribution to all orders in $\alpha_s$, the $\theta_g \to 0$ limit is regulated, and the integral in eq. (5) yields a finite distribution for $z_g$. In this way, $z_g$ is a collinear unsafe but “Sudakov safe” observable Larkoski and Thaler (2013).
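The logic of Sudakov safety can be illustrated with a toy model. Assume, purely for illustration, a fixed-coupling resummed angular distribution of the form $a\,\theta_g^{a-1}$ on $0 < \theta_g \le 1$, where the hypothetical exponent $a > 0$ stands in for a quantity proportional to $\alpha_s$: its fixed-order expansion, $a/\theta_g$, is not integrable at small angles, while the resummed form is normalized. The conditional distribution used below is also a toy (a normalized $1/z + 1/(1-z)$ shape with no angular dependence), not the calculation of the paper.

```python
import math


def theta_g_samples(a, n=2000):
    """Quadrature points for p(theta) = a * theta**(a - 1) on (0, 1].

    The cumulative distribution is u = theta**a, so midpoints that are
    uniform in u map to theta = u**(1/a), each with equal weight 1/n.
    """
    return [((k + 0.5) / n) ** (1.0 / a) for k in range(n)]


def toy_conditional(z, theta, z_cut=0.1):
    """TOY p(z_g | theta_g): normalized 1/z + 1/(1-z) on (z_cut, 1/2].

    In this toy the angle `theta` does not affect the shape.
    """
    if not (z_cut < z <= 0.5):
        return 0.0
    norm = math.log((1.0 - z_cut) / z_cut)
    return (1.0 / z + 1.0 / (1.0 - z)) / norm


def p_zg(z, a, conditional=toy_conditional):
    """p(z_g) as the theta_g-average of the conditional distribution."""
    thetas = theta_g_samples(a)
    return sum(conditional(z, t) for t in thetas) / len(thetas)
```

In this toy, `p_zg(0.2, a)` returns the same value for any positive `a`: once the angular distribution is normalized by resummation, a conditional distribution without angular dependence passes straight through the integral, and the dependence on the coupling drops out.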
Remarkably, to lowest non-trivial order, the probability distribution for $z_g$ can be directly expressed in terms of the QCD splitting function as Larkoski et al. (2015)
$$p(z_g) = \sum_i f_i\, \overline{P}_i(z_g), \tag{6}$$
where $f_i$ is the fraction of the event sample composed of jets initiated by partons of flavor $i$ (i.e. quarks or gluons), and
$$\overline{P}_i(z) = \frac{\widetilde{P}_i(z)}{\int_{z_{\rm cut}}^{1/2} \mathrm{d}z'\, \widetilde{P}_i(z')}\, \Theta(z > z_{\rm cut}), \tag{7}$$
where
$$\widetilde{P}_i(z) = \sum_{j,k} \left[ P_{i \to jk}(z) + P_{i \to jk}(1 - z) \right]. \tag{8}$$
The distribution $\overline{P}_i(z)$ is a flavor-averaged, $z$-symmetrized, $z_{\rm cut}$-truncated, and normalized version of the QCD splitting function. Because of a supersymmetric relationship between the quark and gluon splitting functions Dokshitzer et al. (1991); Seymour (1998), $\overline{P}_i$ is the same for quarks and gluons to an excellent approximation, such that
$$p(z_g) \simeq \overline{P}_i(z_g), \tag{9}$$
and the probability distribution for $z_g$ is independent of $\alpha_s$ at leading order. In this way, measuring $z_g$ exposes the QCD splitting function. The predicted distribution can be refined by performing higher-order calculations. As in ref. Larkoski et al. (2015), we calculate $p(\theta_g)$ to modified leading-logarithmic (MLL) accuracy, which includes running coupling effects and subleading terms in the splitting functions. We also calculate $p(z_g \,|\, \theta_g)$ to leading fixed order in the collinear approximation and obtain an analytic prediction for $p(z_g)$ using eq. (5). While not shown below, the theoretical uncertainties on $p(z_g)$ can be estimated by varying the different renormalization scales that enter the calculation Tripathee et al. (2017).
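The near equality of the quark and gluon distributions can be checked numerically. The sketch below assumes one common convention for the leading-order, flavor-summed splitting functions (overall coupling factors omitted, since they cancel in the normalization) and five light quark flavors for gluon splitting; the function names, the midpoint-rule integrator, and the choice of `n_f` are illustrative assumptions rather than details taken from the paper.

```python
C_F, C_A, T_R = 4.0 / 3.0, 3.0, 0.5


def p_quark(z):
    """LO quark splitting function (assumed convention, soft pole at z -> 0)."""
    return C_F * (1.0 + (1.0 - z) ** 2) / z


def p_gluon(z, n_f=5):
    """LO gluon splitting function summed over gg and q-qbar final states
    (assumed convention; n_f = 5 is an illustrative choice)."""
    return (C_A * (2.0 * (1.0 - z) / z + z * (1.0 - z))
            + T_R * n_f * (z ** 2 + (1.0 - z) ** 2))


def symmetrized(p, z):
    """z-symmetrized splitting function P(z) + P(1 - z)."""
    return p(z) + p(1.0 - z)


def integrate(f, lo, hi, n=2000):
    """Midpoint-rule integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))


def p_bar(p, z, z_cut=0.1):
    """Symmetrized, z_cut-truncated splitting function normalized on (z_cut, 1/2]."""
    if not (z_cut < z <= 0.5):
        return 0.0
    norm = integrate(lambda x: symmetrized(p, x), z_cut, 0.5)
    return symmetrized(p, z) / norm
```

Evaluating `p_bar(p_quark, z)` and `p_bar(p_gluon, z)` across the allowed range shows the two normalized shapes differing only at the level of a few percent under these assumptions, which is why the leading-order prediction is nearly insensitive to the quark/gluon composition of the sample.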
In Fig. 5, we show the $z_g$ distribution for our jet selection, comparing the analytic expression in eq. (5) (which extends eq. (9) to MLL accuracy), three parton shower generators, and the CMS Open Data. Strictly speaking, the theoretical calculation described above should be modified Chang et al. (2013a, b) to account for the fact that the current analysis is based only on charged particles; for this reason, we show $p(z_g)$ without its uncertainty band to emphasize its qualitative nature. Notwithstanding the above, the CMS Open Data agrees very well with the theory calculation as well as with the Monte Carlo parton showers, and the characteristic $1/z$ behavior expected from the QCD splitting function is seen in all distributions. The one point where there is a noticeable (but expected) difference between the open data and the parton showers is at , which corresponds to jets that have only one constituent after soft drop. Because close-by particles can be reconstructed as a single PFC due to finite angular resolution, the CMS Open Data is expected to have more “one particle” jets than the parton shower generators. We have evidence that the small difference between the parton showers and the theory distribution at is due to growing logarithms of that are not resummed in our MLL approach. We verified that these discrepancies are suppressed for and enhanced for , consistent with this expectation.
The CMS Open Data represents a new chapter in particle physics, since for the first time, high-quality collider data has been released to scientists not affiliated with an experimental collaboration. In this paper, we applied state-of-the-art jet substructure techniques to the CMS Open Data and exposed the QCD splitting function, which encodes the universal behavior of gauge theories in the collinear limit. This was only possible because of theoretical advances on Sudakov safe observables, which allowed us to predict the $z_g$ distribution from first principles, and the fantastic experimental performance of the CMS detector, which allowed us to perform a detailed study of the substructure of jets. We hope this letter inspires scientists outside of the LHC collaborations to incorporate CMS Open Data into their research and motivates the LHC collaborations to continue their support of open data initiatives.
Acknowledgements.
We applaud CERN for the historic launch of the Open Data Portal, and we congratulate the CMS collaboration for the fantastic performance of their detector and the high quality of the resulting public data set. We thank Alexis Romero for collaboration in the early stages of this work. We are indebted to Salvatore Rappoccio and Kati Lassila-Perini for helping us navigate the CMS software framework. We benefited from code and encouragement from Tim Andeen, Matt Bellis, Andy Buckley, Kyle Cranmer, Sarah Demers, Guenther Dissertori, Javier Duarte, Peter Fisher, Achim Geiser, Giacomo Govi, Phil Harris, Beate Heinemann, Harri Hirvonsalo, Markus Klute, Greg Landsberg, Yen-Jie Lee, Elliot Lipeles, Peter Loch, Marcello Maggi, David Miller, Ben Nachman, Christoph Paus, Alexx Perloff, Andreas Pfeiffer, Maurizio Pierini, Ana Rodriguez, Gunther Roland, Ariel Schwartzman, Liz Sexton-Kennedy, Maria Spiropulu, Nhan Tran, Ana Trisovic, Chris Tully, Marta Verweij, Mikko Voutilainen, and Mike Williams. This work is supported by the MIT Charles E. Reed Faculty Initiatives Fund. The work of JT, AT, and WX is supported by the U.S. Department of Energy (DOE) under grant contract numbers DE-SC-00012567 and DE-SC-00015476. The work of AL was supported by the U.S. National Science Foundation, under grant PHY–1419008, the LHC Theory Initiative. SM is supported by the U.S. National Science Foundation, under grants PHY–0969510 (LHC Theory Initiative) and PHY–1619867. AT is also supported by the MIT Undergraduate Research Opportunities Program.
References
- [1] Gribov and Lipatov (1972): V. N. Gribov and L. N. Lipatov, “Deep inelastic e p scattering in perturbation theory,” Sov. J. Nucl. Phys. 15, 438–450 (1972).
- [2] Dokshitzer (1977): Yuri L. Dokshitzer, “Calculation of the Structure Functions for Deep Inelastic Scattering and e+ e- Annihilation by Perturbation Theory in Quantum Chromodynamics,” Sov. Phys. JETP 46, 641–653 (1977).
- [3] Altarelli and Parisi (1977): Guido Altarelli and G. Parisi, “Asymptotic Freedom in Parton Language,” Nucl. Phys. B 126, 298 (1977).
- [4] Floratos et al. (1977): E. G. Floratos, D. A. Ross, and Christopher T. Sachrajda, “Higher Order Effects in Asymptotically Free Gauge Theories: The Anomalous Dimensions of Wilson Operators,” Nucl. Phys. B 129, 66–88 (1977) [Erratum: Nucl. Phys. B 139, 545 (1978)].
- [5] Floratos et al. (1979): E. G. Floratos, D. A. Ross, and Christopher T. Sachrajda, “Higher Order Effects in Asymptotically Free Gauge Theories. 2. Flavor Singlet Wilson Operators and Coefficient Functions,” Nucl. Phys. B 152, 493–520 (1979).
- [6] Gonzalez-Arroyo et al. (1979): Antonio Gonzalez-Arroyo, C. Lopez, and F. J. Yndurain, “Second Order Contributions to the Structure Functions in Deep Inelastic Scattering. 1. Theoretical Calculations,” Nucl. Phys. B 153, 161–186 (1979).
- [7] Gonzalez-Arroyo and Lopez (1980): Antonio Gonzalez-Arroyo and C. Lopez, “Second Order Contributions to the Structure Functions in Deep Inelastic Scattering. 3. The Singlet Case,” Nucl. Phys. B 166, 429–459 (1980).
- [8] Curci et al. (1980): G. Curci, W. Furmanski, and R. Petronzio, “Evolution of Parton Densities Beyond Leading Order: The Nonsinglet Case,” Nucl. Phys. B 175, 27–92 (1980).
