Quark-versus-gluon tagging in CMS Open Data with CWoLa and TopicFlow
Matthew J. Dolan, John Gargalionis, Ayodele Ore

TL;DR
This paper evaluates weakly-supervised quark-gluon jet tagging on CMS Open Data, comparing models trained on real data with fully-supervised classifiers trained on simulation, and uses TopicFlow to smooth statistical fluctuations and estimate uncertainties.
Contribution
It demonstrates the application of CWoLa and TopicFlow models to real CMS data for quark-gluon tagging, highlighting differences from simulation-based training.
Findings
Weakly-supervised models outperform fully-supervised ones on real data.
Model rankings differ between simulation and real data evaluations.
TopicFlow effectively smooths fluctuations and estimates uncertainties.
Abstract
We use the CMS Open Data to examine the performance of weakly-supervised learning for tagging quark and gluon jets at the LHC. We target Z+jet and dijet events as respective quark- and gluon-enriched mixtures and derive samples both from data taken in 2011 at 7 TeV, and from Monte Carlo. CWoLa and TopicFlow models are trained on real data and compared to fully-supervised classifiers trained on simulation. In order to obtain estimates for the discrimination power in real data, we consider three different estimates of the quark/gluon mixture fractions in the data. Compared to when the models are evaluated on simulation, we find reversed rankings for the fully- and weakly-supervised approaches. Further, these rankings based on data are robust to the estimate of the mixture fraction in the test set. Finally, we use TopicFlow to smooth statistical fluctuations in the small testing set, and…
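The abstract notes that discrimination power in real data can only be estimated once the quark/gluon mixture fractions are assumed. A minimal sketch of that reasoning, not the authors' code: if a tagger working point accepts a fraction eff_m1 and eff_m2 of the two mixed samples, and the quark fractions f1 and f2 of those samples are taken as given, the pure quark and gluon efficiencies follow from a 2x2 linear system. The function name `unmix_efficiencies` and all numerical values below are hypothetical and chosen only for illustration.

```python
def unmix_efficiencies(eff_m1, eff_m2, f1, f2):
    """Recover pure quark/gluon efficiencies from two mixed samples.

    Assumes each mixture is a linear combination of the pure classes:
        eff_m1 = f1 * eff_q + (1 - f1) * eff_g
        eff_m2 = f2 * eff_q + (1 - f2) * eff_g
    where f1 and f2 are the (assumed) quark fractions of the mixtures.
    """
    det = f1 - f2
    if det == 0:
        # Identical mixtures carry no information to separate the classes.
        raise ValueError("mixture fractions must differ")
    eff_q = ((1 - f2) * eff_m1 - (1 - f1) * eff_m2) / det
    eff_g = (f1 * eff_m2 - f2 * eff_m1) / det
    return eff_q, eff_g


if __name__ == "__main__":
    # Hypothetical working point: 65% of the quark-enriched sample and
    # 45% of the gluon-enriched sample pass the tagger cut.
    measured = (0.65, 0.45)
    # Three hypothetical fraction estimates, to see how the answer shifts.
    for f1, f2 in [(0.70, 0.30), (0.75, 0.25), (0.65, 0.35)]:
        eff_q, eff_g = unmix_efficiencies(*measured, f1, f2)
        print(f"f1={f1:.2f} f2={f2:.2f} -> eff_q={eff_q:.3f} eff_g={eff_g:.3f}")
```

Repeating the inversion for several fraction estimates, as in the loop, is one simple way to check whether a conclusion about tagger performance depends on the assumed fractions.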
Taxonomy
Topics: Particle physics theoretical and experimental studies · High-Energy Particle Collisions Research · Radiomics and Machine Learning in Medical Imaging
