Aspen Open Jets: Unlocking LHC Data for Foundation Models in Particle Physics
Oz Amram, Luca Anzalone, Joschka Birk, Darius A. Faroughy, Anna Hallin, Gregor Kasieczka, Michael Krämer, Ian Pang, Humberto Reyes-Gonzalez, David Shih

TL;DR
This paper introduces the AspenOpenJets dataset from CMS LHC data and demonstrates how pre-training a foundation model on it enhances jet generation tasks in particle physics.
Contribution
It presents a new large-scale dataset and shows how pre-training a foundation model on real collider data improves generative performance under domain shift.
Findings
Pre-training on AspenOpenJets improves jet generation quality.
The gains carry over despite significant domain shift: the model is pre-trained on real CMS data and evaluated on simulated boosted top and QCD jets from JetClass.
Foundation models benefit from real collider data pre-training.
Abstract
Foundation models are deep learning models pre-trained on large amounts of data which are capable of generalizing to multiple datasets and/or downstream tasks. This work demonstrates how data collected by the CMS experiment at the Large Hadron Collider can be useful in pre-training foundation models for high-energy physics (HEP). Specifically, we introduce the AspenOpenJets dataset, consisting of approximately 178M high-pT jets derived from CMS 2016 Open Data. We show how pre-training the OmniJet-α foundation model on AspenOpenJets improves performance on generative tasks with significant domain shift: generating boosted top and QCD jets from the simulated JetClass dataset. In addition to demonstrating the power of pre-training a jet-based foundation model on actual proton-proton collision data, we provide the ML-ready derived AspenOpenJets dataset for further public use.
Taxonomy
Topics: Particle physics theoretical and experimental studies · Superconducting Materials and Applications · Computational Physics and Python Applications
