jBOT: Semantic Jet Representation Clustering Emerges from Self-Distillation

Ho Fung Tsoi; Dylan Rankin

arXiv:2601.11719 · cs.LG · April 27, 2026

TL;DR

jBOT is a self-distillation pre-training method for jet data from the CERN LHC. Pre-training on unlabeled jets yields semantic clustering in the embedding, which supports anomaly detection, and the embedding can be fine-tuned for improved classification.

Contribution

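The paper introduces a self-distillation approach for jet data that combines local particle-level distillation with global jet-level distillation, producing semantic embeddings that support downstream tasks such as anomaly detection and classification.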

Findings

01

Pre-training on unlabeled jets leads to emergent semantic class clustering in the representation space.

02

When pre-training uses background jets only, the clustering in the frozen embedding enables anomaly detection via simple distance-based metrics.

03

Fine-tuning the pre-trained embedding improves classification performance over supervised models.
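
To make the second finding concrete, here is a minimal sketch of distance-based anomaly scoring in a frozen embedding space. The listing does not say which distance metric the authors use, so the k-nearest-neighbour score below is an assumption, as are the function name `knn_anomaly_score`, the value of `k`, and the synthetic Gaussian arrays standing in for real jet embeddings.

```python
import numpy as np

def knn_anomaly_score(background, queries, k=5):
    """Mean Euclidean distance from each query to its k nearest background embeddings.

    Hypothetical scoring rule chosen for illustration; larger means more anomalous.
    """
    # Pairwise distances, shape (n_queries, n_background)
    diffs = queries[:, None, :] - background[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # np.partition places the k smallest distances (unordered) in the first k columns
    nearest = np.partition(dists, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)

rng = np.random.default_rng(0)
# Toy stand-ins for frozen embeddings (not real jet data)
background = rng.normal(0.0, 1.0, size=(500, 8))  # reference "background" embeddings
inliers = rng.normal(0.0, 1.0, size=(50, 8))      # queries drawn like the background
outliers = rng.normal(6.0, 1.0, size=(50, 8))     # queries far from the background

inlier_scores = knn_anomaly_score(background, inliers)
outlier_scores = knn_anomaly_score(background, outliers)
```

In this toy setup the queries drawn far from the background cloud receive larger scores, so thresholding the score separates them from background-like queries.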

Abstract

Self-supervised learning, in the context of foundation model training, is a powerful pre-training method for learning feature representations without labels, which often capture generic underlying semantics from the data and can later be fine-tuned for downstream tasks. In this work, we introduce jBOT, a pre-training method based on self-distillation for jet data from the CERN Large Hadron Collider, which combines local particle-level distillation with global jet-level distillation to learn jet representations that support downstream tasks such as anomaly detection and classification. We observe that pre-training on unlabeled jets leads to emergent semantic class clustering in the representation space. The clustering in the frozen embedding, when pre-trained on background jets only, enables anomaly detection via simple distance-based metrics, and the learned embedding can be fine-tuned…
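
The abstract describes combining particle-level and jet-level distillation but does not give the loss. The sketch below shows a generic teacher-student self-distillation objective of the kind used in DINO-style methods, applied once to jet-level outputs and once to per-particle outputs. Everything specific here is an assumption for illustration: the temperatures, the equal weighting of the two terms, the EMA momentum, the array shapes, and the random logits standing in for network outputs. It is not the authors' implementation.

```python
import numpy as np

def softmax(x, temp):
    """Temperature-scaled softmax over the last axis, numerically stabilised."""
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher and student distributions, averaged over leading axes."""
    p_teacher = softmax(teacher_logits, t_teacher)  # sharper target (assumed temperature)
    log_p_student = np.log(softmax(student_logits, t_student) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def ema_update(teacher_param, student_param, momentum=0.996):
    """Teacher weights track the student as an exponential moving average (assumed momentum)."""
    return momentum * teacher_param + (1.0 - momentum) * student_param

rng = np.random.default_rng(0)
n_jets, n_particles, n_out = 4, 16, 32  # hypothetical batch and head sizes

# Random logits standing in for network outputs
student_jet = rng.normal(size=(n_jets, n_out))                # one output per jet
teacher_jet = rng.normal(size=(n_jets, n_out))
student_part = rng.normal(size=(n_jets, n_particles, n_out))  # one output per particle
teacher_part = rng.normal(size=(n_jets, n_particles, n_out))

global_loss = distill_loss(student_jet, teacher_jet)    # jet-level term
local_loss = distill_loss(student_part, teacher_part)   # particle-level term
total_loss = global_loss + local_loss                   # equal weighting is an assumption
```

In a real training loop only the student would receive gradients from `total_loss`, while the teacher would be refreshed with `ema_update` after each step.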
