Dissecting Jet-Tagger Through Mechanistic Interpretability
Saurabh Rai, Sanmay Ganguly

TL;DR
This paper performs a mechanistic interpretability analysis of a Particle Transformer trained on jet classification, identifying a sparse circuit that captures most of the model's performance and reveals physically meaningful internal representations.
Contribution
It uncovers a minimal six-head circuit responsible for jet classification, linking model components to physical substructure observables in jet physics.
Findings
A six-head circuit recovers most model performance.
The residual stream aligns with the energy correlator basis.
The model encodes 2-prong substructure observables more strongly than 3-prong ones.
Abstract
Mechanistic interpretability seeks to reverse engineer a trained neural network by identifying the minimal subset of internal components. We perform a mechanistic interpretability analysis of the Particle Transformer architecture, trained on the Top Quark Tagging reference dataset, with the goal of identifying the computational circuit responsible for jet classification and characterizing the physical content of its internal representations. Combining zero ablation, path patching with two complementary on-manifold corruption strategies, and linear probing of the residual stream, we identify a sparse six-head circuit that recovers the great majority of the full model performance while admitting a clean source-relay-readout interpretation. In this circuit, a single early-layer head serves as the primary causal source, a cluster of middle-layer heads acts as relays selectively attending to…
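As an illustration of the zero-ablation step named in the abstract, the following is a minimal sketch on a toy single-layer attention model, not the authors' Particle Transformer or their code. All names, shapes, and random weights here are hypothetical, and uniform attention is assumed so that the example stays short: each head's contribution to the residual stream is set to zero in turn, and the resulting change in the class logits is recorded.

```python
import numpy as np

# Toy setup: every size and weight below is a hypothetical stand-in.
rng = np.random.default_rng(0)
n_tokens, d_model, n_heads, d_head, n_classes = 5, 8, 4, 2, 2

x = rng.normal(size=(n_tokens, d_model))          # particle embeddings
w_v = rng.normal(size=(n_heads, d_model, d_head)) # per-head value projections
w_o = rng.normal(size=(n_heads, d_head, d_model)) # per-head output projections
w_cls = rng.normal(size=(d_model, n_classes))     # linear classifier head


def head_contributions(x, w_v, w_o):
    """Return each head's additive write to the residual stream.

    Assumes uniform attention, so every token receives the mean value vector.
    Output shape: (n_heads, n_tokens, d_model).
    """
    values = np.einsum("td,hdk->htk", x, w_v)
    mixed = np.broadcast_to(values.mean(axis=1, keepdims=True), values.shape)
    return np.einsum("htk,hkd->htd", mixed, w_o)


def classify(x, keep_heads=None):
    """Compute class logits, zero-ablating every head not in keep_heads."""
    per_head = head_contributions(x, w_v, w_o)
    if keep_heads is not None:
        mask = np.zeros(n_heads)
        for h in keep_heads:
            mask[h] = 1.0
        per_head = per_head * mask[:, None, None]
    residual = x + per_head.sum(axis=0)           # residual stream after attention
    return residual.mean(axis=0) @ w_cls          # mean-pool tokens, then classify


full = classify(x)

# Zero-ablate one head at a time and measure the shift in the logits.
effects = {}
for h in range(n_heads):
    others = [k for k in range(n_heads) if k != h]
    effects[h] = float(np.linalg.norm(full - classify(x, keep_heads=others)))

for h, effect in sorted(effects.items(), key=lambda kv: -kv[1]):
    print(f"head {h}: logit shift {effect:.3f}")
```

In a real analysis the logit shift would be replaced by a task metric evaluated over a dataset, and heads with large effects would become candidates for the circuit; path patching and linear probing would then refine that picture.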
