Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
Darius A. Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, David Shih

TL;DR
Collider-Bench is a benchmark that evaluates whether AI agents can reproduce complex particle physics analyses from LHC papers using only open tools. No current agent reliably matches physicist-in-the-loop performance.
Contribution
This work presents a novel benchmark and dataset for testing whether LLM agents can reproduce LHC analyses, a task that demands physical reasoning and domain knowledge.
Findings
No agent reliably matches physicist-in-the-loop performance.
The benchmark measures the fidelity of analysis reproduction without hand-written rubrics.
The evaluation also covers the computational cost and qualitative failure modes of agents.
Abstract
Autonomous language-model agents are increasingly evaluated on long-horizon tool-use tasks, but existing benchmarks rarely capture the complexity and nuance of real scientific work. To address this gap, we introduce Collider-Bench, a benchmark for evaluating whether LLM agents can reproduce experimental analyses from the Large Hadron Collider (LHC) using only public papers and open scientific software. Such analyses are often difficult to reproduce because the public toolchain only approximates the software used internally by the experimental collaborations, while the published papers inevitably omit implementation details needed for a faithful reconstruction. Agents must therefore rely on physical reasoning, domain knowledge, and trial-and-error to fill these gaps. Each task requires the agent to turn a published analysis into an executable simulation-and-selection pipeline and submit…
