Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

Darius A. Faroughy; Sofia Palacios Schweitzer; Ian Pang; Siddharth Mishra-Sharma; David Shih

arXiv:2605.13950 · cs.LG · May 15, 2026

TL;DR

Collider-Bench is a benchmark that evaluates whether AI agents can reproduce complex particle physics analyses from LHC papers using open tools, and it highlights the current limits of automating such work.

Contribution

This work presents a novel benchmark and dataset for testing LLM agents on reproducing LHC analyses, emphasizing physical reasoning and domain knowledge.

Findings

01. No agent reliably matches physicist-in-the-loop performance

02. Benchmark captures fidelity of analysis reproduction without hand-written rubrics

03. Evaluates computational cost and qualitative failure modes of agents

Abstract

Autonomous language-model agents are increasingly evaluated on long-horizon tool-use tasks, but existing benchmarks rarely capture the complexity and nuance of real scientific work. To address this gap, we introduce Collider-Bench, a benchmark for evaluating whether LLM agents can reproduce experimental analyses from the Large Hadron Collider (LHC) using only public papers and open scientific software. Such analyses are often difficult to reproduce because the public toolchain only approximates the software used internally by the experimental collaborations, while the published papers inevitably omit implementation details needed for a faithful reconstruction. Agents must therefore rely on physical reasoning, domain knowledge, and trial-and-error to fill these gaps. Each task requires the agent to turn a published analysis into an executable simulation-and-selection pipeline and submit…
