Scaling Particle Collision Data Analysis
Hengkui Wu, Panpan Chi, Yongfeng Zhu, Liujiang Liu, Shuyang Hu, Yuexin Wang, Chen Zhou, Qihao Wang, Yingsi Xin, Bruce Liu, Dahao Liang, Xinglong Jia, and Manqi Ruan

TL;DR
This paper introduces BBT-Neutron, a large language model that uses binary tokenization for large-scale numerical data analysis in particle physics, and shows that it performs competitively on jet origin classification.
Contribution
The paper presents BBT-Neutron, a task-agnostic LLM with binary tokenization that effectively handles numerical scientific data, extending LLM capabilities to scientific big data analysis.
Findings
BBT-Neutron achieves performance comparable to specialized models in jet origin identification.
Binary tokenization improves LLM handling of large-scale numerical data.
Scaling data volume enhances BBT-Neutron's performance, indicating its potential as a scientific foundation model.
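The idea behind binary tokenization can be illustrated with a minimal sketch. This summary does not describe the paper's exact scheme, so the code below assumes the simplest byte-level variant: each UTF-8 byte of the input becomes one token id in a fixed vocabulary of 256. The function names and the sample record (with its `pt` and `eta` fields) are hypothetical and chosen only for illustration.

```python
# Minimal sketch of byte-level ("binary") tokenization.
# Assumption: one token per UTF-8 byte; this is NOT confirmed to be
# the exact scheme used by BBT-Neutron.

def byte_tokenize(text: str) -> list[int]:
    """Map text to one token id (0..255) per UTF-8 byte."""
    return list(text.encode("utf-8"))


def byte_detokenize(ids: list[int]) -> str:
    """Invert byte_tokenize by reassembling the bytes and decoding."""
    return bytes(ids).decode("utf-8")


# Hypothetical numeric record, as it might appear in a text dump.
sample = "pt=45.3 eta=-1.27"
ids = byte_tokenize(sample)

# Every digit, sign, and decimal point gets its own token, so a number
# is always split the same way regardless of its value.
number_ids = byte_tokenize("45.3")
```

Under this assumption, a number is never merged into variable-length subword pieces the way a learned BPE vocabulary may merge it, and the mapping is lossless: `byte_detokenize(ids)` returns the original string.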
Abstract
For decades, researchers have developed task-specific models to address scientific challenges across diverse disciplines. Recently, large language models (LLMs) have shown enormous capabilities in handling general tasks; however, these models encounter difficulties in addressing real-world scientific problems, particularly in domains involving large-scale numerical data analysis, such as experimental high energy physics. This limitation is primarily due to BPE tokenization's inefficacy with numerical data. In this paper, we propose a task-agnostic architecture, BBT-Neutron, which employs a binary tokenization method to facilitate pretraining on a mixture of textual and large-scale numerical experimental data. We demonstrate the application of BBT-Neutron to Jet Origin Identification (JoI), a critical categorization challenge in high-energy physics that distinguishes jets originating…
Taxonomy
Topics: Gamma-ray bursts and supernovae · Laser-induced spectroscopy and plasma · High-Energy Particle Collisions Research
Methods: Byte Pair Encoding
