Distributed training and scalability for the particle clustering method UCluster
Olga Sunneborn Gudnadottir, Daniel Gedon, Colin Desmarais, Karl Bengtsson Bernander, Raazesh Sainudiin, Rebeca Gonzalez Suarez

TL;DR
This paper enhances the UCluster particle clustering method by enabling scalable, distributed training using Horovod and TensorFlow v2, making it suitable for large-scale LHC data analysis.
Contribution
The paper introduces a distributed training extension for UCluster, allowing it to handle arbitrarily large datasets efficiently.
Findings
Distributed training shortens training time in proportion to the number of GPUs used.
The model maintains its clustering accuracy under the distributed training setup.
Migrating the code to TensorFlow v2 enables this scalability.
Abstract
In recent years, machine-learning methods have become increasingly important for the experiments at the Large Hadron Collider (LHC). They are utilised in everything from trigger systems to reconstruction and data analysis. The recent UCluster method is a general model providing unsupervised clustering of particle physics data that can be easily modified to provide solutions for a variety of different decision problems. In the current paper, we improve on the UCluster method by adding the option of training the model in a scalable and distributed fashion, thereby extending its utility to learn from arbitrarily large data sets. UCluster combines a graph-based neural network called ABCnet with a clustering step, using a combined loss function in the training phase. The original code is publicly available in TensorFlow v1.14 and has previously been trained on a single GPU. It shows a…
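The distributed training described in the abstract follows the synchronous data-parallel pattern: each worker (GPU) computes gradients on its own shard of the data, and the gradients are averaged across workers before the shared weights are updated. Horovod performs this averaging with an allreduce operation. The sketch below is a toy illustration of that idea in pure Python, not the authors' code: it uses neither Horovod nor TensorFlow, and the one-parameter model, the data, and the names `gradient`, `allreduce_mean`, and `train` are all hypothetical.

```python
# Toy simulation of synchronous data-parallel training.
# Assumption: a one-parameter model y = w * x stands in for the real network;
# no Horovod or TensorFlow call is made here.

def gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one data shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an allreduce average: every worker would receive this mean."""
    return sum(values) / len(values)

def train(data, num_workers, steps=200, lr=0.01):
    # Split the data into equal-sized shards, one per simulated worker.
    shard_size = len(data) // num_workers
    shards = [data[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    w = 0.0
    for _ in range(steps):
        # Each worker computes a gradient on its own shard only.
        local_grads = [gradient(w, shard) for shard in shards]
        # All workers apply the same averaged gradient, so weights stay in sync.
        w -= lr * allreduce_mean(local_grads)
    return w

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # hypothetical data with y = 3x
w_single = train(data, num_workers=1)
w_multi = train(data, num_workers=4)
```

With equal-sized shards, the average of the per-shard gradients equals the gradient over the full batch, so the one-worker and four-worker runs follow the same weight trajectory up to floating-point rounding. This is one way to see why a distributed setup of this kind can preserve model quality while dividing the per-step work among workers.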
