Distributed training and scalability for the particle clustering method UCluster

Olga Sunneborn Gudnadottir; Daniel Gedon; Colin Desmarais; Karl Bengtsson Bernander; Raazesh Sainudiin; Rebeca Gonzalez Suarez

arXiv:2109.00264 · hep-ex · September 2, 2021

TL;DR

This paper enhances the UCluster particle clustering method by enabling scalable, distributed training using Horovod and TensorFlow v2, making it suitable for large-scale LHC data analysis.

Contribution

The paper introduces a distributed training extension for UCluster, allowing it to handle arbitrarily large datasets efficiently.
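
The data-parallel idea behind this kind of distributed training can be illustrated without any framework: each worker computes a gradient on its own shard of the data, the gradients are averaged across workers (an allreduce), and every worker applies the same update. The following is a minimal sketch under stated assumptions, not UCluster or Horovod code: the one-parameter least-squares model, the equal-sized shards, and all function names (`grad_mse`, `shard`, `allreduce_mean`, `distributed_step`) are hypothetical and chosen only for illustration.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the toy model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def shard(data, num_workers):
    """Split data into equal contiguous shards (assumes len(data) divides evenly)."""
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]


def allreduce_mean(values):
    """Stand-in for an allreduce: every worker ends up with the mean value."""
    return sum(values) / len(values)


def distributed_step(w, xs, ys, num_workers, lr):
    """One synchronous data-parallel gradient-descent step."""
    x_shards = shard(xs, num_workers)
    y_shards = shard(ys, num_workers)
    # Each (simulated) worker computes a gradient on its own shard only.
    local_grads = [grad_mse(w, xs_i, ys_i) for xs_i, ys_i in zip(x_shards, y_shards)]
    # Averaging the local gradients gives the full-batch gradient for equal shards.
    return w - lr * allreduce_mean(local_grads)


# Toy data generated by y = 2 * x (hypothetical, for illustration only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_single = 0.0 - 0.01 * grad_mse(0.0, xs, ys)                   # single-worker step
w_dist = distributed_step(0.0, xs, ys, num_workers=2, lr=0.01)  # two-worker step
```

With equal-sized shards the averaged gradient equals the full-batch gradient, so the two-worker step matches the single-worker step while each worker touches only half of the data.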

Findings

01

Training time decreases roughly in inverse proportion to the number of GPUs used

02

Model maintains clustering accuracy with distributed training setup

03

Code migration to TensorFlow v2 enables scalability
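
The scaling claim in the first finding can be made concrete with a little arithmetic: under ideal scaling, training on N GPUs takes 1/N of the single-GPU time, and scaling efficiency compares a measured time against that ideal. The sketch below assumes this simple model; the function names and the timing numbers are hypothetical and are not results from the paper.

```python
def ideal_time(t_single, num_gpus):
    """Training time under perfect scaling: T(N) = T(1) / N."""
    return t_single / num_gpus


def scaling_efficiency(t_single, t_measured, num_gpus):
    """Fraction of ideal speedup achieved: T(1) / (N * T_measured)."""
    return t_single / (num_gpus * t_measured)


# Hypothetical numbers, for illustration only (not taken from the paper).
t1 = 120.0          # minutes on 1 GPU
t4_measured = 32.0  # minutes on 4 GPUs

t4_ideal = ideal_time(t1, 4)
efficiency = scaling_efficiency(t1, t4_measured, 4)
```

An efficiency close to 1 indicates that communication overhead is small compared with the per-GPU compute time.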

Abstract

In recent years, machine-learning methods have become increasingly important for the experiments at the Large Hadron Collider (LHC). They are utilised in everything from trigger systems to reconstruction and data analysis. The recent UCluster method is a general model providing unsupervised clustering of particle physics data that can be easily modified to provide solutions for a variety of different decision problems. In the current paper, we improve on the UCluster method by adding the option of training the model in a scalable and distributed fashion, thereby extending its utility to learn from arbitrarily large data sets. UCluster combines a graph-based neural network called ABCnet with a clustering step, using a combined loss function in the training phase. The original code is publicly available in TensorFlow v1.14 and has previously been trained on a single GPU. It shows a…
