Anomaly-aware summary statistic from data batches
Gaia Grosso

TL;DR
This paper introduces a parallelized approach to the New Physics Learning Machine (NPLM) for large-scale data analysis, enhancing computational efficiency and sensitivity in detecting subtle deviations in collider data, with applications in offline and streaming scenarios.
Contribution
It proposes a batch-wise parallel NPLM method that improves resource efficiency and sensitivity, enabling anomaly detection in large or streaming datasets.
Findings
Outperforms a simple sum of the per-batch tests in sensitivity.
Can match or surpass the performance of a test run on the full dataset.
Enables anomaly-aware summary statistics in streaming data.
Abstract
Signal-agnostic data exploration based on machine learning could unveil very subtle statistical deviations of collider data from the expected Standard Model of particle physics. The beneficial impact of a large training sample on machine learning solutions motivates the exploration of increasingly large and inclusive samples of acquired data with resource efficient computational methods. In this work we consider the New Physics Learning Machine (NPLM), a multivariate goodness-of-fit test built on the Neyman-Pearson maximum-likelihood-ratio construction, and we address the problem of testing large size samples under computational and storage resource constraints. We propose to perform parallel NPLM routines over batches of the data, and to combine them by locally aggregating over the data-to-reference density ratios learnt by each batch. The resulting data hypothesis defining the…
Taxonomy
Topics: Data Quality and Management · Time Series Analysis and Forecasting · Data Management and Algorithms
