The Infinite-Dimensional Nature of Spectroscopy and Why Models Succeed, Fail, and Mislead
Umberto Michelucci, Francesca Venturini

TL;DR
This paper attributes both the successes and the failures of machine learning models in spectroscopy to the high dimensionality of spectral data, showing that even minor distributional differences can become perfectly separable in high dimensions.
Contribution
The work provides a theoretical framework based on high-dimensional geometry to explain ML model behavior in spectroscopy, supported by experiments on synthetic and real data.
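As a worked illustration of the geometric argument, consider a deliberately simplified setting that is an assumption of this note and not taken from the paper: two classes of d-channel spectra modelled as isotropic Gaussians that differ only by a constant per-channel offset δ, with noise level σ and equal priors. The symbols d, δ, σ and Φ (the standard normal CDF) are introduced here for illustration only.

```latex
% Assumed toy model (not the authors' setup): equal-covariance Gaussian classes
P_0 = \mathcal{N}\!\left(\mathbf{0},\, \sigma^2 I_d\right), \qquad
P_1 = \mathcal{N}\!\left(\delta\,\mathbf{1}_d,\, \sigma^2 I_d\right)

% Mahalanobis distance between the class means
\Delta_d = \frac{\lVert \delta\,\mathbf{1}_d \rVert_2}{\sigma} = \frac{\delta\sqrt{d}}{\sigma}

% Bayes error of the optimal classifier with equal priors
P_{\mathrm{err}}(d) = \Phi\!\left(-\frac{\Delta_d}{2}\right)
                    = \Phi\!\left(-\frac{\delta\sqrt{d}}{2\sigma}\right)
                    \;\xrightarrow[d \to \infty]{}\; 0
\quad \text{for any fixed } \delta > 0
```

In this toy model the per-channel difference can be arbitrarily small, yet the error still vanishes as the number of channels grows. In the infinite-dimensional limit the two Gaussian measures are mutually singular because the squared shifts sum to infinity, which is the dichotomy the Feldman-Hajek theorem describes.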
Findings
On high-dimensional spectral data, models can reach near-perfect accuracy even when the features they rely on have no chemical relevance.
Noise and artefacts can be mistaken for meaningful features in high-dimensional spaces.
Theoretical analysis explains why models may highlight irrelevant spectral regions.
Abstract
Machine learning (ML) models have achieved strikingly high accuracies in spectroscopic classification tasks, often without a clear proof that those models used chemically meaningful features. Existing studies have linked these results to data preprocessing choices, noise sensitivity, and model complexity, but no unifying explanation is available so far. In this work, we show that these phenomena arise naturally from the intrinsic high dimensionality of spectral data. Using a theoretical analysis grounded in the Feldman-Hajek theorem and the concentration of measure, we show that even infinitesimal distributional differences, caused by noise, normalisation, or instrumental artefacts, may become perfectly separable in high-dimensional spaces. Through a series of specific experiments on synthetic and real fluorescence spectra, we illustrate how models can achieve near-perfect accuracy even…
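The separability effect described in the abstract can be reproduced with a small simulation. The sketch below is not the authors' experiment: it assumes two synthetic classes of pure Gaussian noise that differ only by a tiny constant offset per channel, and it scores them with the known optimal linear rule for that toy model. The function names `bayes_accuracy` and `empirical_accuracy`, the offset of 0.1, and the sample sizes are all hypothetical choices made for illustration.

```python
import math
import numpy as np


def bayes_accuracy(dim, shift, sigma=1.0):
    """Best achievable accuracy for N(0, sigma^2 I) vs N(shift * 1, sigma^2 I)
    with equal priors: Phi(shift * sqrt(dim) / (2 * sigma))."""
    z = shift * math.sqrt(dim) / (2.0 * sigma)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def empirical_accuracy(dim, shift, n_per_class=500, seed=0):
    """Simulate both classes (unit-variance noise) and classify with the
    optimal rule for this toy model: sum all channels and compare the
    result with the midpoint between the two class means."""
    rng = np.random.default_rng(seed)
    class_a = rng.standard_normal((n_per_class, dim))          # mean 0 per channel
    class_b = rng.standard_normal((n_per_class, dim)) + shift  # mean `shift` per channel
    threshold = dim * shift / 2.0
    correct = np.sum(class_a.sum(axis=1) < threshold)
    correct += np.sum(class_b.sum(axis=1) >= threshold)
    return float(correct) / (2 * n_per_class)


if __name__ == "__main__":
    # Same per-channel offset throughout; only the number of channels grows.
    for dim in (10, 100, 1000, 10000):
        print(dim,
              round(empirical_accuracy(dim, shift=0.1), 3),
              round(bayes_accuracy(dim, shift=0.1), 3))
```

Under these assumptions the per-channel difference stays fixed at a tenth of the noise standard deviation, so no single channel is informative on its own. The accuracy nevertheless rises towards 1 as channels are added, which is why a high score alone does not show that a model found chemically meaningful features.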
