# Fast and Interpretable Machine Learning Modeling of Atmospheric Molecular Clusters

**Authors:** Lauri Seppäläinen, Jakub Kubečka, Jonas Elm, Kai R. Puolamäki

PMC · DOI: 10.1021/acs.jpca.5c06950 · The Journal of Physical Chemistry A · 2026-01-15

## TL;DR

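This paper shows that k-nearest neighbor (k-NN) regression with chemically informed distance metrics predicts properties of atmospheric molecular clusters at near-chemical accuracy, at a computational cost orders of magnitude lower than kernel ridge regression (KRR).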

## Contribution

The paper introduces k-NN regression as a fast and interpretable alternative to kernel ridge regression (KRR) models for atmospheric cluster analysis.

## Key findings

- k-NN models achieve near-chemical accuracy on the QM9 benchmark set and on large atmospheric cluster data sets, scaling to more than 250,000 entries.
- k-NN models reduce computational time by orders of magnitude compared to KRR models.
- k-NN models appear to extrapolate to larger unseen clusters with minimal error, often nearing 1 kcal/mol.

## Abstract

Understanding how atmospheric molecular clusters form and grow
is key to resolving one of the biggest uncertainties in climate modeling:
the formation of new aerosol particles. While quantum chemistry offers
accurate insights into these early-stage clusters, its steep computational
costs limit large-scale exploration. In this work, we present a fast,
interpretable, and surprisingly powerful alternative: the k-nearest neighbor (k-NN) regression
model. By leveraging chemically informed distance metrics, including
a kernel-induced metric and one learned via metric learning for kernel
regression (MLKR), we show that simple k-NN models
can rival more complex kernel ridge regression (KRR) models in accuracy
while reducing computational time by orders of magnitude. We perform
this comparison with the well-established Faber–Christensen–Huang–Lilienfeld
(FCHL19) molecular descriptor; however, other descriptors (e.g., FCHL18,
MBDF, and CM) can be shown to have similar performance. Applied to
both simple organic molecules in the QM9 benchmark set and large data
sets of atmospheric molecular clusters (sulfuric acid–water
and sulfuric–multibase–base systems), our k-NN models achieve near-chemical accuracy, scale seamlessly to data
sets with over 250,000 entries, and even appear to extrapolate to
larger unseen clusters with minimal error (often nearing 1 kcal/mol).
With built-in interpretability and straightforward uncertainty estimation,
this work positions k-NN as a potent tool for accelerating
discovery in atmospheric chemistry and beyond.
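
To make the idea in the abstract concrete, the following is a minimal sketch of k-NN regression under a kernel-induced distance, commonly written as d(x, y) = sqrt(k(x, x) + k(y, y) - 2 k(x, y)). Everything specific here is an assumption for illustration: the Gaussian kernel, the random vectors standing in for FCHL19 descriptors, the function names, and the use of the neighbor standard deviation as a simple uncertainty proxy. It is not the authors' implementation, and it does not cover the MLKR-learned metric.

```python
import numpy as np


def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def kernel_distance(A, B, sigma=1.0):
    """Kernel-induced distance sqrt(k(x,x) + k(y,y) - 2 k(x,y))."""
    # For a Gaussian kernel k(x, x) = 1, so the expression reduces to 2 - 2 k(x, y).
    K = gaussian_kernel(A, B, sigma)
    return np.sqrt(np.maximum(2.0 - 2.0 * K, 0.0))


def knn_predict(X_train, y_train, X_query, k=3, sigma=1.0):
    """Predict each query as the mean of its k nearest training targets.

    Returns (mean, std, idx): the prediction, the spread of the neighbor
    targets (a crude uncertainty proxy), and the neighbor indices, which
    show which training examples the prediction is based on.
    """
    D = kernel_distance(X_query, X_train, sigma)
    idx = np.argsort(D, axis=1)[:, :k]
    neighbors = y_train[idx]
    return neighbors.mean(axis=1), neighbors.std(axis=1), idx


# Toy data: random vectors as a placeholder for real molecular descriptors.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = X_train.sum(axis=1)  # hypothetical target, not a real energy
X_query = rng.normal(size=(3, 5))

mean, std, idx = knn_predict(X_train, y_train, X_query, k=5, sigma=2.0)
for m, s, i in zip(mean, std, idx):
    print(f"prediction={m:.3f}  neighbor spread={s:.3f}  neighbors={i.tolist()}")
```

Returning the neighbor indices alongside the prediction is what makes this kind of model easy to inspect: each prediction can be traced back to specific training examples. For data sets of the size mentioned in the abstract, the brute-force distance matrix used here would likely need to be replaced by batching or a dedicated neighbor-search structure.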

## Linked entities

- **Chemicals:** sulfuric acid (PubChem CID 1118), water (PubChem CID 962)

## Full-text entities

- **Genes:** PDF (peptide deformylase, mitochondrial) [NCBI Gene 64146]
- **Chemicals:** C (MESH:D002244), MA (MESH:C027451), O (MESH:D010100), F (MESH:D005461), ethylenediamine (MESH:C031234), H (MESH:D006859), SA (MESH:C033158), TMA (MESH:C023336), Water (MESH:D014867), N (MESH:D009584), FA (MESH:C030544), QM9 (-), NA (MESH:D017942), MSA (MESH:C045880), AM (MESH:D000641), DMA (MESH:C034516), W (MESH:D014414)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12862803/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12862803/full.md

## References

71 references — full list in the complete paper: https://tomesphere.com/paper/PMC12862803/full.md

---
Source: https://tomesphere.com/paper/PMC12862803