# A fragment based approach towards curating, comparing and developing machine learning models applied in photochemistry

**Authors:** Raúl Pérez-Soto, Mihai V. Popescu, Sabari Kumar, Leticia A. Gomes, Changyeob Lee, Elijah Shore, Steven A. Lopez, Robert S. Paton, Seonah Kim

PMC · DOI: 10.1039/d5sc05615b · Chemical Science · 2025-10-15

## TL;DR

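This paper introduces a molecular fragmentation approach for machine learning models that predict photophysical properties. A fragment-based delta learning scheme improves model generalizability while matching the accuracy of traditional message passing graph neural networks.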

## Contribution

The novel contribution is a molecular fragmentation strategy that accounts for exciton localization onto different chromophores of the same molecule, improves model generalizability, and provides a way to compare structural diversity between libraries.

## Key findings

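- A fragment-based delta learning approach improves model generalizability while achieving accuracy comparable to traditional message passing graph neural networks (MPGNN).
- The fragmentation strategy provides a way to compare structural diversity between libraries; the newly generated ALFAST-DB (46 432 adiabatic S0–T1 energy gaps) is compared with two datasets from the literature.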
- Focusing on chromophore moieties improves extrapolation across chemical space in photochemical predictions.
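
A fragment-level diversity comparison of the kind described above can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes each molecule has already been reduced to a list of fragment labels, and the library contents, the fragment names, and the helper functions `fragment_counts`, `jaccard`, and `coverage` are all hypothetical.

```python
from collections import Counter


def fragment_counts(library):
    """Count how many molecules in a library contain each fragment."""
    counts = Counter()
    for fragments in library.values():
        # set() so a fragment repeated within one molecule is counted once
        counts.update(set(fragments))
    return counts


def jaccard(a, b):
    """Jaccard similarity between two fragment vocabularies."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coverage(reference, other):
    """Fraction of fragments in `other` that also appear in `reference`."""
    other = set(other)
    return len(other & set(reference)) / len(other) if other else 0.0


# Hypothetical toy libraries: molecule id -> fragment labels (not real data)
lib_a = {
    "mol1": ["benzene", "alkene"],
    "mol2": ["alkene", "carbonyl"],
    "mol3": ["benzene"],
}
lib_b = {
    "molX": ["benzene", "thiophene"],
    "molY": ["carbonyl"],
}

frags_a = fragment_counts(lib_a)
frags_b = fragment_counts(lib_b)
overlap = jaccard(frags_a, frags_b)   # shared fragments / all fragments
cov = coverage(frags_a, frags_b)      # how much of lib_b is seen in lib_a

print(f"Jaccard overlap: {overlap:.2f}, coverage of B by A: {cov:.2f}")
```

Comparing libraries by fragment vocabulary rather than by whole molecules means two datasets with no molecules in common can still be recognized as covering the same chromophores.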

## Abstract

The development of graph neural networks for predicting molecular properties has garnered significant attention, as it enables the correlation of quickly computable atomic and bond descriptors with overall molecular properties. With the rising interest in photochemistry and photocatalysis as sustainable alternatives to thermal reactions, curation of virtual databases of computed photophysical properties for training of machine learning models has become popular. Unfortunately, current efforts fail to consider the exciton localization onto different chromophores of the same molecule, leading to potentially large prediction errors. Here we describe a molecular fragmentation strategy that can be used to overcome this limitation, while also providing a way to compare structural diversity between different libraries. Using a newly generated database of 46 432 adiabatic S0–T1 energy gaps (ALFAST-DB), we compare its diversity with two datasets from the literature and demonstrate that a fragment-based delta learning approach improves model generalizability while achieving accuracies comparable to those of traditional message passing graph neural network architectures (MPGNN).

In light of the development of new machine learning models for photochemical property prediction, we show that model development and database construction should focus on chromophore moieties for good extrapolation across chemical space.
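
The idea of fragment-based delta learning can be illustrated with a small sketch. Everything here is an assumption made for illustration: the baseline table `FRAGMENT_BASELINE`, the single numeric descriptor, the toy energies, and the rule that the baseline is taken from the fragment with the lowest gap (a simple stand-in for exciton localization). The paper's actual models are graph neural networks, and this sketch swaps in a one-variable least-squares fit for the learned correction.

```python
from statistics import mean

# Hypothetical baseline S0-T1 gaps per chromophore fragment (toy values)
FRAGMENT_BASELINE = {"alkene": 4.0, "benzene": 3.5, "carbonyl": 3.0}


def baseline_gap(fragments):
    """Assumption: the exciton localizes on the fragment with the lowest gap."""
    return min(FRAGMENT_BASELINE[f] for f in fragments)


def fit_delta(xs, deltas):
    """Ordinary least squares for delta = slope * x + intercept."""
    x_bar, d_bar = mean(xs), mean(deltas)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (d - d_bar) for x, d in zip(xs, deltas))
    slope = sxy / sxx
    return slope, d_bar - slope * x_bar


def predict(fragments, x, slope, intercept):
    """Prediction = fragment baseline + learned correction."""
    return baseline_gap(fragments) + slope * x + intercept


# Toy training set: (fragments, descriptor, target gap); not real data
train = [
    (["alkene"], 0, 4.0),
    (["alkene", "benzene"], 1, 3.3),
    (["benzene", "carbonyl"], 2, 2.6),
    (["carbonyl"], 3, 2.4),
]

xs = [x for _, x, _ in train]
# The model is trained on the residual (target minus baseline), not the target
deltas = [y - baseline_gap(frags) for frags, _, y in train]
slope, intercept = fit_delta(xs, deltas)

estimate = predict(["benzene", "alkene"], 2, slope, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, estimate={estimate:.3f}")
```

Because the model only has to learn the correction to a chromophore-specific baseline, a molecule built from known fragments in a new combination still starts from a physically motivated estimate.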

## Full-text entities

- **Chemicals:** C (MESH:D002244), alkene (MESH:D000475), O (MESH:D010100), F (MESH:D005461), Se (MESH:D012643), H (MESH:D006859), Br (MESH:D001966), propene (MESH:C013658), N (MESH:D009584), cyclopentene (MESH:D003517), Cl (MESH:D002713), toluene (MESH:D014050), 3-(2-(4,5-dichlorocyclohex-2-en-1-yl)ethyl)-1,1-dimethylurea (-), CH3CN (MESH:C032159), S (MESH:D013455), P (MESH:D010758), isobutene (MESH:C008176), allylbenzene (MESH:C102347)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12542913/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12542913/full.md

## References

62 references — full list in the complete paper: https://tomesphere.com/paper/PMC12542913/full.md

---
Source: https://tomesphere.com/paper/PMC12542913