# A dataset of chemical reaction pathways incorporating halogen chemistry

**Authors:** Minhyeok Lee, Jinyoung Jeong, Islambek Ashyrmamatov, Umit V. Ucak, Sunwoo Kim, Juyong Lee, Eunji Sim

PMC · DOI: 10.1038/s41597-025-05944-3 · Scientific Data · 2025-10-20

## TL;DR

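This paper introduces Halo8, a dataset of approximately 20 million quantum chemical calculations from 19,000 reaction pathways that adds fluorine, chlorine, and bromine chemistry, intended as training data for machine learning interatomic potentials (MLIPs).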

## Contribution

Halo8 addresses the limited halogen coverage of existing quantum chemical datasets by systematically incorporating fluorine, chlorine, and bromine chemistry into reaction pathway sampling for MLIP training.

## Key findings

- Halo8 comprises approximately 20 million quantum chemical calculations from 19,000 unique reaction pathways, covering fluorine, chlorine, and bromine chemistry.
- All calculations were performed at the ωB97X-3c level and provide energies, forces, dipole moments, and partial charges.
- Validation shows that Halo8 captures the diverse structural distortions and chemical environments essential for reactive systems.
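
To give a sense of scale, the two headline figures above imply roughly a thousand calculations per pathway on average. The sketch below is back-of-the-envelope arithmetic only: the constant and function names are illustrative, the inputs are the rounded values quoted in this summary, and the actual per-pathway distribution is not given here.

```python
# Rounded figures quoted in this summary (both are approximate).
TOTAL_CALCULATIONS = 20_000_000
REACTION_PATHWAYS = 19_000


def mean_calculations_per_pathway(total: int, pathways: int) -> float:
    """Average number of calculations per pathway (a simple ratio)."""
    return total / pathways


if __name__ == "__main__":
    mean = mean_calculations_per_pathway(TOTAL_CALCULATIONS, REACTION_PATHWAYS)
    print(round(mean))  # prints 1053
```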

## Abstract

Machine learning interatomic potentials (MLIPs) promise to revolutionize computational chemistry; however, their performance depends critically on the quality and diversity of the training data. Existing quantum chemical datasets predominantly focus on equilibrium structures and exhibit limited halogen coverage, despite halogens being present in approximately 25% of pharmaceuticals and numerous materials. We present Halo8, a comprehensive dataset that addresses this gap by systematically incorporating fluorine, chlorine, and bromine chemistry into reaction pathway sampling. Using our efficient multi-level computational workflow, which achieves a 110-fold speedup over pure DFT approaches, Halo8 comprises approximately 20 million quantum chemical calculations from 19,000 unique reaction pathways. The dataset combines recalculated Transition1x reactions with new halogen-containing molecules from GDB-13, employing systematic halogen substitution to maximize chemical diversity. All calculations were performed at the ωB97X-3c level, providing accurate energies, forces, dipole moments, and partial charges. Validation demonstrates that Halo8 captures diverse structural distortions and chemical environments essential for reactive systems, serving as a valuable resource for training MLIPs applicable to pharmaceutical discovery, materials design, and catalysis.
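
The abstract mentions systematic halogen substitution but this summary view does not describe the procedure. As a rough illustration only, the sketch below assumes that substitution means replacing hydrogen sites with F, Cl, or Br and enumerates every resulting pattern. The function `enumerate_substitutions`, the `max_sites` parameter, and the list-of-symbols representation are all hypothetical and are not taken from the paper.

```python
from itertools import combinations, product

# The three halogens named in the abstract.
HALOGENS = ("F", "Cl", "Br")


def enumerate_substitutions(atoms, max_sites=1):
    """Yield copies of `atoms` with 1..max_sites hydrogens replaced by halogens.

    `atoms` is a list of element symbols. This is an assumed, simplified
    scheme: symmetry-equivalent results are NOT deduplicated.
    """
    h_sites = [i for i, symbol in enumerate(atoms) if symbol == "H"]
    for n_sites in range(1, max_sites + 1):
        for sites in combinations(h_sites, n_sites):
            for halogens in product(HALOGENS, repeat=n_sites):
                substituted = list(atoms)
                for site, halogen in zip(sites, halogens):
                    substituted[site] = halogen
                yield substituted


if __name__ == "__main__":
    methane = ["C", "H", "H", "H", "H"]
    variants = list(enumerate_substitutions(methane, max_sites=1))
    print(len(variants))  # 4 hydrogen sites x 3 halogens = 12
```

Because the enumeration treats every hydrogen site as distinct, it produces chemically identical duplicates for symmetric molecules such as methane. A real workflow would canonicalize the structures, for example with a cheminformatics toolkit, before using them.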

## Linked entities

- **Chemicals:** fluorine (PubChem CID 24524), chlorine (PubChem CID 312), bromine (PubChem CID 24408)

## Full-text entities

- **Chemicals:** bromine (MESH:D001966), halogen (MESH:D006219), chlorine (MESH:D002713), GDB-13 (-), fluorine (MESH:D005461)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12537968/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12537968/full.md

## References

7 references — full list in the complete paper: https://tomesphere.com/paper/PMC12537968/full.md

---
Source: https://tomesphere.com/paper/PMC12537968