# Representation of Molecules by Sequences of Instructions

**Authors:** Karl Thurnhofer-Hemsi, Iván García-Aguilar, José David Fernández-Rodriguez, Ezequiel López-Rubio

PMC · DOI: 10.1021/acs.jcim.5c00354 · 2025-07-28

## TL;DR

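This paper introduces a molecular representation based on sequences of instructions. Every sequence decodes to a valid molecule, and small edits to a sequence usually produce small structural changes, which makes the representation well suited to computational intelligence methods.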

## Contribution

The paper proposes a chemical nomenclature built on a reduced instruction set: every instruction sequence represents a valid molecule, and slight changes to a sequence usually correspond to small modifications of the represented molecule.

## Key findings

- Every string over the reduced instruction set represents a valid molecule.
- Slight changes in an instruction sequence usually correspond to small modifications of the molecule.
- The approach is suitable for computational intelligence systems, including deep learning models.
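The first two findings can be illustrated with a minimal sketch. The instruction set below (`C`, `N`, `O`, `F`, `=`, `<`), the valence table, and the names `run`, `Molecule`, `implicit_hydrogens`, and `is_valid` are all hypothetical and chosen only for illustration; they are not the instruction set defined in the paper. The sketch assumes acyclic molecules with implicit hydrogens and treats any instruction that would exceed a valence as a no-op, which is one simple way to make every string decode to a valid structure.

```python
from dataclasses import dataclass, field

# Hypothetical toy alphabet; the paper's real instruction set is different.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}


@dataclass
class Molecule:
    atoms: list = field(default_factory=list)  # element symbols
    bonds: list = field(default_factory=list)  # (i, j, bond order)

    def free_valence(self, i):
        used = sum(order for a, b, order in self.bonds if i in (a, b))
        return MAX_VALENCE[self.atoms[i]] - used


def run(instructions):
    """Decode any token sequence into a molecule; never raises."""
    mol = Molecule()
    cursor = None      # index of the atom that new atoms attach to
    wanted_order = 1   # bond order requested for the next atom
    for ins in instructions:
        if ins == "=":
            wanted_order = 2
        elif ins == "<":
            if cursor:  # stay put when there is no earlier atom
                cursor -= 1
        elif ins in MAX_VALENCE:
            if cursor is None:
                mol.atoms.append(ins)
                cursor = 0
            else:
                # Clip the bond order to what both atoms can still accept.
                order = min(wanted_order, mol.free_valence(cursor), MAX_VALENCE[ins])
                if order > 0:  # otherwise the instruction is a no-op
                    mol.atoms.append(ins)
                    new = len(mol.atoms) - 1
                    mol.bonds.append((cursor, new, order))
                    cursor = new
            wanted_order = 1
        # Unknown tokens are ignored, so no string is ever rejected.
    return mol


def implicit_hydrogens(mol):
    return sum(mol.free_valence(i) for i in range(len(mol.atoms)))


def is_valid(mol):
    return all(mol.free_valence(i) >= 0 for i in range(len(mol.atoms)))
```

In this toy language, `run(["C", "C", "O"])` and `run(["C", "C", "N"])` differ by one token and decode to chains that differ in a single atom, while a sequence such as `["F", "C", "<", "C"]` still decodes cleanly because the last instruction finds no free valence on the fluorine and is skipped.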

## Abstract

The processing of chemical information by computational intelligence
methods faces the challenge of the structural complexity of molecular
graphs. These graphs are not amenable to being represented in a suitable
way for such methods. The most popular representation is the SMILES
notation standard. However, it comes with some limitations, such as
the abundance of nonvalid strings and the fact that similar strings
often represent very different molecules. In this work, a completely
different approach to chemical nomenclature is presented. A reduced
instruction set is defined, and the language of all strings that are
sequences of such instructions is considered. Instructions provide
the means to incrementally add atoms and modify the connectivity of
the chemical bonds of atoms to be inserted. Instructions are carefully
crafted to guarantee that all strings of this language are valid,
i.e., each string represents a molecule. Moreover, slight changes
in a string usually correspond to small modifications in the represented
molecule. Therefore, this approach is appropriate for use in state-of-the-art
computational intelligence systems for chemical information processing,
including deep learning models.
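The abundance of nonvalid SMILES strings mentioned in the abstract can be made concrete with a small syntactic check. The function below is a hypothetical helper, not a SMILES parser: it only tests two necessary conditions (balanced branch parentheses and paired ring-closure digits), skips the contents of bracket atoms, and does not treat `%nn` ring closures specially. Passing the check does not prove that a string is a valid SMILES.

```python
def passes_basic_checks(smiles: str) -> bool:
    """Return False when parentheses or ring-closure digits cannot pair up."""
    depth = 0            # current branch nesting level
    open_rings = set()   # ring-closure digits seen an odd number of times
    in_bracket = False   # inside a bracket atom such as [NH4+]
    for ch in smiles:
        if in_bracket:
            if ch == "]":
                in_bracket = False
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a branch closed before it was opened
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first sighting opens, second closes
    return depth == 0 and not open_rings and not in_bracket


def single_deletions(smiles: str):
    """All strings obtained by deleting exactly one character."""
    return [smiles[:i] + smiles[i + 1:] for i in range(len(smiles))]
```

Applying `passes_basic_checks` to each string from `single_deletions("C1CCCCC1")` (cyclohexane) shows that removing either ring-closure digit already breaks the string, which is the kind of fragility an always-valid instruction language is designed to avoid.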

## Full-text entities

- **Chemicals:** triethylamine (MESH:C016162), 4-Bromopentanal Molecule (-), (S)-(+)-2-butanol (MESH:C043958), O (MESH:D010100), C (MESH:D002244), Flavone (MESH:C043562), F (MESH:D005461), ZINC (MESH:D015032), halogen (MESH:D006219), hydrogen (MESH:D006859), isobutylamine (MESH:C053521), N (MESH:D009584), 2,4,5-trichlorophenol (MESH:C009534), 1,4-hexadiene (MESH:C032726), Br (MESH:D001966), Cl (MESH:D002713), boron (MESH:D001895), cyclohexane (MESH:C506365), Propane (MESH:D011407), isobutyric acid (MESH:C020380), hydroxyacetaldehyde (MESH:C010972), I (MESH:D007455)

## Figures

45 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12344769/full.md

---
Source: https://tomesphere.com/paper/PMC12344769