# Zombie cheminformatics: extraction and conversion of Wiswesser Line Notation (WLN) from chemical documents

**Authors:** Michael Blakey, Samantha Pearman-Kanza, Jeremy G. Frey

PMC · DOI: 10.1186/s13321-024-00831-2 · Journal of Cheminformatics · 2024-04-15

## TL;DR

This paper introduces tools to extract and convert old WLN chemical notation into modern formats, improving access to historical chemical data.

## Contribution

A new WLN parser, built on the OpenBabel toolkit, and a DFA for extracting WLN strings from text are developed, offering improved accuracy and chemical coverage over prior methods.

## Key findings

- The WLN parser and DFA successfully handle most WLN rules from major manuals.
- WLN records in ChEMBL, ChemSpider and PubChem contained a notable proportion of inaccuracies, which were corrected where possible.
- The parser shows a clear jump in conversion accuracy and chemical coverage compared to previous approaches.

## Abstract

Wiswesser Line Notation (WLN) is an old line notation for encoding chemical compounds for storage and processing by computers. Whilst the notation itself has long since been surpassed by SMILES and InChI, distribution of WLN during its active years was extensive. In the context of modernising chemical data, we present a comprehensive WLN parser developed using the OpenBabel toolkit, capable of translating WLN strings into various formats supported by the library. Furthermore, we have devised a specialised Finite State Machine, constructed from the rules of WLN, enabling the recognition and extraction of chemical strings out of large bodies of text. Available open-access WLN data with corresponding SMILES or InChI notation is rare; however, ChEMBL, ChemSpider and PubChem all contain WLN records, which were used for conversion scoring. Our investigation revealed a notable proportion of inaccuracies within the database entries, and we have taken steps to rectify these errors whenever feasible.

Tools for both the extraction and conversion of WLN from chemical documents have been successfully developed. Both the Deterministic Finite Automaton (DFA) and parser handle the majority of WLN rules officially endorsed in the three major WLN manuals, with the parser showing a clear jump in accuracy and chemical coverage over previous submissions. The GitHub repository can be found here: https://github.com/Mblakey/wiswesser.
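The extraction idea can be illustrated with a toy example. The sketch below is not the paper's automaton and does not encode the real WLN rules. It assumes a heavily simplified alphabet (uppercase letters, digits, and the symbols `&`, `-`, `/`) and a made-up grammar in which a candidate must start and end on a letter or digit. All names (`extract_candidates`, `TRANSITIONS`, the state constants) are hypothetical. What it does show is the general technique: a table-driven DFA scanned over free text with greedy longest-match, backing up to the last accepting position.

```python
from typing import Iterator, Tuple

# Assumption: simplified stand-in for the WLN character set, not the full rules.
ALNUM = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
PUNCT = set("&-/")

# DFA states. Only BODY is accepting, so a match can never end on punctuation.
START, BODY, SEP, DEAD = range(4)
ACCEPTING = {BODY}

# Transition table keyed by (state, character class); missing entries go to DEAD.
TRANSITIONS = {
    (START, "alnum"): BODY,
    (BODY, "alnum"): BODY,
    (BODY, "punct"): SEP,
    (SEP, "alnum"): BODY,
    (SEP, "punct"): SEP,
}


def char_class(ch: str) -> str:
    """Map a character onto one of the three classes the toy DFA knows."""
    if ch in ALNUM:
        return "alnum"
    if ch in PUNCT:
        return "punct"
    return "other"


def step(state: int, ch: str) -> int:
    """Advance the DFA by one character."""
    return TRANSITIONS.get((state, char_class(ch)), DEAD)


def extract_candidates(text: str, min_len: int = 2) -> Iterator[Tuple[int, str]]:
    """Yield (offset, substring) for each longest match of at least min_len chars."""
    i, n = 0, len(text)
    while i < n:
        state, last_accept, j = START, -1, i
        while j < n:
            state = step(state, text[j])
            if state == DEAD:
                break
            j += 1
            if state in ACCEPTING:
                last_accept = j  # remember the end of the longest accepted prefix
        if last_accept - i >= min_len:
            yield i, text[i:last_accept]
            i = last_accept  # resume scanning after the match
        else:
            i += 1  # no usable match here; slide the window by one character


if __name__ == "__main__":
    sample = "Records list phenol as QR and acetone as 1V1 in the index."
    for offset, candidate in extract_candidates(sample):
        print(offset, candidate)
```

`QR` (phenol) and `1V1` (acetone) are commonly cited WLN examples. Because the toy grammar is so permissive, any capitalised acronym of two or more characters also matches, so in practice each candidate would still need to be validated by a full parser before conversion.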

The online version contains supplementary material available at 10.1186/s13321-024-00831-2.

## Full-text entities

- **Chemicals:** Morphine (MESH:D009020), Hydrogen (MESH:D006859), Hexahydroindan (MESH:C000616753), Phenalene (MESH:D043803), peroxide (MESH:D010545), Benzene (MESH:D001554), Phenanthrene (MESH:C031181), Anthracene (MESH:C034020), C (MESH:D002244), hydrogen peroxide (MESH:D006861), O (MESH:D010100), nitrogen (MESH:D009584), Metallocenes (MESH:D000075163), CaffeineFix (-), Cyclohexane (MESH:C506365)
- **Species:** Homo sapiens (human, species) [taxon 9606]
- **Mutations:** T-T665
- **Cell lines:** NUT — Homo sapiens (Human), Embryonal carcinoma, Cancer cell line (CVCL_WI02)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11017645/full.md

## Figures

22 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11017645/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/PMC11017645/full.md

---
Source: https://tomesphere.com/paper/PMC11017645