# A comprehensive evaluation of advanced methods for identifying structural alerts using extensive toxicity data

**Authors:** Ning-Ning Wang, Yuan-Hang He, Xin-Liang Li, Shao-Hua Shi, You-Chao Deng, Shao Liu, Dong-Sheng Cao

PMC · DOI: 10.1186/s13321-026-01157-x · Journal of Cheminformatics · 2026-01-30

## TL;DR

This paper evaluates seven tools for identifying toxic substructures in molecules using 43 toxicity datasets to determine their effectiveness and reliability.

## Contribution

The study provides a benchmark substructure set and a comprehensive comparison of substructure extraction tools for toxicity prediction.

## Key findings

- PySmash_circular performed best overall, with satisfactory results both in the information carried by its substructures and in its rule-based predictive models.
- Bioalerts and PySmash_circular generated substructures with richer information.
- All seven methods improved QSAR model performance for toxic compound recognition.

## Abstract

As the importance of structural alerts (SAs) in drug development and toxicity evaluation has become clear, many automatic substructure extraction tools based on different theoretical foundations have been reported in recent years. To compare the emphasis of these methods and the reliability of their results, we designed an evaluation of seven representative tools (Bioalerts, KRFP, MoSS, PySmash_circular, PySmash_group, PySmash_path, and SARpy) based on 43 toxicity datasets. The evaluation consists of four components: comparison of the substructures derived by the different methods, comparison of predictive models based on substructural rules, comparison of the efficiency of extracting toxic substructures, and the effect of SAs on quantitative structure–activity relationship (QSAR) predictive models. The results showed that PySmash_circular performed best overall, with satisfactory results both in the information carried by its substructures and in the rule-based predictive models. PySmash_path and Bioalerts are also recommended because they performed similarly to PySmash_circular, although they took too much time and generated too many substructures. Specifically, Bioalerts and PySmash_circular obtained substructures carrying richer information, while SARpy produced the best rule-based predictive models, although it considers only the precision (PR) value when evaluating an individual SA. Moreover, the substructures obtained by all seven methods enhanced the ability of QSAR models to recognize toxic compounds and made the models interpretable. Finally, we have made a baseline substructure set for the 43 toxicity endpoints publicly available to support faster and more accurate drug research and environmental safety assessment.
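To make the two evaluation perspectives concrete, the following is a minimal sketch of how the precision (PR) of an individual SA and a simple rule-based prediction could be computed. It is not the paper's implementation: the `Compound` class, the fragment keys, the toy labels, and the 0.6 precision threshold are all hypothetical, and substructure matching is assumed to have been done beforehand by a cheminformatics toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str
    fragments: frozenset  # hypothetical keys of substructures already matched in the compound
    toxic: bool           # toy experimental label


def alert_precision(alert, compounds):
    """Fraction of compounds containing `alert` that are labeled toxic."""
    hits = [c for c in compounds if alert in c.fragments]
    if not hits:
        return 0.0
    return sum(c.toxic for c in hits) / len(hits)


def predict_toxic(compound, alerts):
    """Rule-based model: flag a compound if it contains any selected alert."""
    return any(a in compound.fragments for a in alerts)


# Toy data for illustration only; not taken from the 43 datasets.
compounds = [
    Compound("c1", frozenset({"frag_A", "frag_B"}), True),
    Compound("c2", frozenset({"frag_A"}), True),
    Compound("c3", frozenset({"frag_B"}), False),
    Compound("c4", frozenset({"frag_A", "frag_C"}), False),
    Compound("c5", frozenset({"frag_C"}), False),
]

candidates = ("frag_A", "frag_B", "frag_C")
precisions = {a: alert_precision(a, compounds) for a in candidates}

# Keep alerts whose precision reaches an assumed threshold of 0.6.
selected = [a for a, p in precisions.items() if p >= 0.6]
predictions = {c.name: predict_toxic(c, selected) for c in compounds}
```

Selecting alerts on precision alone, as in this sketch, ignores how many toxic compounds an alert covers, which is the kind of limitation the abstract attributes to a precision-only evaluation of individual SAs.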

Scientific Contribution: Based on 43 toxicity datasets, we conducted a comprehensive evaluation of seven representative substructure extraction tools from the perspectives of both individual substructures and substructure-based models. This work not only helps users make an informed choice of substructure extraction tool, but also provides the public with a benchmark substructure set for 43 toxicity endpoints, promoting the further development of computational toxicology.
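One way extracted substructures might feed into a QSAR model is as extra binary features appended to an existing descriptor vector. The sketch below illustrates that idea only; this summary does not say how the paper combined SAs with its QSAR models, so the function names, fragment keys, and descriptor values are all assumptions.

```python
def augment_features(descriptors, fragments, alerts):
    """Append one 0/1 indicator per alert to a descriptor vector (assumed scheme)."""
    return list(descriptors) + [1 if a in fragments else 0 for a in alerts]


def fired_alerts(fragments, alerts):
    """List the alerts present in a compound, usable as a simple explanation."""
    return [a for a in alerts if a in fragments]


# Hypothetical alert keys and descriptor values, for illustration only.
alerts = ["frag_A", "frag_D"]
compound_fragments = {"frag_A", "frag_B"}

x = augment_features([0.12, 3.4], compound_fragments, alerts)
explanation = fired_alerts(compound_fragments, alerts)
```

Because each appended column corresponds to a named substructure, a prediction can be traced back to the alerts that fired, which is one plausible reading of the interpretability benefit described above.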

The online version contains supplementary material available at 10.1186/s13321-026-01157-x.

## Full-text entities

- **Diseases:** toxicity (MESH:D064420)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12922411/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12922411/full.md

## References

1 reference; full list in the complete paper: https://tomesphere.com/paper/PMC12922411/full.md

---
Source: https://tomesphere.com/paper/PMC12922411