# Context-dependent similarity searching for small molecular fragments

**Authors:** Atsushi Yoshimori, Jürgen Bajorath

PMC · DOI: 10.1186/s13321-025-01032-1 · Journal of Cheminformatics · 2025-05-26

## TL;DR

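This paper introduces a similarity search method for small molecular fragments (substituents, or R-groups) that adapts context-dependent word similarity from natural language processing and compares fragments through learned vector embeddings rather than standard descriptors alone.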

## Contribution

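The novelty lies in adapting context-dependent word similarity from NLP to chemical fragments: substituents are represented as vector embeddings generated using neural networks, so that similarity assessment takes latent fragment characteristics into account.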

## Key findings

- Context-dependent similarity searching improves performance over standard descriptor methods for molecular fragments.
- The approach can detect remote and functionally relevant substituent similarities.
- Different structural or property contexts can be used for flexible similarity queries.

## Abstract

Similarity searching is a mainstay in cheminformatics that is generally used to identify compounds with desired properties. For small molecular fragments, similarity calculations based on standard descriptors often have limited utility for establishing meaningful similarity relationships due to feature sparseness. As an alternative, we have adapted the concept of context-dependent word pair similarity from natural language processing to evaluate similarity relationships between substituents (R-groups) taking latent characteristics into account. Context-dependent similarity assessment is based on vector embeddings as fragment representations generated using neural networks. With active analogue series as a model system to establish a global structure–activity context, we demonstrate that this approach is applicable to systematic similarity searching for substituents and increases the performance of standard descriptor representations. Context-dependent similarity searching is capable of detecting remote and functionally relevant similarity relationships between substituents. Alternative search queries are introduced focusing on individual substituents within a global substituent context or individual sequences of substituents establishing a local context. For similarity searching, different structural or structure–property contexts can be established, providing opportunities for various applications.

Previously, we introduced context-dependent similarity assessment for analogue series alignment. The approach is based on the concept of context-dependent similarity of words from natural language processing. Herein, the methodology is extended for similarity searching of small molecular fragments. Context-dependent similarity searching takes latent fragment features into account, representing a new approach for chemical similarity assessment.
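The idea of context-dependent similarity described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps the neural-network embeddings for simple co-occurrence count vectors, and the toy analogue series, the `window` parameter, and the function names `context_vectors`, `cosine`, and `search` are all hypothetical. It only shows how two substituents that never occur together can still score as similar because they appear in the same substituent context.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy data (not from the paper): each analogue series is an
# ordered sequence of R-group SMILES, playing the role of a "sentence".
SERIES = [
    ["[*]C", "[*]CC", "[*]Cl", "[*]C(F)(F)F"],
    ["[*]C", "[*]CC", "[*]Br", "[*]C(F)(F)F"],
    ["[*]O", "[*]OC", "[*]N", "[*]C#N"],
]


def context_vectors(series, window=1):
    """Build a sparse count vector per substituent from its neighbours.

    A count-based stand-in for learned embeddings: each substituent is
    described by which substituents occur within `window` positions of it.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in series:
        for i, frag in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[frag][seq[j]] += 1.0
    return {frag: dict(ctx) for frag, ctx in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, vectors, top_k=3):
    """Rank all other substituents by context similarity to the query."""
    q = vectors[query]
    scores = [(frag, cosine(q, vec)) for frag, vec in vectors.items() if frag != query]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_k]


VECTORS = context_vectors(SERIES)

if __name__ == "__main__":
    for frag, score in search("[*]Cl", VECTORS):
        print(f"{frag}\t{score:.3f}")
```

In this toy set, the chloro and bromo substituents never share a series, yet the bromo substituent ranks first for the chloro query because both are flanked by the same neighbours. A learned embedding would be expected to capture such latent relationships more smoothly than raw counts, but the ranking step (cosine similarity over fragment vectors) would look much the same.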

## Full-text entities

- **Diseases:** AS (MESH:D000069295), MQN (MESH:C567116)
- **Chemicals:** AS (-), hydrogen (MESH:D006859)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12107754/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12107754/full.md

## References

3 references — full list in the complete paper: https://tomesphere.com/paper/PMC12107754/full.md

---
Source: https://tomesphere.com/paper/PMC12107754