# Exploring similarity patterns in a large scientific corpus

**Authors:** Daniel Witschard, Ilir Jusufi, Kostiantyn Kucher, Andreas Kerren

PMC · DOI: 10.1371/journal.pone.0321114 · PLOS One · 2025-04-21

## TL;DR

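This paper proposes treating similarity as both a relation between items and a property with its own distribution over the data set. It implements the idea as an embedding-based computational pipeline and a prototype visual analytics tool for exploring a large set of scientific publications.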

## Contribution

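The approach treats similarity as both (1) a relation between items and (2) a property in its own right, with a specific distribution over the data set. This framing targets questions that clustering and dimensionality reduction are not necessarily well suited to answer: how the result changes when the similarity criteria change, and how items are linked together by similarity relations.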
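
The two views of similarity can be illustrated with a minimal sketch. This is not the authors' pipeline: the cosine measure, the threshold value, the function names, and the random toy embeddings are all assumptions made for illustration. The edge list shows similarity as a relation, and the histogram shows similarity as a property distributed over the data set.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    # Clip to guard against tiny floating-point overshoot beyond [-1, 1].
    return np.clip(unit @ unit.T, -1.0, 1.0)


def similarity_edges(sim: np.ndarray, threshold: float) -> list[tuple[int, int]]:
    """Similarity as a relation: pairs (i, j), i < j, at or above the threshold."""
    i, j = np.triu_indices(sim.shape[0], k=1)
    keep = sim[i, j] >= threshold
    return list(zip(i[keep].tolist(), j[keep].tolist()))


def similarity_distribution(sim: np.ndarray, bins: int = 10):
    """Similarity as a property: histogram over all distinct item pairs."""
    i, j = np.triu_indices(sim.shape[0], k=1)
    return np.histogram(sim[i, j], bins=bins, range=(-1.0, 1.0))


# Toy data (assumption): 6 items with random 8-dimensional embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 8))

sim = cosine_similarity_matrix(embeddings)
edges = similarity_edges(sim, threshold=0.3)      # hypothetical threshold
counts, bin_edges = similarity_distribution(sim)
```

Lowering `threshold` can only add edges, never remove them, which is one simple way to see how the set of links depends on the chosen criterion.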

## Key findings

- An embedding-based computational pipeline was developed for similarity-based exploration of scientific publications.
- A prototype visual analytics tool was created to support interactive similarity analysis.
- Two use cases demonstrate the method's potential on a large set of scientific publications, and the paper also discusses the strengths and limitations of the approach.
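
The question of how results change when the similarity criteria change can be sketched as follows. The summary does not describe how the pipeline combines aspects, so everything here is an assumption: the two hypothetical aspects (`"text"` and `"citations"`), the weighted average of per-aspect cosine similarities, and the Jaccard overlap of top-k neighbor sets as a measure of change.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return np.clip(unit @ unit.T, -1.0, 1.0)


def combined_similarity(aspects: dict[str, np.ndarray],
                        weights: dict[str, float]) -> np.ndarray:
    """Weighted average of per-aspect similarity matrices (weights are normalized)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = next(iter(aspects.values())).shape[0]
    combined = np.zeros((n, n))
    for name, weight in weights.items():
        combined += (weight / total) * cosine_similarity_matrix(aspects[name])
    return combined


def top_k_neighbors(sim: np.ndarray, item: int, k: int) -> set[int]:
    """Indices of the k items most similar to `item`, excluding the item itself."""
    scores = sim[item].copy()
    scores[item] = -np.inf
    return set(np.argsort(scores)[::-1][:k].tolist())


def neighbor_overlap(sim_a: np.ndarray, sim_b: np.ndarray, item: int, k: int) -> float:
    """Jaccard overlap of an item's top-k neighbors under two similarity criteria."""
    a = top_k_neighbors(sim_a, item, k)
    b = top_k_neighbors(sim_b, item, k)
    return len(a & b) / len(a | b)


# Toy data (assumption): two hypothetical aspect embeddings for the same 8 items.
rng = np.random.default_rng(1)
aspects = {
    "text": rng.normal(size=(8, 16)),
    "citations": rng.normal(size=(8, 16)),
}

text_only = combined_similarity(aspects, {"text": 1.0, "citations": 0.0})
mixed = combined_similarity(aspects, {"text": 0.5, "citations": 0.5})
overlap = neighbor_overlap(text_only, mixed, item=0, k=3)
```

An overlap close to 1 means the item's neighborhood is stable under the change of criteria, while a value close to 0 means the neighborhood is largely replaced.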

## Abstract

Similarity-based analysis is a common and intuitive tool for exploring large data sets. For instance, grouping data items by their level of similarity, regarding one or several chosen aspects, can reveal patterns and relations from the intrinsic structure of the data and thus provide important insights in the sense-making process. Existing analytical methods (such as clustering and dimensionality reduction) tend to target questions such as “Which objects are similar?”; but since they are not necessarily well-suited to answer questions such as “How does the result change if we change the similarity criteria?” or “How are the items linked together by the similarity relations?” they do not unlock the full potential of similarity-based analysis—and here we see a gap to fill. In this paper, we propose that the concept of similarity could be regarded as both: (1) a relation between items, and (2) a property in its own, with a specific distribution over the data set. Based on this approach, we developed an embedding-based computational pipeline together with a prototype visual analytics tool which allows the user to perform similarity-based exploration of a large set of scientific publications. To demonstrate the potential of our method, we present two different use cases, and we also discuss the strengths and limitations of our approach.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12011216/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12011216/full.md

## References

60 references — full list in the complete paper: https://tomesphere.com/paper/PMC12011216/full.md

---
Source: https://tomesphere.com/paper/PMC12011216