# Unsupervised Approaches to Finding Outliers in Caption-Represented Images

**Authors:** Jakub Zaprzałka, Magdalena Topczewska

PMC · DOI: 10.3390/e27070661 · Entropy · 2025-06-20

## TL;DR

This paper proposes unsupervised methods for detecting outliers among images represented by text captions, using multidimensional scaling (MDS) and agglomerative techniques with Euclidean and cosine distances.

## Contribution

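The paper applies MDS (metric and non-metric) and agglomerative techniques (last-p and level cutoff), each with Euclidean and cosine distance, to identify outliers in caption-represented images, and compares them against a raw-data baseline that selects the most distant record.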
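To make the agglomerative idea concrete, here is a minimal sketch, not the authors' implementation. It assumes SciPy's hierarchical clustering, whose `dendrogram` function happens to offer `truncate_mode="lastp"` and `truncate_mode="level"`; whether the paper uses SciPy, average linkage, or this reading of "last-p" is an assumption. The helper `singleton_leaves` and the toy vectors are hypothetical. The reading used here: a record that is still a singleton when only the last `p` merges remain is an outlier candidate.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage


def singleton_leaves(Z, n, truncate_mode, p):
    """Return original records that survive as single leaves after truncation.

    Hypothetical helper: leaf ids below n are original observations,
    larger ids are contracted clusters.
    """
    info = dendrogram(Z, truncate_mode=truncate_mode, p=p, no_plot=True)
    return sorted(leaf for leaf in info["leaves"] if leaf < n)


# Toy stand-ins for caption vectors: four similar directions, one orthogonal.
X = np.array([
    [1.0, 0.1, 0.0],
    [1.0, 0.2, 0.0],
    [1.0, 0.3, 0.0],
    [1.0, 0.4, 0.0],
    [0.0, 0.0, 1.0],
])

# Average linkage on cosine distance (both choices are assumptions).
Z = linkage(X, method="average", metric="cosine")

# Keep only the last two nodes of the tree; singletons left over are candidates.
candidates_lastp = singleton_leaves(Z, len(X), "lastp", 2)
print(candidates_lastp)
```

The same helper accepts `truncate_mode="level"`, in which case `p` is a depth from the final merge rather than a number of leaves.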

## Key findings

- Cosine distance was the most effective of the two distance metrics for outlier detection.
- The metric-MDS-based algorithm appeared to outperform the other methods under human evaluation.
- The proposed algorithms successfully identified outlier records among the image captions.
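As an illustration of the metric-MDS route, the sketch below embeds a cosine distance matrix into two dimensions with a plain NumPy implementation of stress majorization (SMACOF with unit weights) and flags the point farthest from the embedding's centroid. This summary does not say how the paper turns an embedding into an outlier decision, so the centroid rule, the function names, and the toy data are all assumptions.

```python
import numpy as np


def cosine_distance_matrix(X):
    """Pairwise cosine distances between the rows of X (rows must be non-zero)."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    D = 1.0 - unit @ unit.T
    np.fill_diagonal(D, 0.0)
    return np.clip(D, 0.0, None)  # remove tiny negative rounding errors


def metric_mds(delta, dim=2, iters=300, seed=0):
    """Metric MDS by repeated Guttman transforms (SMACOF, unit weights)."""
    n = delta.shape[0]
    Y = np.random.default_rng(seed).standard_normal((n, dim))
    for _ in range(iters):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1))
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(d > 0, delta / d, 0.0)
        B = -ratio
        B[np.diag_indices(n)] = ratio.sum(axis=1)
        Y = B @ Y / n
    return Y


def farthest_from_centroid(Y):
    """Assumed outlier rule: the embedded point farthest from the centroid."""
    return int(np.argmax(np.linalg.norm(Y - Y.mean(axis=0), axis=1)))


# Toy stand-ins for caption vectors: four similar directions, one orthogonal.
X = np.array([
    [3.0, 1.0, 0.0],
    [2.0, 1.0, 0.0],
    [3.0, 2.0, 0.0],
    [4.0, 1.0, 0.0],
    [0.0, 0.0, 5.0],
])
delta = cosine_distance_matrix(X)
Y = metric_mds(delta)
outlier = farthest_from_centroid(Y)
print(outlier)
```

A non-metric variant would fit only the rank order of the dissimilarities rather than their values; it is not shown here.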

## Abstract

Both supervised and unsupervised machine learning algorithms are often based on regression to the mean. However, the mean can easily be biased by unevenly distributed data, i.e., outlier records. Batch normalization methods address this problem to some extent, but they also influence the data. In text-based data, the problem is even more pronounced, as distance distinctions between outlier records diminish with increasing dimensionality. The ultimate solution to achieving unbiased data is identifying the outliers. To address this issue, multidimensional scaling (MDS) and agglomerative-based techniques are proposed for detecting outlier records in text-based data. For both methods, two of the most common distance metrics are applied: Euclidean distance and cosine distance. Furthermore, in the MDS approach, both metric and non-metric versions of the algorithm are used, whereas in the agglomerative approach, the last-p and level cutoff techniques are applied. The methods are also compared with a raw-data-based method, which selects the most distant element from the others based on a given distance metric. Experiments were conducted on overlapping subsets of a dataset containing roughly 2000 records of descriptive image captions. The algorithms were also compared in terms of efficiency with a proposed algorithm and evaluated through human judgment based on the described images. Unsurprisingly, the cosine distance turned out to be the most effective distance metric. The metric-MDS-based algorithm appeared to outperform the others based on human evaluation. The presented algorithms successfully identified outlier records.
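The raw-data baseline mentioned in the abstract, which picks the record most distant from the others under a given metric, can be sketched in a few lines. This is a minimal illustration under stated assumptions: captions are turned into bag-of-words count vectors, and "most distant" is taken to mean the largest mean distance to all other records. Neither choice is confirmed by this summary, and the captions and function names are made up for the example.

```python
import math
from collections import Counter


def bag_of_words(caption, vocab):
    """Count vector of a caption over a fixed vocabulary (assumed representation)."""
    counts = Counter(caption.lower().split())
    return [counts[word] for word in vocab]


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention for empty vectors
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def most_distant_index(vectors, distance):
    """Index of the record with the largest mean distance to the others."""
    def mean_distance(i):
        others = [v for j, v in enumerate(vectors) if j != i]
        return sum(distance(vectors[i], v) for v in others) / len(others)
    return max(range(len(vectors)), key=mean_distance)


# Hypothetical captions: three about a dog on grass, one unrelated.
captions = [
    "a dog runs on the grass",
    "a dog plays on the grass",
    "a small dog sits on the grass",
    "a plane flies over the city",
]
vocab = sorted({word for c in captions for word in c.lower().split()})
vectors = [bag_of_words(c, vocab) for c in captions]

outlier_cos = most_distant_index(vectors, cosine_distance)
outlier_euc = most_distant_index(vectors, euclidean_distance)
print(captions[outlier_cos], "|", captions[outlier_euc])
```

On this toy set both metrics select the plane caption. Cosine distance depends only on the direction of the count vectors, so two captions with the same word proportions but different lengths are at distance zero, whereas Euclidean distance separates them.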

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12294764/full.md

## Figures

15 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12294764/full.md

## References

23 references — full list in the complete paper: https://tomesphere.com/paper/PMC12294764/full.md

---
Source: https://tomesphere.com/paper/PMC12294764