Outliers and anomalies in training and testing datasets for AI-powered morphometry—evidence from CT scans of the spleen

Yuriy Vasilev; Anastasia Pamova; Tatiana Bobrovskaya; Anton Vladzimirskyy; Olga Omelyanskaya; Elena Astapenko; Artem Kruchinkin; Novik Vladimir; Kirill Arzamasov

PMC · DOI: 10.3389/frai.2025.1607348 · July 15, 2025

Open Access

TL;DR

This study explores methods to detect outliers and anomalies in medical datasets used for training AI to measure organ sizes, using spleen CT scans as an example.

Contribution

The study identifies effective methods for detecting anomalies in morphometric datasets, combining visual, statistical, and machine learning approaches.

Findings

1. Visual methods like boxplots and histograms were effective for identifying outliers.

2. Machine learning algorithms such as OSVM, KNN, and autoencoders also proved useful.

3. A total of 32 outlier anomalies were detected in the spleen dataset.

Abstract

Creating training and testing datasets for machine learning algorithms to measure linear dimensions of organs is a tedious task. There are no universally accepted methods for evaluating outliers or anomalies in such datasets. This can cause errors in machine learning and compromise the quality of end products. The goal of this study is to identify optimal methods for detecting organ anomalies and outliers in medical datasets designed to train and test neural networks in morphometrics. A dataset was created containing linear measurements of the spleen obtained from CT scans. Labelling was performed by three radiologists. The sample included studies from N = 197 patients. Using visual methods (1.5 interquartile range; heat map; boxplot; histogram; scatter plot), machine learning algorithms (Isolation forest; Density-Based Spatial Clustering of Applications with…
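
The 1.5 interquartile range rule named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `iqr_outliers`, the choice of the inclusive quartile method, and the sample measurements are all assumptions made for illustration, and the values are hypothetical rather than taken from the study's dataset.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles() sorts the data internally; n=4 yields Q1, the median, and Q3.
    # The "inclusive" method is an assumption; other quartile conventions
    # can shift the fences slightly.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical spleen lengths in mm (illustrative only, not data from the study).
lengths_mm = [95, 100, 102, 105, 108, 110, 112, 115, 118, 190]
print(iqr_outliers(lengths_mm))
```

A value flagged this way is only a candidate: in a morphometric dataset it may be a labelling error or a genuinely enlarged organ, so it would typically be passed to a radiologist for review rather than dropped automatically.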

Figures: 10

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging · COVID-19 diagnosis using AI