UMIAD-EGMF: unsupervised medical image anomaly detection based on edge guidance and multi-scale flow fusion
Zhirong Li, Guangfeng Lin, Dou Zhang, Rongxin Huang, Jing Yang

TL;DR
This paper introduces a new method for detecting anomalies in medical images without needing labeled data, improving accuracy by focusing on edges and combining features at different scales.
Contribution
UMIAD-EGMF is a novel unsupervised method that uses edge guidance and multi-scale flow fusion to enhance anomaly detection in medical images.
Findings
UMIAD-EGMF outperforms existing methods on breast ultrasound, brain MRI, and head CT datasets.
The method achieves a 63.36% AUPRO on the BUSI dataset, surpassing MLFEU-net by 0.01%.
On the brain MRI dataset, UMIAD-EGMF achieves 90.83% AUPRO, outperforming MLFEU-net by 0.33%.
Abstract
Medical imaging technology has advanced rapidly in recent years; however, abnormalities in medical images are often rare and complex, making sample labels difficult to obtain for supervised learning of detection models. Existing unsupervised anomaly detection methods, which are the mainstream approaches, often struggle with issues such as blurred edges and varying scales of abnormal regions. To address these issues, a novel unsupervised method for medical image anomaly detection is proposed: unsupervised medical image anomaly detection based on edge guidance and multi-scale flow fusion (UMIAD-EGMF). This method excavates rich edge information with scale adaptation and progressively identifies discriminative information for anomaly detection. UMIAD-EGMF captures contextual information around anomaly boundaries via low-level feature fusion (enhancing boundary details with the edge…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAnomaly Detection Techniques and Applications · COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning
Introduction
In recent years, medical imaging technology has advanced significantly owing to deep learning. The traditional manual diagnosis of medical images is often influenced by a doctor’s medical experience, knowledge structure, physiological status (for example, visual fatigue), and other subjective factors. In contrast, deep learning-based methods for medical image anomaly detection can significantly reduce the influence of these human factors on the diagnosis. These methods can help doctors identify abnormalities faster and more accurately, thereby improving the diagnostic efficiency and accuracy. However, owing to the complexity of disease manifestations, unsupervised model learning and the discrimination of abnormal regions face major challenges in the detection of medical image anomalies. Specific issues include intricate blurred edge information and multi-scale characteristics of abnormal regions, as shown in Fig. 1. Intricately blurred edge information is prevalent in the early stages of some diseases and is exacerbated by shadows and scattered noise, particularly in ultrasound images. This makes it difficult for the model to accurately delineate the boundaries of abnormal regions. Multi-scale characteristics refer to the presence of abnormal regions of varying sizes in medical images, which complicate the effective processing these size differences in existing network models.Fig. 1. Influence of the intricately blurred edge information and the multi-scale nature of abnormal regions (compare with informative knowledge distillation (IKD) [1]). The left column shows the different detection results of the multi-scale anomaly region and the right column shows the different detection results of the complex edges of the anomaly region. UMIAD-EGMF: Unsupervised medical image anomaly detection based on edge guidance and multi-scale flow fusion
Existing methods [2, 3] attempt to mitigate edge blur by increasing the width and depth of the network or by modifying the concatenation and skip connection operations, which enable each layer to fully extract image features and boundary details. The high-level feature fusion and recalibration U-Net [4] was derived from the U-Net architecture by modifying skip pathways using local feature reconstruction and a feature fusion mechanism that represents detailed contextual information in high-level features. Lee et al. [5] proposed the multi-receptive field (MRF) module for feature aggregation and feature fusion. The MRF module utilizes the adaptive selection mechanism of multi-branch convolution kernels to achieve adaptive feature extraction. The MRF module fuses the input feature maps using the convolution kernels of different receptive fields, making the obtained information more comprehensive. Shao et al. [6] introduced an artificial eeural network (ANN) based on edge detection methods for edge information processing. To adapt to multi-scale abnormal regions, a deep convolutional neural network (CNN) is used to discriminate between normal and abnormal tissues using a Gaussian pyramid representation for multi-scale analysis [7], a multi-scale context-guided deep network captures both global and local contexts to guide model training [8], and a multi-scale visual deformation model employs a position-coded transform model and a local CNN to extract global and local information [9]. Another approach [10] proposed a novel multi-scale parallel-branch architecture called multi-scale parallel focal self-attention U-Net (MP-FocalUNet). On the encoder side of MP-FocalUNet, dual-scale sub-networks are used to extract information from different scales. Dai et al. [11] proposed a multi-scale multi-task model framework called MTF-UNet. It does not use different-sized convolution kernels for multi-scale feature extraction but instead uses same-sized kernels with varying numbers of convolutions to obtain multi-scale multi-level features.
However, these methods often suffer from limitations in their ability to generalize to different types of medical image datasets because of blurred edges and varying scales of abnormal regions. To address these issues, unsupervised medical image anomaly detection based on edge guidance and multi-scale flow fusion (UMIAD-EGMF) is proposed to mine the common edge information of image regions to accurately detect anomalies in different medical images. In UMIAD-EGMF, the edge guidance module (EGM) serves as a boundary prior for segmentation by extracting shallow image edge features, the edge aggregation module (EAM) enhances segmentation performance by aggregating feature information, and multi-scale flow fusion (MFF) captures common anomalous features with subtle and significant multi-dimensional information through the base distribution.
The contributions of this study are as follows:
(1) The contextual information of the boundaries of image regions can describe edge detail information from the EGM to the EAM through a hierarchical progression strategy to enhance the anomaly detection performance. The EGM can merge low-level appearance features via interactive attention weight learning, and the EAM can then employ attention mechanisms and residual connections to aggregate low- and high-level features.
(2) Multi-scale adaptation detection can eliminate the differences between various scales of image regions to fit anomalous areas using MFF. As different scales of base distributions can theoretically approximate any image region, MFF can exchange local and global information within multi-scale feature pyramids to enhance the detection and segmentation of abnormal regions.
(3) The promising results in medical anomaly detection (MAD) demonstrate that the proposed UMIAD-EGMF outperforms the existing state-of-the-art (SOTA) methods on different types of medical image datasets. Furthermore, the proposed UMIAD-EGMF achieves better performance in different edge information extraction and edge aggregation methods.
To address the blurred edges and varying scales of abnormal regions, existing methods typically utilize edge detection methods and generative-based approaches for edge information mining and region information reconstruction. These studies are reviewed in the following subsections.
Edge detection methods
There are numerous edge-based detection algorithms for medical images, each employing different strategies to identify regions with significant gray level changes, thereby effectively segmenting abnormal areas. Some of these methods utilize classical edge detection techniques, such as Sobel, Laplacian of Gaussian (LoG), and Canny operators. The Sobel operator detects edges by calculating the gradient of grayscale variations in the image. The LoG operator combines Gaussian smoothing with second-order derivatives to detect edges efficiently. The Canny edge detector accurately identifies edges through a multi-step process that includes Gaussian smoothing, gradient calculation, non-maximum suppression, and hysteresis thresholding. Although traditional methods are typically computationally simple, they are susceptible to noise and often struggle to accurately identify complex edges and details.
Modern deep learning algorithms effectively address the limitations of traditional edge detection methods. For example, Shao et al. [6] used a Canny operator-based edge detector as the training output for an ANN. Compared to conventional operators, such as Sobel and Canny, the neural network-based edge detection method captures more comprehensive edge information in magnetic resonance imaging (MRI) medical images and processes it almost three times faster than traditional methods. Several proposed methods based on U-Net [2, 3, 12] can extract richer edge features. A novel CNN architecture [2] integrates an inception-res module and a densely connected convolutional module into a U-Net framework. The connectivity and skip connections are modified to incorporate an EGM that integrates features from all layers [3]. The boundary-aware context neural network [13] follows the classic encoder-decoder structure, which embeds a pyramid edge feature extraction module, mini-multi-task learning module, interactive attention layer, and cross-feature fusion module for object segmentation. The cross-level feature aggregation network [12] introduced a boundary prediction network to generate boundary-conscious features, enhance hierarchical features, and produce finer segmentation maps. The multi-scale low-level feature enhancement U-Net (MLFEU-net) [14] has been presented as a multi-scale low-level feature enhancement U-Net consisting of a U-Net structure, a low-level feature enhancement block, and a parallel multi-scale feature fusion module. Specifically, the low-level feature enhancement block enhances detailed information during shallow downsampling and further feature selection. Meanwhile, in the neck of the U-Net, the parallel multi-scale feature fusion module uses dilated convolutions of different scales to provide receptive fields of varying sizes and efficiently merges filtered shallow and deep features. However, the parallel multi-scale feature fusion module relies on fixed dilation rates and cannot dynamically adapt to extreme lesion scales or establish edge-semantic consistency when fusing shallow and deep features. Recently, MedViTV2 [15] proposed a novel hybrid architecture that integrates kolmogorov-arnold network (KAN) layers into transformers for the first time. MedViTV2 introduces an efficient KAN block that reduces the computational load while improving the accuracy of the original MedViT [16]. To counteract the fragility of MedViT when scaled up, MedViTV2 also uses an enhanced dilated neighborhood attention mechanism. This is an adaptation of the efficient fused dot-product attention kernel, which can capture global context and expanding receptive fields. This allows the model to be scaled effectively and addresses the issue of feature collapse. Furthermore, a hierarchical hybrid strategy is introduced to efficiently stack local feature perception and global feature perception blocks and balance local and global feature perceptions to enhance performance. However, MedViTV2 is primarily designed for image-level classification and lacks explicit optimization for pixel-level anomaly segmentation. It does not incorporate edge guidance to handle blurred lesion boundaries or MFF to capture subtle anomalies (for example, micro-hemorrhages in head computed tomography (CT) scans). This makes it less suitable for clinical anomaly detection tasks that require the precise delineation of lesions.
In summary, existing methods, including U-Net variants and hybrid transformer-KAN models, such as MedViTV2, neglect the hierarchical, progressive interaction of edge information at different scales when describing the commonalities of different types of medical images. Most classification-oriented hybrid models fail to establish effective connections between the edge details and semantic features, resulting in poor performance in pixel-level anomaly localization. The proposed UMIAD-EGMF model attempts to hierarchically model the interaction of low-level appearance features via the EGM while progressively integrating high-level semantic features via the EAM, with the aim of improving anomaly detection and segmentation accuracy.
Generative-based approaches
Generative-based approaches include two types of detection and segmentation methods: generative adversarial network (GAN)-based and flow model methods.
GAN-based methods can be used to reconstruct the regional information from medical images. The anomaly GAN (AnoGAN) [17] first proposed the application of GANs for anomaly detection by mitigating their inherent instability through a feature matching mechanism. However, AnoGAN suffers from slow training and has a limited capability in reconstructing complex images. Unlike previous patch-based approaches, the anomaly variational autoencoder GAN (AnoVAEGAN) [18] addresses these limitations by employing a deep spatial autoencoder to capture the comprehensive global features of normal images, thereby enhancing the realism of the reconstructed samples. The fast AnoGAN [19] leverages the wasserstein-GAN [20] framework to significantly improve training speed while capturing smoother representations. MADGAN [21], an unsupervised multi-stage anomaly detection method, was the first method for reconstructing multiple adjacent brain MRI slices using a GAN network to detect anomalies in multi-sequence MRI images. Building on these advancements, the memory-augmented multi-level cross-attentional masked autoencoder [22] integrated a transformer-based approach with masked autoencoders for accurate anomaly detection.
Flow-based methods not only explicitly learn data distributions but also leverage their reversibility to transform data without information loss and achieve high-quality data reconstruction and generation. For example, FastFlow [23] locates features in normal regions of an input image at the center of the distribution, whereas features in abnormal regions deviate significantly from the center. Similarly, DifferNet [24] implemented a normalizing flow for image-based anomaly detection to handle variations in the defect size via multi-scale feature maps. The conditional normalizing flow framework adopted for anomaly detection [25] constructs a multi-scale feature pyramid by incorporating feature maps with varying efficiency levels and receptive fields. In addition, conditional normalizing flow framework adopted for anomaly detection extends the normalization process to pixel-by-pixel anomaly localization by estimating the likelihood of the feature vector at each location. However, it treats each feature vector independently, ignoring the spatial context information. This lack of contextual awareness impairs global perception, resulting in incoherent localization outcomes. To address this issue, the multi-scale flow-based framework for unsupervised anomaly detection (MSFlow) [26] first extracts feature maps from the initial three stages, down samples them via average pooling, and then feeds them into a normalizing flow. The extracted multi-scale feature maps are then fed into their respective asymmetric parallel flows, with information from different scales and perceptual fields exchanged through a fusion flow. Finally, multi-scale likelihood maps are aggregated using various strategies tailored to different anomaly detection and localization tasks. In addition, medical masked autoencoder (MedMAE) [27] optimizes self-supervised learning for medical images but lacks explicit distribution modeling of flows.
Most current unsupervised anomaly detection algorithms based on flow models are primarily used for industrial defect detection. Drawing inspiration from MSFlow, this study extends the fusion flow model to unsupervised medical image anomaly detection. The proposed UMIAD-EGMF effectively handles lesions of varying and unpredictable sizes in medical images by incorporating edge information mining to enhance the detection and segmentation of anomalous regions. Furthermore, existing hybrid architectures, including transformer-based models such as MedViTV2 and flow-based models such as CFlow, suffer from feature isolation. CFlow uses canny edges as auxiliary inputs but does not fuse them with flow-based semantic features. MedViTV2 focuses on balancing global and local features for classification without linking edge details to segmentation tasks. This leads to inconsistent edge and anomaly localizations. In contrast, the EGM-EAM-MFF pipeline of UMIAD-EGMF realizes progressive integration: the EGM extracts context-aware edges, the EAM embeds these edges into the semantic space of the flow, and MFF ensures that the edge information propagates across flow scales. This end-to-end integration avoids the two-stream separation problem of existing hybrids, filling the gap between classification-oriented hybrid models and the need for clinical anomaly segmentation.
Methods
Network architecture of UMIAD-EGMF
The proposed model employs a codec architecture. The encoder was designed as a CNN feature extractor with multi-scale pyramidal pooling. Initially, four layers of multi-scale features, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_1, F_2, F_3,$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_4$$\end{document} , are extracted from the input image using this encoder. The EGM extracts edge information from the extracted features. The shallow features \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_1$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_2$$\end{document} interact with the different scale features for edge information extraction via the EGM, and the edge information can be incorporated with the high-level feature \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_4$$\end{document} through the EAM. Subsequently, the feature vector \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$z^k$$\end{document} from the multi-scale pyramid pooling operation and the conditional vector \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c^k$$\end{document} from the position encoding are fed into \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k$$\end{document} independent decoders \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$g^k(z^k;c^k;\theta ^k)$$\end{document} for the multi-scale flow mapping process. This process integrates multi-scale normalized streams and a multi-scale stream fusion module to enhance their information exchange. In the final anomaly detection stage, the model uses an additive aggregation strategy for pixel-level anomaly localization and a multiplicative aggregation strategy for image-level anomaly detection. The anomaly score of the image is determined by calculating the average of the top \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N$$\end{document} scores in the image-level anomaly score map. Figure 2 shows the network architecture of the proposed UMIAD-EGMF.Fig. 2. Network architecture of the proposed UMIAD-EGMF model. \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c^k$$\end{document} refers to the condition vector generated through position encoding, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$z^k$$\end{document} represents the feature vector obtained through multi-scale pyramid pooling operation, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$g^k (z^k;c^k;\theta ^k)$$\end{document} is the k-th independent decoder model, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta ^k$$\end{document} represents the parameter of the decoder, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$y^k$$\end{document} indicates the output of the k-th independent decoder model, and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$L^k$$\end{document} refers to the loss of the k-th independent decoder model
EGM
This subsection introduces the EGM for extracting edge information from lesion regions in medical images. The proposed UMIAD-EGMF uses ResNet18 [28] as the backbone network. ResNet18 can generate different layers with four features: \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_1, F_2, F_3$$\end{document} , and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_4$$\end{document} . Figure 3 shows the heat maps of these features at different layers in the various images. The low-layer features ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_1$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_2$$\end{document} ) primarily capture local information, such as edges and textures, whereas the high-layer features ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_3$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_4$$\end{document} ) focus on all semantic information of the image. To enhance edge information extraction, the EGM can not only fuse these features ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_1$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_2$$\end{document} ) from the first two layers but also mine the edge information to improve the detection accuracy of abnormal regions. Figure 4 illustrates a schematic of the EGM.Fig. 3. Heat map of different layer features \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_1, F_2, F_3 ,F_4$$\end{document} extracted from the ResNet18 backbone network in the various column images
Fig. 4. Schematic of the edge guidance module. \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_1$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_2$$\end{document} are the low-level features
Specifically, the EGM first downsamples the shallow feature \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_1$$\end{document} , which captures fine textures, such as small lesion edges, to match the size of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_2$$\end{document} , which captures coarser local structures. The downsamples feature of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_1$$\end{document} is denoted as \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_1'$$\end{document} . To ensure that the edge detection covers both fine and coarse details, the key to blurring ultrasound lesion images, two sets of initial edge attention weights ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$E_1$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$E_2$$\end{document} ) are generated by swapping the order of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_1'$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_2$$\end{document} during concatenation as follows:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} E_1 = \text {Conv}^{1}_{1 \times 1} \left(\text {Concat}\left(F_1', F_2\right)\right) \end{aligned}$$\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} E_2 = \text {Conv}^{2}_{1 \times 1} (\text {Concat}(F_2, F_1') \end{aligned}$$\end{document}Equation 1 prioritizes \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_1'$$\end{document} , the downsampled fine texture, by placing it first in the concatenation, making \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$E_1$$\end{document} sensitive to small edge details, such as the boundary of a 2 mm breast cyst. Equation 2 prioritizes \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_2$$\end{document} , the coarse structure, by reversing the order, making \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$E_2$$\end{document} focus on larger edge contours, such as the outer edge of a breast tumor. Different gated convolutions \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text {Conv}_{1 \times 1}$$\end{document} focus on different parts of the feature map by learning the weights for each position in the input feature map, allowing the model to adaptively fuse information from different feature maps. After convolution, the features are processed through cascading operations and softmax processing as follows:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} E_{\text {att}} = \text {softmax}(\text {Concat}(E_1, E_2))\end{aligned}$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$E_{\text {att}}$$\end{document} , which balances the attention between small edge details and large edge contours, is the interaction weight between different weight distributions. After normalization using the softmax activation function, the interaction weight is between 0–1. Furthermore, the normalized attention vector is split into \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$E_1'$$\end{document} , the refined attention weight for small edge details, and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$E_2'$$\end{document} , the refined attention weight for large edge contours, for different low-layer feature fusions.
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} E_1', E_2' = \text {split}(E_{\text {att}}) \end{aligned}$$\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} F_{\text {\textit{edge\_fusion}}} = E_1' \otimes F_1' \oplus E_2' \otimes F_2 \end{aligned}$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text {split}(\cdot )$$\end{document} denotes the split operation, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\otimes$$\end{document} represents pixel-wise multiplication after dimension broadcasting, and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\oplus$$\end{document} represents pixel-wise addition. \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{\text {\textit{edge\_fusion}}}$$\end{document} , the feature that integrates small details and large contours, denotes the edge enhancement fusion. This fusion enables the model to focus on important parts of the feature map, thereby improving the edge information for medical image anomaly detection.
Taking a breast ultrasound images dataset (BUSI) of benign lesion image (as shown in Visualization subsection) as an example, the lesion edge is blurred by ultrasound speckle noise, making it difficult for traditional methods to distinguish the lesion from the surrounding fat tissue. The EGM fuses low-level features \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_1$$\end{document} (texture details) and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_2$$\end{document} (local edges) via interaction attention (Eq. 5), which assigns higher attention weights to pixel regions with gray-level changes (lesion boundaries), while suppressing noise regions. This results in \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{\text {\textit{edge\_fusion}}}$$\end{document} . The EAM further aggregates this edge information with a high-level feature \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_4$$\end{document} , correcting the mis-localization of CFlow [25] which marks three small false-positive regions and reduces the false-positive rate.
EAM
The EGM effectively extracts edge information. However, aggregating this information and high-level semantic information within the network remains a challenge because of noise and redundancy. To optimize the use of edge-aware features in resolving these issues, the EAM is introduced to further extract information regarding edge fusion and capture features that are aggregated at different scales. The incorporation of edge information in conjunction with the channel attention mechanism facilitates the capacity of the model to discern anomalous patterns in unlabeled data during unsupervised learning, particularly in regions where the edges are imperceptible. Consequently, the edge fusion feature \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{\text {\textit{edge\_fusion}}}$$\end{document} is fed from the EGM into the channel attention block (CAB) [29] to obtain the edge enhancement feature \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{\text {\textit{edge\_fusion}}}^{\text {\textit{att}}}$$\end{document} . It optimizes the importance of different feature channels, assigning higher weights to channels containing lesion edges and lower weights to channels dominated by noise, as follows:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} F_{\text {\textit{edge\_fusion}}}^{\text {\textit{att}}} = \text {CAB} (F_{\textit{edge\_fusion}}) \end{aligned}$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{\text {\textit{edge\_fusion}}}^{\text {\textit{att}}}$$\end{document} is the edge feature enhanced by the channel attention mechanism and CAB ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\cdot$$\end{document} ) is the channel attention operation. The enhanced edge feature \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{\text {\textit{edge\_fusion}}}^{\text {\textit{att}}}$$\end{document} is then merged with the deeper feature \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_4$$\end{document} in the network and processed through a 3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 3 convolution to obtain the feature \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{\text {\textit{edge}}}^{\text {\textit{cat}}}$$\end{document} . The feature contains both specific lesion edge details and global lesion attributes.
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} F_{\text {\textit{edge}}}^{\text {\textit{cat}}} = \text {Conv}_{3 \times 3} \left(\text {concat}\left(F_{\text {\textit{edge\_fusion}}}^{\text {\textit{att}}}, F_4\right)\right) \end{aligned}$$\end{document}The ability of the model to capture edge details is enhanced by fusing information from different feature layers in the spatial dimension. Feature \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{\text {\textit{edge}}}^{\text {\textit{cat}}}$$\end{document} is then fed into the global average pooling (GAP) layer. The output of the GAP layer enhances the cascaded features, improving the perception of edge information by the model, and resulting in an aggregated feature \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{\text {\textit{edge}}}^{\text {\textit{agg}}}$$\end{document} . This is similar to an enhanced version of the lesion edge map that preserves the subtle edge details of the lesion while integrating its overall semantic information.
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} F_{\text {\textit{edge}}}^{\text {\textit{agg}}} = \text {GAP} (F_{\text {\textit{edge}}}^{\text {\textit{cat}}}) \otimes F_{\text {\textit{edge}}}^{\text {\textit{cat}}} \oplus F_4 \end{aligned}$$\end{document}where GAP ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\cdot$$\end{document} ) represents the GAP operation. In addition, to preserve high-layer feature information, residual connections can be used to fuse high-level features in the current hierarchy, thereby enhancing the expressiveness and learning efficiency of the network. A schematic of the EAM is presented in Fig. 5.Fig. 5. Schematic of the EAM. \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_4$$\end{document} is the high-level feature, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{\text {\textit{edge\_fusion}}}$$\end{document} refers to the edge fusion feature extracted by the EGM, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{\text {\textit{edge\_fusion}}}^{\text {\textit{att}}}$$\end{document} represents the edge enhancement feature obtained by \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{\text {\textit{edge\_fusion}}}$$\end{document} through the CAB, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{\text {\textit{edge}}}^{\text {\textit{cat}}}$$\end{document} indicates the combination of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_4$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{\text {\textit{edge\_fusion}}}^{\text {\textit{att}}}$$\end{document} , and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_{\text {\textit{edge}}}^{\text {\textit{agg}}}$$\end{document} refers to the final aggregated feature
MFF module
Medical images typically contain complex biological tissue structures exhibiting different details and features at various scales. The effective acquisition and fusion of these multi-scale features are crucial for improving the accuracy of medical image anomaly detection. Existing methods utilize multi-scale strategies for operating on features at each scale in isolation and fail to fully exploit multi-scale interactions. To address this issue, this study introduces an MFF module that captures anomalous features ranging from subtle to significant by pooling and averaging feature maps at different scales. This module provides rich and multi-dimensional information to the model, thereby significantly enhancing its accuracy and sensitivity in identifying and localizing anomalous structures of various sizes. A schematic of the MFF module is shown in Fig. 6.Fig. 6. Schematic of the multi-scale flow fusion module. \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$y^k$$\end{document} ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k=1,2,3,4$$\end{document} ) indicates the output of the k-th independent decoder model for mapping flow features
First, an average pooling operation is performed on the multi-scale features \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$y^1, y^2,$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$y^3$$\end{document} , retaining the fine details and supplementing them with deep semantics, from the first three levels to match the dimensions of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$y^4$$\end{document} , retaining the global semantics and supplementing them with shallow details. The pooled features \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$y_{\text {AP}}^1, y_{\text {AP}}^2,$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$y_{\text {AP}}^3$$\end{document} are then concatenated with \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$y^4$$\end{document} , combining the feature information from different levels and enabling the model to comprehensively sense the image structure at various scales. Next, the concatenated features are fused using a 3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 3 convolution operation to better capture the correlations between the features. Finally, the first three levels of features are upsampled to restore their original sizes, and the original features are added to the fused features. This process retains information from the original features to further improve the ability of the model to characterize the features.
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \left\{y_{\text {f}}^1, y_{\text {f}}^2, y_{\text {f}}^3, y_{\text {f}}^4\right\}= f_{\text {f}} \left(\text {AP}\left(y^1\right),\mathrm{AP}\left(y^2\right),\mathrm{AP}\left(y^3\right), y^4\right) \end{aligned}$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$y_{\text {f}}^k$$\end{document} denotes the k-th layer features of the MFF module, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f_{\text {f}}$$\end{document} represents the MFF module, and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text {AP}(\cdot )$$\end{document} denotes the average pooling operation. \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$y^k$$\end{document} ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k=1,2,3,4$$\end{document} ) are the features after multi-scale normalizing flow decoding as follows:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} y^k = \text {g}^k \left( z^k; c^k;\theta ^k \right) \end{aligned}$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c^k$$\end{document} denotes the condition vector generated through position encoding, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$z^k$$\end{document} denotes the feature vector obtained through the multi-scale pyramid pooling operation, and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta ^k$$\end{document} denotes the decoder parameter. \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$g^k (z^k;c^k;\theta ^k)$$\end{document} converts the distribution of normal tissue features into a standard distribution, making it easier to identify lesion features that deviate from this distribution. It is the k-th independent decoder from the normalizing flow [30], which can transform the probability density through a series of reversible and differentiable mappings. In this manner, different feature levels in an image can be captured, thereby locating abnormal areas at different scales in medical images. In the proposed UMIAD-EGMF, multiple affine coupling blocks can be stacked to construct sufficiently complex distributions. \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$g^k (z^k;c^k;\theta ^k)$$\end{document} is composed of eight affine coupling blocks \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_n(z^k;c^k;\theta ^k)$$\end{document} ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=1,2,3,4,5,6,7,8$$\end{document} ) which handle the irregular feature distribution of blurred ultrasound lesions. The affine coupling block proposed by ref. [25] can be expressed as follows:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} g^k(z^k;c^k;\theta ^k)= h_1(z^k;c^k;\theta ^k)\circ h_2(z^k;c^k;\theta ^k)\circ \cdots \circ h_8(z^k;c^k;\theta ^k) \end{aligned}$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\circ$$\end{document} denotes the cascading operation.
UMIAD-EGMF uses the maximum likelihood of the targets for training, which is equivalent to minimizing the loss \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {L}^k$$\end{document} in the k-th layer. According to CFlow [25], loss \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathcal {L}^k$$\end{document} can be reformulated as follows:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} \mathcal {L}^k \approx D_{KL} \left[ p_{Z} (\boldsymbol{z}^k) \Vert \hat{p}_{Z} (\boldsymbol{z}^k| \boldsymbol{c}^k, \boldsymbol{\theta }^k) \right] \frac{1}{N} \sum \limits _{i=1}^{N} \left[ \frac{\Vert \boldsymbol{u}_i^k\Vert _2^2 }{2} - \log \left| \det \boldsymbol{J}_i^k \right| \right] + \text {const} \end{aligned}$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p_{Z} (\boldsymbol{z}^k)$$\end{document} represents the true probability density of the feature \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\boldsymbol{z}^k$$\end{document} in Z feature space, and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hat{p}_{Z} (\boldsymbol{z}^k| \boldsymbol{c}^k, \boldsymbol{\theta }^k)$$\end{document} represents the prediction probability density of the feature \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\boldsymbol{z}^k$$\end{document} under the condition of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\boldsymbol{c}^k$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\boldsymbol{\theta }^k$$\end{document} in Z feature space. The random variable \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\boldsymbol{u}_i^k= g^{-1}(z_i^k, c_i^k, \theta ^k)$$\end{document} is Jacobian \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$J_i^k = \nabla _{z^k} g^{-1}(z_i^k, c_i^k, \theta ^k)$$\end{document} .
Computational complexity
This subsection analyzes and explains the computational complexity of the proposed UMIAD-EGMF, which is mainly divided into four parts: EGM, EAM, normalizing flow, and MFF. In these parts, it is assumed that the maximum number of input channels, output channels, and height and width of features in convolutional operations are \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{in}$$\end{document} , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_{out}$$\end{document} , H, and W respectively, and that the maximum dimension of the features in multiplicative operations is N.
First, the computational complexity of the EGM is mainly borne by two \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1 \times 1$$\end{document} convolution kernels and two multiplication operations. It is expressed as \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(2 \times C_{in}\times C_{out} \times H \times W+ 2 \times N^3)$$\end{document} .
Second, the computational complexity of the EAM involves the CAB module with a maximum feature dimension K, a \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$3 \times 3$$\end{document} convolution kernel, and a multiplication operation. It is expressed as \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O( K^3 + 3 \times 3 \times C_{in}\times C_{out} \times H \times W + N^3 )$$\end{document} .
Third, the eight decoupling blocks contribute to the computational complexity of the normalizing flow decoder module. Each block involves two sets of double \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$3 \times 3$$\end{document} convolution kernels and multiplication operations. Therefore, the computational complexity of this module is \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O((2 \times 3 \times 3 \times C_{in}\times C_{out} \times H \times W + N^2)\times 2 \times 8)$$\end{document} .
Finally, the MFF module mainly includes two \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$3 \times 3$$\end{document} convolution kernels. Thus, the computational complexity of the MFF module is \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(2 \times 3 \times 3 \times C_{in}\times C_{out} \times H \times W )$$\end{document} .
Therefore, the total computational complexity of the proposed UMIAD-EGMF is as follows:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} & O ((2 \times C_{in}\times C_{out} \times H \times W+ 2 \times N^3)+ (K^3 + 3 \times 3 \times C_{in}\times C_{out} \times H \times W + N^3)+((2 \times 3 \times 3 \times C_{in}\times C_{out} \times H \times W + N^2)\times 2 \times 8)+ (2 \times 3 \times 3 \times C_{in}\times C_{out} \times H \times W) ) \nonumber \\ & = O(K^3+317\times C_{in}\times C_{out} \times H \times W+ 16\times N^2+ 3\times N^3) \end{aligned}$$\end{document}Results
Datasets
In unsupervised learning, only normal images are required as input during training. Medical image datasets included normal and abnormal images and lesion region mask data for abnormal images; however, these data are often scarce in public medical datasets. Therefore, in addition to validating the model on the public BUSI dataset [31], this also obtained brain MRI [32] and head CT datasets [33] to validate the model. All three datasets are publicly available and cover different modalities (ultrasound, MRI, and CT), common diseases (breast lesions, gliomas, and brain hemorrhages), and clinical acquisition protocols. Their diversity validates the generalization of UMIAD-EGMF across real-world medical imaging scenarios.Fig. 7. Example of a medical anomaly detection dataset. The first row shows examples of normal samples (framed in blue), while the second row shows examples of abnormal samples (framed in red). The first two columns are from the breast ultrasound images dataset (BUSI), the following four columns are from the brain magnetic resonance imaging (MRI) dataset, and the last six columns are from the head computed tomography (CT) dataset
The BUSI dataset contained 746 breast ultrasound images from 600 patients aged 25–75 years, with an average image size of 500 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 500 pixels. This dataset included 437 benign, 210 malignant, and 99 normal images. Healthy images in the brain MRI dataset were obtained from the national institutes of health human connectome project and collected from more than 1,200 healthy volunteers. Each volunteer provided four types of images: structural brain images (T1w and T2w), resting-state functional MRI, and task-state functional MRI. Diseased images were sourced from the brain tumor segmentation dataset [32], which contained the MRI scans of 351 subjects, with each sample including T1, T1ce, T2, and FLAIR MRI images. For this experiment, the brain MRI dataset comprised of 88 normal and 154 diseased images. The head CT dataset included CT slices of the normal brain and hemorrhage images. Ninety normal and 100 diseased images were used in this experiment. The statistics of the datasets are listed in Table 1. Examples of the three datasets are shown in Fig. 7. Table 1. Statistics of the medical anomaly detection datasetsDatasetTrainTest (normal)Test (anomaly)Breast ultrasound images dataset9934647Brain magnetic resonance imaging dataset8810154Head computed tomography dataset9010100Total27754901
Evaluation metrics
MAD involves abnormality image classification (image level) and abnormal region segmentation (pixel level). Therefore, this study used both image- and pixel-level evaluation metrics, including image- and pixel-level area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve for object localization (AUPRO), recall, and precision. The image-level AUROC assesses the binary classification performance of the model at the image-level, whereas the pixel-level AUROC evaluates the ability of the model to localize lesion regions in medical images. The pixel-level AUROC tends to favor larger abnormalities, whereas the AUPRO metric ensures that both large and small abnormalities are equally important for localization. Pixel-level recall represents the ratio of truly abnormal pixels, such as lesion pixels, that are accurately identified, which is the key to covering small or scattered lesions. Pixel-level precision indicates the ratio of pixels predicted as abnormal to those that are truly abnormal, ensuring accurate lesion boundary localization without mislabeling adjacent normal tissues. Additionally, recall and precision typically exhibit a trade-off, where improving one metric may compromise the other, and their balance should be adjusted based on specific clinical tasks.
Implementation details
Taking an image with an input size of 256 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 256 as an example, Table 2 lists the main configuration parameters of the network structure, including the input and output sizes of the different network layers. The network contained three parts: the feature extraction phase, multi-scale stream decoding phase, and fusion flow phase. In the feature extraction phase, ResNet18 was used as the backbone network and initialized with pre-trained weights from the ImageNet dataset. The training parameters were as follows: the number of training iterations was 100, the Adam optimizer was used with a learning rate of 0.0002, batch size was 32, and a normalizing flow model was generated using the FrEIA library [34]. Table 2. Network details parameter sheet (Input batch size: 32 for all layers)Network layerInput sizeOutput sizeKey parameterFeature extractorInput_size(32, 3, 256, 256)(32, 3, 256, 256)Conv2d + BN + ReLu(32, 3, 256, 256)(32, 64, 64, 64)Output_channels = 64, kernel_size = 7 × 7Maxpool(32, 64, 64, 64)(32, 64, 32, 32)Output_channels = 64, kernel_size = 3 × 3Layer1(32, 64, 32, 32)(32, 64, 32, 32)Output_channels = 64, kernel_size = 3 × 3Layer2(32, 64, 32, 32)(32, 128, 16, 16)Output_channels = 128, kernel_size = 3 × 3Layer3(32, 128, 16, 16)(32, 256, 8, 8)Output_channels = 256, kernel_size = 3 × 3Layer4(32, 256, 8, 8)(32, 512, 4, 4)Output_channels = 512, kernel_size = 3 × 3Multi_flowDecoder layer1(32, 64, 32, 32)(32, 64, 32, 32)Decoder layer2(32, 128, 16, 16)(32, 128, 16, 16)Decoder layer3(32, 256, 8, 8)(32, 256, 8, 8)Decoder layer4(32, 512, 4, 4)(32, 512, 4, 4)Fusion_flowAvgPool2d layer1(32, 64, 32, 32)(32, 64, 4, 4)Kernel_size = 3, stride = 8, padding = 0AvgPool2d layer2(32, 128, 16, 16)(32, 128, 4, 4)Kernel_size = 3, stride = 4, padding = 0AvgPool2d layer3(32, 256, 8, 8)(32, 256, 4, 4)Kernel_siz e= 3, stride = 2, padding = 0Layer4(32, 512, 4, 4)(32, 512, 4, 4)(32, 64, 4, 4)Concat(32, 128, 4, 4)(32, 960, 4, 4)(32, 256, 4, 4)(32, 512, 4, 4)Conv2d(32, 960, 4, 4)(32, 240, 4, 4)Output_channels = 240, kernel_size = 1 × 1Conv2d(32, 240, 4, 4)(32, 960, 4, 4)Output_channels = 960, kernel_size = 1 × 1(32, 64, 4, 4)Split(32, 960, 4, 4)(32, 128, 4, 4)(32, 256, 4, 4)(32, 512, 4, 4)UpSample layer1(32, 64, 4, 4)(32, 64, 32, 32)UpSample layer2(32, 128, 4, 4)(32, 128, 16, 16)UpSample layer3(32, 256, 4, 4)(32, 256, 8, 8)Layer4(32, 512, 4, 4)(32, 512, 4, 4)
Comparison with the SOTA methods
This subsection compares the proposed UMIAD-EGMF with nine methods (supplemented with MedViTV2), including CFlow [25], u-shaped normalizing flow (UFlow) [35], MSFlow [26], attention-augmented differentiable top-k feature adaptation (ADFA) [36], informative knowledge distillation (IKD) [1], reverse distillation for anomaly detection (RD4AD) [37], MedMAE [27], MLFEU-net [14], and MedViTV2 [15]. These methods fall into three categories: (1) CFlow, UFlow, and MSFlow, which are based on the normalizing flow framework; (2) ADFA, IKD, and RD4AD, which rely on the knowledge distillation paradigm; and (3) MedMAE, MLFEU-net, and MedViTV2, where MedMAE is based on VAE, MLFEU-net is based on U-Net, and MedViTV2 is a transformer-based method. Specifically, CFlow treats features at each scale independently but ignores the correlations between different scales. UFlow incorporates interactive features via a u-shaped network but loses detailed information owing to different sampling strategies. Although MSFlow uses an asymmetric parallel flow architecture and multi-scale information fusion to enhance multi-scale perception, it neglects the importance of edge information for segmenting abnormal regions and fails to adequately represent refined features. ADFA is an unsupervised medical image anomaly detection method that improves the separability of normal and abnormal medical image features through feature adaptation. However, this method is only effective for images with high heterogeneity and is unsuitable for detecting blurred normal and abnormal images with high homogeneity. Both IKD and RD4AD utilize knowledge distillation frameworks to discriminate cross-domain data but still require training support from cross-domain knowledge. To address the above limitations of existing methods, the proposed UMIAD-EGMF integrates edge information into a multi-scale flow model through an unsupervised learning approach, enabling accurate detection of anomalous images and regions. All comparative experiments used images of size 256 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 256. Table 3. Performance (%) comparison of the proposed UMIAD-EGMF with the SOTA methods on breast ultrasound images, brain magnetic resonance imaging, and head computed tomography datasetsMethodDET.AUROCSeg.AUROCSeg.AUPRORecallPrecisionBUSI datasetCFlow (2022)[25]92.4374.7659.1675.8282.35UFlow (2024)[35]65.8472.8132.7370.1578.62MSFlow (2024)[26]85.8976.7862.1977.5384.18ADFA (2023)[36]95.4073.2060.4778.2185.21IKD (2022)[1]94.6363.1944.6072.3481.79RD4AD (2022)[37]87.2068.0051.3073.8882.65MedMAE (2025)[27]93.8574.6261.5377.5482.12MLFEU-net (2025)[14]94.9279.2163.3579.3483.53MedViTV2 (2025)[15]94.7076.5061.9077.8083.20UMIAD-EGMF95.8377.8163.3678.9186.79Brain MRI datasetCFlow (2022)[25]87.1884.8455.4882.4585.73UFlow (2024)[35]79.6870.6240.3173.2880.19MSFlow (2024)[26]85.8978.7858.1980.6783.92ADFA (2023)[36]94.6079.4055.3185.2486.87IKD (2022)[1]89.2283.3250.5483.1584.62RD4AD (2022)[37]90.7082.7057.5082.8385.19MedMAE (2025)[27]88.9181.7757.3083.2185.45MLFEU-net (2025)[14]90.5087.5059.1084.3084.10MedViTV2 (2025)[15]90.2083.8057.8083.5084.80UMIAD-EGMF90.8386.3459.4084.6787.52Head CT datasetCFlow (2022)[25]80.7084.7757.4981.6983.58UFlow (2024)[35]63.5071.4031.3671.8579.43MSFlow (2024)[26]73.7080.9946.3679.8282.76ADFA (2023)[36]85.7083.2060.4884.1984.35IKD (2022)[1]83.1087.3761.3185.4783.91RD4AD (2022)[37]74.5087.4061.8085.2283.68MedMAE (2025)[27]80.7082.4259.8782.5681.23MLFEU-net (2025)[14]82.5087.60****62.6583.5682.30MedViTV2 (2025)[15]82.2084.2060.9082.9082.70UMIAD-EGMF82.9086.7662.2483.9883.12DET.AUROC denotes the AUROC of detection at the image-level. Seg.AUROC and Seg.AUPRO represent the AUROC and AUPRO of segmentation at the pixel-level, respectively. Bold and underlined values indicate the best and the second-best performance
Table 3 lists the experimental results of the proposed UMIAD-EGMF compared with those of the eight methods across the three datasets. The results demonstrate that UMIAD-EGMF achieved superior performance in most evaluation metrics, with particular advantages in pixel-level anomaly localization, which is consistent with its design goal of multi-scale anomaly detection. A detailed analysis of the results obtained using the datasets is presented as follows.
On the BUSI dataset, UMIAD-EGMF outperformed all methods for DET.AUROC (95.83%) and achieved the highest Seg.AUPRO (63.36%). For DET.AUROC, it exceeded ADFA (the second-ranked method) by 0.43% and surpassed MedViTV2 (94.70%) by 1.13%, demonstrating stronger image-level anomaly identification. For pixel-level segmentation, its Seg.AUPRO (63.36%) was 1.46% higher than MedViTV2 (61.90%), and its precision (86.79%, the highest) effectively balanced missed detections and false alarms. This advantage stems from the multi-scale pooling and edge fusion module of UMIAD-EGMF, which accurately analyzes the boundaries of breast lesions. This is crucial for distinguishing small/irregular ultrasound abnormalities, whereas MedViTV2 lacks this ability because of its classification-oriented design.
For the brain MRI dataset, UMIAD-EGMF balanced image-level detection and pixel-level segmentation, maintaining the leading segmentation performance. Although its DET.AUROC (90.83%) was 3.77% lower than that of ADFA (94.60%), it outperformed MedViTV2 (90.20%) by 0.63% for DET.AUROC and achieved a 1.60% higher Seg.AUPRO (59.40% vs 57.80% for MedViTV2). The DET.AUROC advantage of ADFA comes from its top-k operator and pre-trained network generalization, whereas the multi-scale decoder and flow fusion of UMIAD-EGMF enhance local anomaly perception, making it more effective for fine-grained brain lesion localization. Its highest precision (87.52%) further reduced false positives in normal brain tissue, which is challenging for MedViTV2 owing to its lack of edge-semantic consistency.
On the head CT dataset, UMIAD-EGMF delivered competitive segmentation performance. Its Seg.AUPRO (62.24%) ranked second only to MLFEU-net (62.65%) and surpassed MedViTV2 (60.90%) by 1.34%. Its recall (83.98%) and precision (83.12%) were among the top three metrics, ensuring reliable localization. While IKD (87.37%) and RD4AD (87.40%) had slightly higher Seg.AUROC via distillation-based supervisory signals, UMIAD-EGMF outperformed MedViTV2 in all pixel-level metrics, confirming its adaptability to high-density tissues and complex anatomical structures in Head CT. This advantage stems from the EGM and MFF of UMIAD-EGMF, which enhance the perception of lesion edges by the model.
Table 4 lists the inference time and number of model parameters of the proposed UMIAD-EGMF and SOTA methods on the BUSI dataset. As the input image size was uniformly set to 256 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 256, selecting the BUSI dataset for this experiment yielded representative results, where consistent input conditions ensured the comparability of the inference time and parameter metrics across the methods.
UMIAD-EGMF achieved an inference time of 55.2 ms, which falls within the real-time threshold for clinical medical imaging (less than 100 ms) and thus satisfies the real-time requirements of MAD. The inference time of UMIAD-EGMF was only 2.4 ms longer than that of MedMAE (52.8 ms); this increment is negligible for most clinical scenarios and is accompanied by significant improvements in detection performance. The inference time of UMIAD-EGMF was 1.8 ms shorter than that of MedViTV2, which indicates that the proposed method is superior to current transformer-based methods in terms of detection speed and accuracy. In terms of model parameters, UMIAD-EGMF maintained a moderate parameter count compared to SOTA methods, avoiding excessive computational overheads that could hinder deployment on resource-constrained clinical devices. This confirms the feasibility of UMIAD-EGMF for real-time clinical use and clarifies the rational trade-off between model complexity, inference efficiency, and detection performance. Table 4. Inference time per image and number of model parameters comparison of different methods on the breast ultrasound images datasetMethodInference time per image (ms)Number of model parametersCFlow (2022) [25]42.313 MMSFlow(2024) [26]51.715 MADFA(2023) [36]47.519 MMLFEU-net(2025) [14]54.712.3 MMedMAE(2025) [27]52.840 MMedViTV2(2025) [15]57.029.6 MUMIAD-EGMF55.214 MBold values indicate the best model performance (the smallest inference time or the least number of model parameters)
Fig. 8. Performance (%) curve comparison of the proposed UMIAD-EGMF with the SOTA methods on the breast ultrasound images (left), brain magnetic resonance imaging (middle), and head computed tomography (right) datasets. The vertical axis represents performance, while the horizontal axis indicates the different methods
Figure 8 presents performance curve comparisons between the proposed UMIAD-EGMF and the SOTA methods on the BUSI, brain MRI, and head CT datasets. These curves clearly show the performance trends of the different methods. It is evident that the proposed UMIAD-EGMF outperformed the SOTA methods in most metrics, which further confirms the superior generalization of the proposed UMIAD-EGMF across different datasets.Fig. 9. Visualization comparison of detection and segmentation results. The first row shows the original images, the second row shows the ground truth images, and the third to eighth rows show the detection and segmentation results from CFlow [25], UFlow [35], MSFlow [26], ADFA [36], IKD [1], RD4AD [37], MedMAE [27], MLFEU-net [14], and MedViTV2 [15]. The ninth row shows the detection and segmentation results obtained by the proposed UMIAD-EGMF
Visualization
Figure 9 shows the visual results of the proposed UMIAD-EGMF and SOTA methods across different datasets. The first two columns show example images and their detection and segmentation results on the BUSI dataset, the next two columns show these results on the brain MRI dataset, and the last two columns present those on the head CT dataset. To demonstrate the ability of UMIAD-EGMF to handle multi-scale anomalies, example images containing both small and large anomalies were selected from each dataset. The proposed UMIAD-EGMF exhibited superior accuracy in localizing lesion regions. Specifically, while anomaly detection methods based on the normalizing flow framework tend to produce multiple small detection regions around anomalies, UMIAD-EGMF can eliminate these small interference regions through multi-scale edge information inverse projection. Similarly, while ADFA and other knowledge distillation-based related methods showed relatively few misdetections, UMIAD-EGMF further reduced such misdetections by refining the edge information.
Performance impact of image resolution
To demonstrate the effect of different input image sizes, experiments were conducted based on varying resolutions of the BUSI, brain MRI, and head CT datasets. The results listed in Table 5 indicate that a 256 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 256 resolution yielded optimal performance on the BUSI and brain MRI datasets. Conversely, the head CT dataset achieved optimal results with a 128 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 128 resolution. This difference can be attributed to the positive influence of the higher resolution in capturing critical information for the complex and detailed structures present in the BUSI and brain MRI datasets, while the head CT dataset, which is characterized by larger lesion areas and pronounced gray-scale variations, may sufficiently capture necessary information with a lower resolution of the 128 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 128. However, the proposed UMIAD-EGMF exhibited the worst performance at the highest resolution of 512 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 512 in the three datasets, mainly because the noise information increased with the increasing resolution. Therefore, a 256 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 256 resolution was adopted as the default image size, which balances sufficiently detailed information extraction and complex high-frequency noise suppression. Table 5. Comparative results of different resolutions on the breast ultrasound images, brain magnetic resonance imaging, and head computed tomography datasets (%)ResolutionDET.AUROCSeg.AUROCSeg.AUPROBUSI dataset128 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 12888.2775.2461.66256 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 25695.8377.8163.36512 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 51274.1773.2959.20Brain MRI dataset128 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 12888.1282.5459.42256 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 25690.8386.3459.40512 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 51282.0874.5048.27Head CT dataset128 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 12880.1089.21****63.53256 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 25682.9086.7662.24512 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 51270.0484.3856.40Bold values indicate the best performance (the highest DET.AUROC, Seg.AUROC, or Seg.AUPRO value) for each dataset under different resolutions
Ablation experiments
To evaluate the effectiveness of the proposed UMIAD-EGMF, related ablation experiments involving different modules, edge extraction methods, and edge aggregation methods were performed on the BUSI, brain MRI, and head CT datasets. Table 6. Comparative results of the ablation experiments on the breast ultrasound images, brain magnetic resonance imaging, and head computed tomography datasets (%)EGMEAMMFFDatasetDET.AUROCSeg.AUROCSeg.AUPROBUSI92.4374.7659.16Brain MRI87.1884.8455.48Head CT80.7084.7757.49BUSI93.4674.9960.47✓Brain MRI88.7185.3657.47Head CT81.1585.0858.97BUSI94.1675.2962.68✓✓Brain MRI89.9686.0258.37Head CT82.1585.9860.70BUSI94.8374.3161.48✓Brain MRI89.0385.9757.49Head CT81.8186.1259.83BUSI95.8377.8163.36✓✓✓Brain MRI90.8386.3459.40Head CT82.9086.7662.24✓ indicates that the corresponding module (EGM/EAM/MFF) is enabled; Bold values indicate the best performance (the highest DET.AUROC, Seg.AUROC, or Seg.AUPRO value) for each dataset in the ablation experiments. EGM Edge guidance module, EAM Edge aggregation module, MFF Multi-scale flow fusion
Different modules In Table 6, these results show that the EGM alone can help the network detect anomalies based on edge information. However, the performance improvement of the model was modest. When the EGM and EAM are combined into the proposed model, better performance can be achieved. Additionally, the MFF module can contribute to similar improvements in anomaly detection. Overall, the combination of the three complementary modules achieves the best anomaly detection performance as it integrates key factors, such as edge information, multi-scale flow features, and global contexts, thereby confirming the effectiveness of each module.
Different edge extraction methods Image edges are crucial features where significant intensity or color changes occur between different regions, providing vital information about object contours and shapes in image detection and segmentation tasks. To enhance the efficiency of edge extraction in medical images, the proposed EGM was compared with the Prewitt operator and the segment anything model (SAM) [38]. The Prewitt operator detects edges by computing gradient values along the horizontal and vertical directions. SAM is a comprehensive model proposed by ref. [38] and designed for general segmentation tasks in computer vision. From the results listed in Table 7, it is evident that the EGM significantly enhanced the anomaly detection performance, whereas the Prewitt operator and SAM showed inferior performance owing to the complexity and subtlety of medical images. Table 7. Comparative results of the different edge extraction methods on the breast ultrasound image, brain magnetic resonance imaging, and head computed tomography datasets (%)MethodDET.AUROCSeg.AUROCSeg.AUPROBUSI datasetPrewitt85.5469.4256.25SAM [38]87.3472.8857.05EGM93.4674.9960.47****Brain MRI datasetPrewitt56.5666.0741.73SAM [38]92.6673.0854.24EGM88.7185.3657.47Head CT datasetPrewitt79.4078.8544.67SAM [38]79.4083.8256.28EGM81.1585.0858.97****Bold values indicate the best performance (the highest DET.AUROC, Seg.AUROC, or Seg.AUPRO value) for each dataset among the compared edge extraction methods
The Prewitt operator offers simplicity and ease of use; however, it is limited by its reliance on local difference extraction, which makes it unable to capture intricate and subtle edge features in medical images. Moreover, its sensitivity to noise often leads to inaccuracies in edge detection results. Similarly, the SAM excels as a pre-trained segmentation model for general images; however, it faces challenges in unsupervised anomaly detection for medical images. Medical images often contain complex and noisy backgrounds outside the training scope of the SAM. Furthermore, in unsupervised learning scenarios, the SAM lacks explicit labels and struggles to effectively detect image edges. In contrast, the EGM is designed for the specific characteristics and demands of medical images, with guidance from prior edge information; therefore, it is better suited for handling the complexities of medical image analysis.
Different edge aggregation methods Edge aggregation is a crucial step in processing and integrating edge information, merging and enhancing the extracted edge details to improve segmentation accuracy and robustness. To implement edge information integration, the EAM was introduced to mine low-level appearance and high-level semantic information and obtain aggregated features, thereby enhancing segmentation performance. To demonstrate the effectiveness of the EAM, three different edge feature fusion methods were compared: direct addition (ADD), direct multiplication (MUL), and the EAM. The performance results of these methods are listed in Table 8. Table 8. Comparative results of the different edge aggregation methods on the breast ultrasound image, brain magnetic resonance imaging, and head computed tomography datasets (%)MethodDET.AUROCSeg.AUROCSeg.AUPROBUSI datasetADD88.2775.2461.66MUL87.3273.7561.32EAM94.1675.2962.68****Brain MRI datasetADD88.1282.5457.42MUL81.6280.2156.06EAM89.9686.0258.37****Head CT datasetADD80.1089.2163.53MUL76.0087.2458.78EAM82.1585.9860.70****Bold values indicate the best performance (the highest DET.AUROC, Seg.AUROC, or Seg.AUPRO value) for each dataset among the compared edge aggregation methods (ADD, MUL, EAM)
The experimental results in Table 8 show that simple fusion methods, such as addition and multiplication, do not fully exploit the edge information, resulting in poor detection and segmentation outcomes. In contrast, the EAM significantly enhanced the anomaly detection performance of the network. Specifically, the addition operation is overly simplistic, merely adding two groups of features pixel-by-pixel, which can lead to overlapping or missing information. The multiplication operation directly multiplies two groups of features pixel-by-pixel, which suppresses some features because multiplication reduces the variability between features. Conversely, EAM is more effective as it mines edge information with greater detail through the attention mechanism and residual connection to significantly improve the segmentation performance.
Discussion
Based on the comparison experiments and ablation studies, the proposed UMIAD-EGMF successfully handled biomedical image anomaly detection with an excellent performance. By effectively integrating the EGM, EAM, and MFF, the proposed UMIAD-EGMF performed better than other SOTA methods, even though different types of images appeared with complex shapes and varying scales. Most of the existing methods cannot perform accurate anomaly detection, particularly for issues such as blurred edges and varying scales of abnormal regions.
To address these issues, a new unsupervised method for medical image anomaly detection (UMIAD-EGMF) was proposed. UMIAD-EGMF extracts richer edge information with scale adaptability and progressively identifies discriminative information for anomaly detection in medical images. The characteristics of UMIAD-EGMF are as follows: it not only captures contextual information around anomaly boundaries by fusing low-level image features, to enhance the boundary details of abnormal regions via the EGM, but also efficiently integrates the edge information extracted by the EGM into deeper features through the EAM. In addition, UMIAD-EGMF can further merge feature maps at different scales via the MFF module. This allows it to capture common anomaly features, including both subtle and significant multi-dimensional information.
Conclusions
To address the blurred edges and varying scales of abnormal regions in medical images, this study extended the normalizing flow framework to unsupervised medical image anomaly detection and proposed a novel method (UMIAD-EGMF) tailored for this task. Recognizing the critical role of boundary information in detection and segmentation, a medical image EGM was designed to capture contextual information around anomalous boundaries. This module is integrated into the network to enhance boundary refinement. To effectively fuse structural image features with deep semantic features, an EAM was introduced that combines multi-level features to improve segmentation accuracy. In addition, an MFF module was proposed that exchanges local and global information based on multi-scale pyramid features. This module addresses lesions of varying sizes, thereby enabling the better detection and segmentation of abnormal regions. This study evaluated the proposed UMIAD-EGMF on public medical image anomaly detection datasets, demonstrating improved accuracy and robustness in anomaly detection. However, this study has limitations, as it did not account for common clinical data variations and noise disturbances, such as lighting changes, motion blur, and imaging artifacts. Future work should further model information variation rules using dynamic network architectures to ensure the reliability of the algorithm in the complex environments of real-world clinical applications.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Arjovsky M, Chintala S, Bottou L (2017) Wasserstein generative adversarial networks. In: Proceedings of the 34th international conference on machine learning, PMLR, Sydney, 6–11 August 2017. https://dl.acm.org/doi/abs/10.5555/3305381.3305404
- 2Zhou YX, Xu X, Song JK, Shen FM, Shen HT (2025) MS Flow: multiscale flowbased framework for unsupervised anomaly detection. IEEE Trans Neural Netw Learn Syst 36(2):2437–2450. 10.1109/TNNLS.2023.334411810.1109/TNNLS.2023.334411838194384 · doi ↗ · pubmed ↗
- 3Rezende DJ, Mohamed S (2015) Variational inference with normalizing flows. In: Proceedings of the 32nd international conference on machine learning, JMLR.org, Lille, 6–11 July 2015. https://proceedings.mlr.press/v 37/rezende 15
- 4Al-Dhabyani W, Gomaa M, Khaled H, Fahmy A (2020) Dataset of breast ultrasound images. Data Brief 28:104863. 10.1016/j.dib.2019.1048610.1016/j.dib.2019.104863 PMC 690672831867417 · doi ↗ · pubmed ↗
