Adaptive Normalization Enhances the Generalization of Deep Learning Model in Chest X-Ray Classification
Jatsada Singthongchai, Tanachapong Wangkhamhan

TL;DR
This paper shows that adaptive preprocessing improves deep learning model performance for chest X-ray classification by enhancing generalization across datasets.
Contribution
The study introduces an adaptive preprocessing pipeline combining ROI cropping and histogram standardization for improved model generalization.
Findings
The adaptive pipeline improved accuracy and F1-score on datasets with stable contrast characteristics.
Histogram standardization was the main contributor to performance gains, with ROI cropping providing additional benefits.
The method had minimal computational overhead and showed statistically significant improvements.
Abstract
This study presents a controlled benchmarking analysis of min–max scaling, Z-score normalization, and an adaptive preprocessing pipeline that combines percentile-based ROI cropping with histogram standardization. The evaluation was conducted across four public chest X-ray (CXR) datasets and three convolutional neural network architectures under controlled experimental settings. The adaptive pipeline generally improved accuracy, F1-score, and training stability on datasets with relatively stable contrast characteristics while yielding limited gains on MIMIC-CXR due to strong acquisition heterogeneity. Ablation experiments showed that histogram standardization provided the primary performance contribution, with ROI cropping offering complementary benefits, and the full pipeline achieving the best overall performance. The computational overhead of the adaptive preprocessing was minimal…
1. Introduction
Chest radiography (CXR) is a widely used imaging modality due to its rapid acquisition, low cost, and non-invasive nature, supporting diagnosis across a broad range of thoracic diseases [1,2]. With advances in artificial intelligence (AI), deep learning-based CXR analysis has shown promise in improving triage efficiency and diagnostic support, particularly in high-volume and resource-limited clinical settings [1,3,4].
Despite these advances, the generalization of deep learning (DL) models across institutions remains limited. Two closely related factors contribute to this challenge. First, resizing high-resolution CXRs to standard network input sizes (e.g., 256 × 256 pixels) reduces the relative lung-to-image ratio, allowing non-diagnostic background structures to dominate learned representations and obscure subtle abnormalities [5,6]. Second, substantial domain shift arises from heterogeneity in imaging devices, acquisition protocols, contrast, brightness, and preprocessing pipelines across institutions, which can significantly degrade model performance when deployed beyond the training domain [7,8].
Image normalization is a central yet under-benchmarked factor in addressing these issues. While conventional strategies such as min–max scaling and Z-score normalization are widely adopted, their effectiveness under cross-dataset variability has rarely been evaluated in a controlled and systematic manner. Prior studies have explored adaptive instance normalization [9], contrastive domain alignment [10,11,12], multimodal integration with clinical metadata [2], and histogram-based or localized intensity normalization techniques [13,14,15,16]. However, these approaches are typically evaluated within single datasets or fixed model architectures, making it difficult to isolate the role of normalization from confounding factors such as dataset scale, model capacity, or training protocol.
To address this gap, this study conducted a systematic benchmarking analysis of three normalization strategies: min–max scaling, Z-score normalization, and an adaptive preprocessing pipeline combining percentile-based ROI cropping with histogram standardization. The evaluation was performed across four public CXR datasets, namely CheXpert [17], MIMIC-CXR [18], ChestX-ray14 [6], and the pediatric Chest-Xray-Pneumonia dataset [24], using three convolutional neural network architectures (a lightweight CNN, EfficientNet-B0, and MobileNetV2) under controlled sampling and identical training conditions.
This work is positioned as a controlled benchmarking study rather than a methodological innovation. The main contributions are threefold:
- (1) It establishes a controlled cross-dataset and cross-architecture evaluation framework for comparing normalization strategies;
- (2) It quantifies the impact of normalization choices on cross-domain generalization, training stability, and performance consistency, with particular emphasis on lightweight architectures such as MobileNetV2; and
- (3) It provides a statistically grounded comparison using Friedman–Nemenyi and Wilcoxon signed-rank tests to clarify when adaptive normalization yields meaningful performance gains over conventional approaches.
The remainder of this paper is organized as follows. Section 2 reviews related work on normalization and domain generalization in CXR analysis. Section 3 describes the datasets, normalization techniques, and experimental methodology. Section 4 presents the experimental results. Section 5 discusses the implications and limitations of the findings, and Section 6 concludes the study and outlines directions for future work.
2. Background and Related Work
Robust evaluation of normalization strategies for chest X-ray (CXR) classification requires datasets that reflect meaningful diversity in patient populations, acquisition protocols, and imaging conditions. Prior studies have demonstrated that deep learning models for CXR are highly sensitive to domain shifts introduced by heterogeneity in imaging sources and preprocessing pipelines [7,8,19]. Accordingly, the datasets and preprocessing techniques reviewed in this section provide the foundation for cross-dataset benchmarking of normalization methods.
2.1. Datasets
2.1.1. ChestX-ray14
ChestX-ray14 contains over 112,000 frontal CXR images from more than 30,000 patients, labeled with fourteen thoracic disease categories [6,20]. Due to label noise arising from automated report mining, it is commonly used to evaluate model robustness under imperfect supervision [8,21].
2.1.2. CheXpert
CheXpert includes more than 220,000 images from approximately 65,000 patients and explicitly models diagnostic uncertainty in its labels [17]. This design supports the evaluation of calibration, robustness, and uncertainty-aware learning in CXR classification [15,22].
2.1.3. MIMIC-CXR
MIMIC-CXR comprises over 370,000 CXR images collected across multiple departments and imaging devices, introducing substantial acquisition heterogeneity [18]. This diversity makes it a key benchmark for assessing cross-domain generalization in realistic clinical settings [23].
2.1.4. Pediatric Chest X-Ray (Kermany Dataset)
The pediatric CXR dataset contains 5863 images labeled as normal, bacterial pneumonia, or viral pneumonia [24]. Despite its smaller scale, it represents a distinct anatomical domain and is frequently used to study transfer learning and adult-to-pediatric generalization [25,26].
2.2. Preprocessing Techniques
Preprocessing is essential for mitigating domain shift caused by variations in anatomy, acquisition parameters, and institutional imaging protocols [7,11,27]. Among preprocessing strategies, normalization plays a central role in stabilizing intensity distributions, while data augmentation provides complementary benefits through geometric and intensity transformations [28].
2.2.1. Normalization
Normalization aims to reduce intensity variability induced by scanner differences, exposure settings, and patient conditions. Conventional techniques such as min–max scaling and Z-score normalization remain widely used due to their simplicity and compatibility with convolutional architectures [29]. However, their effectiveness is limited in multi-source settings with heterogeneous contrast distributions.
Recent work has explored adaptive and spatially aware normalization schemes, including adaptive instance normalization and localized histogram-based methods, which demonstrate improved robustness in cross-domain and self-supervised learning pipelines [14,16,30]. These studies highlight the importance of region-focused intensity transformations for heterogeneous CXR datasets.
2.2.2. Min–Max Scaling as a Baseline
Min–max scaling linearly maps pixel intensities to a fixed range and is commonly used as a baseline due to its low computational cost [31,32]. Nevertheless, it performs poorly under cross-dataset variability because it does not account for differences in underlying intensity distributions [33,34].
2.2.3. Z-Score Normalization as a Standard Baseline
Z-score normalization standardizes intensities to zero mean and unit variance and has demonstrated greater resilience to device heterogeneity than min–max scaling in multi-center studies [35]. It is also associated with improved calibration and reduced performance variance in domain-adaptive frameworks [13,36].
2.3. Model Architectures
To assess normalization effects under realistic deployment constraints, three convolutional architectures with varying computational complexity are commonly adopted: lightweight CNNs for resource-constrained environments, EfficientNet-B0 for balanced accuracy and efficiency, and MobileNetV2 for mobile and embedded applications [37,38]. Training these architectures under identical optimization settings enables the isolation of normalization effects from architectural factors.
2.4. Region of Interest and Signal-to-Noise Ratio
Resizing high-resolution CXRs to standard input resolutions can reduce the lung-to-image ratio and obscure subtle pulmonary findings [1,6]. ROI-based cropping techniques mitigate this effect by preserving diagnostically relevant regions. Percentile-based cropping defined in relative coordinate space has been shown to provide stable anatomical coverage across datasets with varying resolutions and aspect ratios, supporting improved cross-dataset generalization [11].
2.5. Domain Adaptation and Histogram Standardization
Domain shifts caused by inter-dataset variability in brightness and contrast can significantly degrade model performance [11,18]. Histogram standardization aligns image intensity statistics to a reference distribution and has been shown to reduce inter-dataset variability, particularly when combined with region-focused preprocessing strategies [16,39].
2.6. Comparative Analysis of Related Work
Recent studies emphasize that normalization choices substantially influence cross-domain robustness in CXR classification [35,36]. While advances in segmentation-driven pipelines, federated learning, and model compression address complementary challenges [15,40,41], systematic comparisons of normalization strategies across datasets and architectures remain limited. This gap motivates the benchmarking framework adopted in this study.
Building on this body of work, our study presents a systematic benchmarking of normalization strategies across multiple datasets and CNN architectures, as summarized in Table 1.
2.7. Transformer and Foundation Model Approaches
Transformer-based and foundation model approaches, including vision–language pretraining and hybrid CNN–transformer architectures, have demonstrated strong transfer and zero-shot performance in CXR analysis [22,48,49,50,51]. Although this study focused on convolutional backbones for controlled benchmarking, the preprocessing issues examined (ROI selection, intensity normalization, and cross-domain robustness) remain directly relevant to transformer-based and foundation model pipelines, which also depend on stable input distributions for reliable downstream performance.
3. Methodology
3.1. Dataset Description
This study utilized four publicly available chest X-ray datasets that represent diverse imaging conditions, patient populations, and diagnostic labels. Three large-scale adult datasets (ChestX-ray14, CheXpert, and MIMIC-CXR) were each uniformly sampled to 16,000 images to create a controlled evaluation setting in which normalization effects can be compared under identical sample sizes. The pediatric Chest-Xray-Pneumonia dataset contains 5863 images and was included in full because of its smaller size. For the pediatric dataset, class-proportion balancing was applied to mitigate label imbalance before training.
Standardizing the dataset sizes helps ensure that performance differences across normalization methods are not influenced by variations in dataset scale. This controlled sampling strategy follows recommendations from prior evaluation frameworks that emphasize reproducibility and statistical fairness when comparing preprocessing approaches in medical imaging [7,35].
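The fixed-size sampling described above can be sketched as follows. This is a minimal illustration that assumes each dataset is available as a list of image identifiers; the function name `sample_fixed_size`, the placeholder identifiers, and the seed are hypothetical and are not taken from the study.

```python
import random


def sample_fixed_size(image_ids, n_samples=16_000, seed=42):
    """Draw n_samples identifiers uniformly at random without replacement.

    Datasets with no more than n_samples images are returned whole, which
    mirrors how the smaller pediatric dataset was used in full.
    """
    if len(image_ids) <= n_samples:
        return list(image_ids)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(image_ids, n_samples)


# Hypothetical identifiers standing in for one adult and one pediatric dataset.
adult_ids = [f"img_{i:06d}" for i in range(112_000)]
pediatric_ids = [f"ped_{i:04d}" for i in range(5_863)]

adult_subset = sample_fixed_size(adult_ids)
pediatric_subset = sample_fixed_size(pediatric_ids)
```

Sampling image identifiers directly is the simplest reading of "uniformly sampled"; a pipeline that also needs patient-level splits may prefer to sample patients first.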
A summary of the datasets used in this study is provided in Table 2.
3.2. Image Preprocessing Techniques
Normalization is a critical component in stabilizing the training dynamics and improving the generalization of deep learning models for chest X-ray (CXR) analysis. Variations in acquisition devices, patient anatomy, and exposure settings introduce substantial heterogeneity across datasets, making preprocessing essential for cross-domain robustness [52]. In this study, three normalization techniques were evaluated systematically: scaling normalization, Z-score normalization, and the proposed adaptive normalization. Representative examples are shown in Figure 1, Figure 2 and Figure 3.
3.2.1. Scaling Normalization
Scaling normalization linearly remaps pixel intensities to a fixed range, typically 0–1 after resizing. A representative example is shown in Figure 1. This method is computationally lightweight and commonly used when imaging conditions are relatively homogeneous [1,53].
However, scaling has notable limitations in multi-institutional CXR settings:
- It does not correct local contrast variations and is sensitive to outliers caused by acquisition artifacts or metallic implants.
- It fails to harmonize intensity distributions across scanners, leading to degraded cross-domain generalization [16,39].
Accordingly, scaling normalization was included in this study solely as a baseline preprocessing strategy.
Pixel intensities were linearly rescaled from the original 0–255 range to 0–1 following spatial resizing.
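As a rough illustration of this baseline, the sketch below shows both the fixed-range rescaling from 0–255 to 0–1 and a per-image min–max variant. It is dependency-free and represents an image as a list of rows; the function names and the tiny example image are assumptions made for illustration, and a real pipeline would more likely operate on array types.

```python
def scale_fixed_range(image, max_value=255.0):
    """Map intensities from [0, max_value] to [0, 1] by division."""
    return [[pixel / max_value for pixel in row] for row in image]


def scale_min_max(image):
    """Per-image min-max scaling to [0, 1]; a constant image maps to zeros."""
    flat = [pixel for row in image for pixel in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in image]
    return [[(pixel - lo) / span for pixel in row] for row in image]


# Hypothetical 2x3 image with one bright outlier (for example, a metallic implant).
image = [[10, 20, 30], [40, 50, 255]]
fixed = scale_fixed_range(image)
per_image = scale_min_max(image)
```

In the per-image variant the single outlier sets the upper bound, so the pixel with value 50 maps to (50 - 10) / (255 - 10) ≈ 0.163 and the remaining tissue intensities are compressed into a narrow band. This is the outlier sensitivity noted in the list above.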
3.2.2. Z-Score Normalization
Z-score normalization standardizes image intensities to zero mean and unit variance (Figure 2), reducing brightness and contrast variations caused by heterogeneous imaging devices and patient populations [29,54]. As a result, it is well-suited for multi-center CXR datasets.
A known limitation is that the mean and standard deviation summarize approximately Gaussian intensity distributions best, and such distributions are not always present in clinical CXRs [7]. Nevertheless, Z-score normalization has been reported to be more resilient to device heterogeneity than simple scaling in multi-center studies [35].
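A minimal sketch of Z-score normalization is given below. The function name `z_score`, the `eps` stabilizer, and the option to pass precomputed statistics are assumptions for illustration; the optional `mean` and `std` arguments allow dataset-level statistics to be supplied instead of per-image ones.

```python
import math


def z_score(image, mean=None, std=None, eps=1e-8):
    """Standardize intensities to zero mean and unit variance.

    If mean/std are omitted they are computed from the image itself;
    passing them in applies externally computed (e.g. dataset-level) statistics.
    """
    flat = [pixel for row in image for pixel in row]
    if mean is None:
        mean = sum(flat) / len(flat)
    if std is None:
        std = math.sqrt(sum((p - mean) ** 2 for p in flat) / len(flat))
    # eps guards against division by zero on constant images
    return [[(pixel - mean) / (std + eps) for pixel in row] for row in image]


# Hypothetical 2x2 image used only to exercise the function.
image = [[0, 2], [4, 6]]
normalized = z_score(image)
```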
3.2.3. Adaptive Normalization (Proposed Method)
The proposed adaptive normalization method addresses both spatial and intensity variability by combining CDF-guided ROI cropping with histogram standardization. This design is motivated by evidence that localized intensity correction improves diagnostic feature visibility and reduces device-induced domain shifts [16,39,52].
A complete visualization of the pipeline is provided in Figure 3.
Step 1: ROI Localization via CDF-Guided Cropping
To reduce background structures that do not contribute to diagnosis, the cumulative distribution function (CDF) of grayscale-sum values is computed:
- Horizontal (x-axis)—The ROI is extracted between the 5th and 95th percentiles, removing low-density lateral regions that predominantly contain background. This range is selected based on empirical consistency across adult and pediatric CXRs and aligns with findings that lateral regions contribute minimal diagnostic information.
- Vertical (y-axis)—The ROI is retained between the 15th and 95th percentiles, which excludes anatomical noise above the clavicle and reduces variability caused by neck and shoulder structures.
In exploratory experiments across ChestX-ray14, CheXpert, MIMIC-CXR, and Chest-Xray-Pneumonia, these percentile ranges consistently preserved lung apices and costophrenic angles while excluding most neck and shoulder structures. Percentile-based cropping is also more robust to variations in resolution and aspect ratio than fixed-pixel cropping, which facilitates cross-dataset generalization without manual retuning of crop coordinates.
These percentile thresholds were refined through exploratory analysis and are consistent with established approaches for lung-focused cropping in prior studies [39].
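The cropping step can be sketched as follows. This is one possible reading of the description, assuming that the "grayscale-sum values" are column and row intensity sums and that the crop boundaries are the positions where the normalized cumulative sum reaches the stated percentiles. The function names `cdf_bounds` and `crop_roi` and the toy image are hypothetical.

```python
from itertools import accumulate


def cdf_bounds(profile, low, high):
    """Return (start, stop) indices where the normalized cumulative sum of
    `profile` first reaches `low` and `high`; `stop` is exclusive."""
    cumulative = list(accumulate(profile))
    total = cumulative[-1]
    if total <= 0:
        return 0, len(profile)  # blank profile: keep the full extent
    start = next(i for i, c in enumerate(cumulative) if c / total >= low)
    stop = next(i for i, c in enumerate(cumulative) if c / total >= high) + 1
    return start, stop


def crop_roi(image, x_range=(0.05, 0.95), y_range=(0.15, 0.95)):
    """Crop an image (list of rows) using CDF percentiles of its
    column sums (x-axis) and row sums (y-axis)."""
    col_sums = [sum(col) for col in zip(*image)]
    row_sums = [sum(row) for row in image]
    x0, x1 = cdf_bounds(col_sums, *x_range)
    y0, y1 = cdf_bounds(row_sums, *y_range)
    return [row[x0:x1] for row in image[y0:y1]]


# Toy 10x10 image: a bright block (rows 2-8, columns 2-7) on a dark background.
image = [[1 if 2 <= r <= 8 and 2 <= c <= 7 else 0 for c in range(10)]
         for r in range(10)]
roi = crop_roi(image)
```

Because the bounds are derived from each image's own intensity profile rather than from fixed pixel coordinates, the same percentile settings apply unchanged to images of different resolution and aspect ratio.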
Step 2: Histogram-Based Intensity Standardization
To harmonize contrast across datasets with differing exposure characteristics, image histograms were standardized using target statistics derived from normal ChestX-ray14 images:
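- Target mean: $\mu_{\text{target}} = 0.4776 \times 255 \approx 121.8$;
- Target standard deviation: $\sigma_{\text{target}} = 0.2238 \times 255 \approx 57.1$.
Standardization was applied as defined in Equation (1):
$$I'(x, y) = \frac{I(x, y) - \mu}{\sigma} \cdot \sigma_{\text{target}} + \mu_{\text{target}} \tag{1}$$
where
$I(x, y)$: The intensity value of the original image at pixel $(x, y)$;
$\mu$: Mean intensity of the original image;
$\sigma$: Standard deviation of the original image;
$\sigma_{\text{target}}$: Target standard deviation;
$\mu_{\text{target}}$: Target mean;
$I'(x, y)$: Normalized pixel value at position $(x, y)$.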
This ensures consistent luminance and contrast across heterogeneous datasets, addressing a key source of domain shift [16].
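The intensity standardization step can be illustrated with the short sketch below. It assumes that Equation (1) takes the usual mean-and-variance matching form (subtract the image mean, divide by the image standard deviation, then rescale to the target statistics). The function name `standardize_histogram`, the `eps` stabilizer, and the example image are hypothetical; the target constants are the values reported above.

```python
import math

TARGET_MEAN = 0.4776 * 255  # approx. 121.8, from normal ChestX-ray14 images
TARGET_STD = 0.2238 * 255   # approx. 57.1


def standardize_histogram(image, target_mean=TARGET_MEAN,
                          target_std=TARGET_STD, eps=1e-8):
    """Shift and rescale intensities so the image matches the target
    mean and standard deviation."""
    flat = [pixel for row in image for pixel in row]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((p - mean) ** 2 for p in flat) / len(flat))
    return [[(pixel - mean) / (std + eps) * target_std + target_mean
             for pixel in row] for row in image]


# Hypothetical under-exposed 2x2 image.
image = [[0, 50], [100, 250]]
standardized = standardize_histogram(image)
```

Whether the output is subsequently clipped to the 0–255 range is not stated in the text, so no clipping is applied here.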
3.2.4. Summary
Together, these preprocessing techniques reduce the risk that model comparisons are biased by input variability. While scaling and Z-score normalization provide useful baselines, the proposed adaptive method addresses both anatomical and intensity-level heterogeneity. By incorporating anatomically motivated ROI localization and distribution-aware intensity standardization, the preprocessing pipeline is intended to enhance fairness, reproducibility, and cross-domain robustness in clinical deep-learning workflows [52].
3.3. Deep Learning Model Architecture
Three convolutional neural network (CNN) architectures were selected to evaluate how different normalization strategies influence performance and generalization: a custom lightweight CNN, EfficientNet-B0, and MobileNetV2. These models span a range of computational complexity and representational capacity, which enables a fair and systematic assessment of preprocessing effects across architectures of differing expressiveness. Their selection aligns with widely used CNN-based diagnostic pipelines in chest radiography research [20,42,52].
3.3.1. Custom Lightweight CNN
The custom CNN serves as a controlled baseline for isolating the effect of normalization. Its architecture consists of three convolutional layers with ReLU activation, followed by max-pooling, two fully connected layers, and a Softmax output layer for binary classification. With approximately 1.2 million trainable parameters, the model is intentionally lightweight, enabling rapid training and interpretability. This type of architecture has been recommended for applications in resource-limited settings and embedded diagnostic workflows [21,34]. Figure 4 illustrates the structure of this CNN model.
3.3.2. EfficientNet-B0
EfficientNet-B0 employs compound scaling and squeeze-and-excitation blocks, yielding strong performance with a relatively small parameter footprint. Its robustness in thoracic disease classification and cross-domain generalization has been demonstrated in large-scale radiology studies, making it suitable for evaluating normalization under realistic multi-institutional variation [13,37].
3.3.3. MobileNetV2
MobileNetV2 uses inverted residual blocks with depthwise separable convolutions, optimizing it for low-latency inference on embedded and mobile devices. Despite being lightweight, its performance is highly sensitive to preprocessing quality, particularly when trained on heterogeneous public datasets [5,55,56]. Including MobileNetV2 therefore allows the study to evaluate how normalization affects compact architectures deployed in real-time clinical environments.
3.3.4. Model Training Framework
All models were implemented in PyTorch (Python version 3.13.5) and trained under identical optimization settings so that performance differences can be attributed to the normalization methods. Batch-based stochastic training, cross-entropy loss, and unified tracking of accuracy and F1-score were applied across all architectures. The detailed training workflow is summarized using pseudocode in Section 3.4.3.
3.4. Experimental Design
The experimental design was structured to evaluate how different normalization strategies affect chest X-ray classification performance under controlled and comparable conditions. An overview of the experimental pipeline is illustrated in Figure 5.
Four publicly available datasets were considered: ChestX-ray14, CheXpert, MIMIC-CXR, and Chest-Xray-Pneumonia. These datasets differ in patient demographics, acquisition protocols, and imaging devices, making them suitable for cross-domain evaluation [17,18,24].
Each dataset was preprocessed using one of three normalization strategies: scaling normalization, Z-score normalization, and adaptive normalization. Three convolutional neural network architectures were evaluated: a custom lightweight CNN, EfficientNet-B0, and MobileNetV2. This resulted in 36 experimental configurations (4 datasets × 3 normalization methods × 3 models).
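The experimental grid can be enumerated with a few lines of code. The sketch below is illustrative only: the string labels are shorthand chosen here, and the seed list is the one reported for the repeated runs.

```python
from itertools import product

DATASETS = ["ChestX-ray14", "CheXpert", "MIMIC-CXR", "Chest-Xray-Pneumonia"]
NORMALIZATIONS = ["scaling", "z-score", "adaptive"]
MODELS = ["custom-cnn", "EfficientNet-B0", "MobileNetV2"]
SEEDS = [42, 123, 456]

# One configuration per (dataset, normalization, model) triple.
configurations = list(product(DATASETS, NORMALIZATIONS, MODELS))

# Each configuration is repeated once per fixed random seed.
runs = [(dataset, norm, model, seed)
        for (dataset, norm, model) in configurations
        for seed in SEEDS]

print(len(configurations), len(runs))  # 36 108
```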
All experiments were conducted under identical optimization settings to ensure a fair comparison. Patient-level stratified sampling was applied using an 80% training and 20% validation split. To improve robustness, each configuration was evaluated over three independent runs corresponding to different fixed random seeds (42, 123, and 456), and the reported results represent the mean performance across these runs [57,58]. The detailed training procedure is described using pseudocode in Section 3.4.3.
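A patient-level stratified split keeps all images of a given patient on the same side of the split. The sketch below shows one way to do this, assuming each image record is a `(patient_id, label)` pair and that a patient is assigned to the stratum of their highest label. The function name, the record layout, and the stratification rule are assumptions, since the text does not specify them.

```python
import random
from collections import defaultdict


def patient_level_split(records, train_fraction=0.8, seed=42):
    """Split (patient_id, label) records so that no patient appears in both
    the training and validation sets, stratifying patients by label."""
    # Assumption: a patient's stratum is the maximum label over their images.
    patient_label = {}
    for patient_id, label in records:
        patient_label[patient_id] = max(label, patient_label.get(patient_id, 0))

    strata = defaultdict(list)
    for patient_id, label in patient_label.items():
        strata[label].append(patient_id)

    rng = random.Random(seed)
    train_ids = set()
    for label in sorted(strata):
        patient_ids = sorted(strata[label])  # sort first for reproducibility
        rng.shuffle(patient_ids)
        cut = round(train_fraction * len(patient_ids))
        train_ids.update(patient_ids[:cut])

    train = [r for r in records if r[0] in train_ids]
    val = [r for r in records if r[0] not in train_ids]
    return train, val


# Hypothetical records: 10 patients with 2 images each, half of them abnormal.
records = [(f"p{pid:02d}", 0 if pid < 5 else 1)
           for pid in range(10) for _ in range(2)]
train, val = patient_level_split(records)
```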
3.4.1. Training Hyperparameters
All models were trained using identical optimization settings. The hyperparameters used in all experiments are summarized in Table 3.
3.4.2. Data Augmentation
To avoid confounding effects on intensity normalization, only mild geometric data augmentation was applied uniformly across all experiments. The augmentation strategy included horizontal flipping, small rotations (±7°), isotropic scaling (0.9–1.1), and translation shifts of up to 5% of the image dimensions.
No brightness, contrast, gamma, or intensity jitter was applied to ensure that intensity statistics remained unaffected by augmentation. This strategy follows established recommendations in medical imaging, where geometric transformations improve robustness without altering radiographic intensity patterns [28,46,47].
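The augmentation ranges above can be expressed as a simple parameter sampler. This sketch only draws the geometric parameters and does not apply them to an image; the function name and the flip probability of 0.5 are assumptions, as the text does not state the flip probability.

```python
import random


def sample_augmentation(rng, image_width, image_height):
    """Sample one set of mild geometric augmentation parameters."""
    return {
        "flip": rng.random() < 0.5,                # assumed flip probability
        "rotation_deg": rng.uniform(-7.0, 7.0),    # small rotations (±7°)
        "scale": rng.uniform(0.9, 1.1),            # isotropic scaling
        "shift_x": rng.uniform(-0.05, 0.05) * image_width,   # up to 5% of width
        "shift_y": rng.uniform(-0.05, 0.05) * image_height,  # up to 5% of height
    }


rng = random.Random(42)
samples = [sample_augmentation(rng, 256, 256) for _ in range(1000)]
```

In a PyTorch pipeline, torchvision's `RandomHorizontalFlip` and `RandomAffine` transforms accept comparable settings (`degrees`, `translate`, `scale`); whether the study used these transforms is not stated. No intensity parameters are sampled, which is consistent with the constraint that augmentation must not alter intensity statistics.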
3.4.3. Training Workflow Pseudocode
The unified training workflow applied to all model and preprocessing combinations is summarized in Algorithm 1. This pseudocode provides a conceptual description of the optimization process.
Algorithm 1. Training Workflow for CXR Classification
Input:
    Preprocessed training images
    Preprocessed validation images
    Neural network model M
    Hyperparameters from Table 3
Output:
    Trained model parameters
Procedure:
    Initialize model M with random weights
    For each epoch in the allowed maximum number of epochs
        Set model M to training mode
        For each batch in the training dataset
            Load batch images and labels
            Perform forward pass to obtain predictions
            Compute cross entropy loss
            Compute gradients through backpropagation
            Update model parameters using the Adam optimizer
        End batch loop
        Set model M to evaluation mode
        Compute accuracy and F1 score on the validation dataset
    End epoch loop
    Return the final trained model M
Applying the same workflow to every configuration keeps the optimization conditions identical, so that performance differences can be attributed to the normalization methods.
3.5. Evaluation Metrics and Performance Formulas
3.5.1. Accuracy
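Accuracy measures the proportion of correctly classified samples among all predictions. Although widely used, accuracy may become unreliable when dealing with imbalanced medical datasets, which is frequently observed in chest X-ray classification [8]. For this reason, accuracy is reported together with more stable metrics to ensure a balanced evaluation. The formula is provided in Equation (2).
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$
where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.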
3.5.2. F1-Score
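The F1-score is the harmonic mean of precision and recall. It provides a balanced perspective by considering both false positives and false negatives. F1-score is particularly valuable in medical imaging because many datasets contain imbalanced class distributions with fewer positive cases [8]. Its definition appears in Equation (3), with the expressions for precision and recall shown in Equation (4).
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$
where
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \tag{4}$$
with $TP$, $FP$, and $FN$ the counts of true positives, false positives, and false negatives.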
In this study, the F1-score is a central evaluation metric, especially for imbalanced datasets such as CheXpert and MIMIC-CXR. To improve reliability, we report the mean and standard deviation of F1-scores obtained from multiple independent runs using different fixed random seeds.
3.5.3. Sensitivity and Specificity
Sensitivity reflects the proportion of abnormal cases correctly detected by the model. Specificity reflects the proportion of normal cases correctly identified. These measures complement accuracy and F1-score by providing a more detailed analysis of performance under class imbalance, which is crucial for clinical reliability [8]. The formulas are given in Equation (5).
$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP} \tag{5}$$
Sensitivity helps evaluate the model’s ability to detect disease, while specificity measures its capability to avoid false alarms. Together, they provide additional insight into the diagnostic behavior of the models across datasets with varying prevalence rates.
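The metrics in this section all derive from the four confusion-matrix counts. The sketch below computes them for a binary task, assuming labels are encoded as 1 (abnormal) and 0 (normal); the function name and the example labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall (sensitivity), specificity, and F1
    from binary labels, where 1 = abnormal and 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def safe_div(num, den):
        return num / den if den else 0.0  # avoid division by zero

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical labels: 4 abnormal and 6 normal cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this example TP = 3, FN = 1, FP = 1, and TN = 5, so accuracy is 0.8 while sensitivity and F1 are both 0.75. The gap shows how accuracy can look better than the positive-class metrics when normal cases dominate.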
3.6. Statistical Significance Testing
To assess whether the performance differences observed across normalization methods and model architectures were statistically meaningful, a three-stage non-parametric testing procedure was applied. This approach follows established recommendations for model comparison in medical image analysis, where performance metrics typically violate assumptions of normality [8,42].
First, the Friedman test was used to examine overall differences across all normalization–model combinations evaluated on multiple datasets. The test is appropriate for repeated-measures settings in which identical classifiers are compared across several conditions without assuming Gaussian distributions [8]. The Friedman test yielded significant results with p-values below 0.05, indicating that at least one method differed from the others.
Following this outcome, the Nemenyi post hoc test was applied to identify which pairs of normalization approaches exhibited statistically significant differences. The Nemenyi test is recommended for pairwise comparisons after Friedman analysis, particularly for experiments involving multiple algorithms evaluated under identical settings [22].
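To make the Friedman and Nemenyi steps concrete, the sketch below computes the Friedman statistic and the Nemenyi critical difference for three methods. All scores are illustrative values invented for this example and are not results from the study. The sketch assumes three treatments, for which the chi-square approximation has two degrees of freedom and the p-value reduces to exp(-Q/2); it omits the tie correction, and the constant 2.343 is the commonly tabulated Nemenyi critical value for three methods at α = 0.05.

```python
import math


def average_ranks(scores):
    """Rank scores within one block: the largest gets rank 1, ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(blocks):
    """Return the Friedman statistic Q (no tie correction) and mean ranks."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for scores in blocks:
        for j, rank in enumerate(average_ranks(scores)):
            rank_sums[j] += rank
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return q, [r / n for r in rank_sums]


# Illustrative F1-scores (NOT from the paper): one row per dataset-model block,
# columns ordered as (scaling, z-score, adaptive).
blocks = [
    (0.80, 0.82, 0.85), (0.78, 0.81, 0.84), (0.75, 0.79, 0.80),
    (0.70, 0.74, 0.73), (0.88, 0.90, 0.93), (0.83, 0.85, 0.87),
]
q_stat, mean_ranks = friedman_statistic(blocks)
p_value = math.exp(-q_stat / 2)  # chi-square survival function, valid for k = 3 only

# Nemenyi critical difference for k = 3 methods at alpha = 0.05.
n, k = len(blocks), len(blocks[0])
critical_difference = 2.343 * math.sqrt(k * (k + 1) / (6.0 * n))
scaling_vs_adaptive = abs(mean_ranks[0] - mean_ranks[2]) > critical_difference
zscore_vs_adaptive = abs(mean_ranks[1] - mean_ranks[2]) > critical_difference
```

Two methods are declared different when their mean ranks differ by more than the critical difference. In practice a statistics library would normally be used in place of this hand-written version.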
Finally, the Wilcoxon signed-rank test was employed to provide fine-grained paired comparisons of F1-scores between preprocessing methods under identical model architectures and data splits. This test is widely recommended for paired evaluations in cross-validation pipelines and is robust to the non-normal distribution of performance scores [8,59].
The results of the Wilcoxon test are summarized in Table 4.
This statistical analysis further supports the robustness of the observed performance differences, suggesting that the improvements associated with adaptive normalization are unlikely to be attributable to random variation.
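For readers unfamiliar with the Wilcoxon signed-rank test, the following sketch computes the exact two-sided p-value for a small set of paired scores by enumerating all sign assignments. The paired F1-scores are illustrative values invented for this example, not results from the study, and the implementation assumes no zero differences and no tied absolute differences.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Assumes no zero differences and no ties among absolute differences.
    Returns (statistic, p_value), where statistic = min(W+, W-).
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)

    # Under the null hypothesis every rank is equally likely to carry a
    # positive or a negative sign, so enumerate all 2**n sign patterns.
    total = n * (n + 1) // 2
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= statistic:
            extreme += 1
    return statistic, extreme / 2 ** n


# Illustrative paired F1-scores (NOT from the paper) for nine configurations.
adaptive = [0.85, 0.84, 0.80, 0.88, 0.90, 0.93, 0.87, 0.79, 0.91]
z_score_f1 = [0.84, 0.82, 0.77, 0.84, 0.85, 0.87, 0.80, 0.71, 0.82]
statistic, p_value = wilcoxon_signed_rank(adaptive, z_score_f1)
```

With nine pairs that all favor one method, the exact two-sided p-value is 2/512 ≈ 0.0039. Exhaustive enumeration is only practical for small samples; larger samples call for a library implementation or a normal approximation.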
4. Experimental Results
This section reports a comprehensive evaluation of the three normalization strategies, scaling, Z-score, and adaptive normalization, across four chest X-ray datasets and three convolutional architectures. Composite figures (Figure 6, Figure 7 and Figure 8) and reorganized performance tables (Table 5, Table 6 and Table 7) were used to summarize the validation accuracy, loss, F1-score, and interaction effects between normalization and model design.
4.1. Accuracy Analysis
Table 5 and Figure 6 summarize the validation accuracy across datasets and model architectures. Across three of the four datasets, adaptive normalization generally maintained or improved accuracy relative to scaling and Z-score normalization, particularly on ChestX-ray14, CheXpert, and Chest-Xray-Pneumonia.
MIMIC-CXR constitutes the main exception, where adaptive normalization yielded slightly lower accuracy than Z-score normalization, reflecting the dataset’s substantial acquisition and intensity heterogeneity.
Overall, the accuracy results indicate that adaptive normalization improves cross-dataset generalization when contrast characteristics are relatively stable, while its effectiveness remains dataset-dependent.
Figure 6. Each block corresponds to one dataset (ChestX-ray14, CheXpert, MIMIC-CXR, Chest X-ray Pneumonia). Bars show the mean validation accuracy over three independent runs for each model (CNN, EfficientNet-B0, MobileNetV2) and normalization strategy (Scaling, Z-score, Adaptive). “Adaptive” refers to the pipeline of CDF-based ROI cropping followed by histogram standardization using the fixed target mean and standard deviation; “Scaling” corresponds to min–max intensity scaling; and “Z-score” uses global dataset-level statistics.
4.2. Loss Analysis
Validation loss results (Table 6 and Figure 7) largely mirror the accuracy trends. In most configurations, adaptive normalization achieved lower or comparable validation loss compared with scaling and Z-score normalization, indicating more stable optimization behavior.
For MIMIC-CXR, adaptive normalization did not provide a consistent loss advantage, again highlighting the challenges posed by strong inter-institution variability.
These findings suggest that adaptive normalization generally stabilizes training dynamics, although its benefits are constrained in datasets with highly heterogeneous acquisition conditions.
Figure 7. Each block corresponds to one dataset (ChestX-ray14, CheXpert, MIMIC-CXR, Chest X-ray Pneumonia). Bars represent the mean validation loss over three runs for each combination of model (CNN, EfficientNet-B0, MobileNetV2) and normalization strategy (Scaling, Z-score, Adaptive). Lower loss values indicate better calibration stability and convergence.
4.3. F1-Score Analysis
Table 7 and Figure 8 report F1-scores across all configurations, providing a balanced evaluation under class imbalance. Adaptive normalization achieved the highest F1-scores on datasets with more consistent intensity characteristics, with the strongest gains observed on Chest-Xray-Pneumonia.
In contrast, Z-score normalization remained competitive or superior on MIMIC-CXR, where broader distributional shifts reduced the effectiveness of histogram-based standardization.
Statistical testing using the Wilcoxon signed-rank test confirmed that adaptive normalization significantly outperformed scaling and Z-score normalization in most dataset–model combinations (p < 0.01), except for MIMIC-CXR.
Each block corresponds to one dataset (ChestX-ray14, CheXpert, MIMIC-CXR, Chest X-ray Pneumonia). Bars represent the mean validation F1-score over three runs for each combination of model (CNN, EfficientNet-B0, MobileNetV2) and normalization strategy (Scaling, Z-score, Adaptive). Higher F1-scores indicate a better balance between precision and recall.
Figure 9 shows representative chest X-ray images that were misclassified under Z-score normalization but correctly classified after adaptive normalization.
Each row shows one test image that was misclassified under Z-score normalization but correctly classified after applying the proposed adaptive normalization. The left column presents the original CXR after Z-score normalization, where low contrast and background dominance contribute to incorrect predictions. The right column shows the same images after CDF-guided cropping and histogram standardization, which enhanced the visibility of lung fields and reduced non-diagnostic background variation. Ground-truth labels and model predictions from MobileNetV2 are indicated in each panel.
4.4. Ablation Study: Effect of Cropping and Histogram Standardization
To isolate the contribution of each component in the proposed adaptive normalization pipeline, we conducted an ablation study comparing four conditions: (A) Z-score, (B) cropping only, (C) histogram standardization only, and (D) the full adaptive pipeline. Results on the Chest-Xray-Pneumonia dataset using MobileNetV2 are summarized in Table 8. In this ablation, Z-score normalization serves as the baseline condition, representing input images without either adaptive component (cropping or histogram standardization). For clarity, conditions (A)–(D) correspond directly to the four rows in Table 8.
Histogram standardization produced the largest standalone improvement, increasing the F1-score from 0.85 to 0.88 by harmonizing contrast and suppressing global intensity variability. Cropping yielded a smaller but consistent benefit (0.86) by increasing the lung-to-image ratio and removing non-diagnostic background regions. The full adaptive pipeline achieved the highest performance (0.89), confirming that cropping and histogram standardization are complementary rather than redundant. These results indicate that both components contribute meaningfully to the overall gain, with histogram standardization exerting a stronger individual effect for lightweight architectures such as MobileNetV2.
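The gains quoted above can be checked with a few lines of arithmetic. The sketch below uses only the four F1-scores reported in this paragraph; the dictionary keys are shorthand labels introduced here for illustration. Because the scores are rounded to two decimals and cover a single dataset and model, the comparison is indicative only.

```python
# F1-scores as reported in the text for conditions (A)-(D).
f1_scores = {
    "A_zscore_baseline": 0.85,
    "B_cropping_only": 0.86,
    "C_histogram_only": 0.88,
    "D_full_pipeline": 0.89,
}

baseline = f1_scores["A_zscore_baseline"]

# Gain of each condition over the Z-score baseline, at reported precision.
gains = {name: round(score - baseline, 2) for name, score in f1_scores.items()}

# Compare the sum of the two standalone gains with the full-pipeline gain.
sum_of_parts = round(gains["B_cropping_only"] + gains["C_histogram_only"], 2)
full_gain = gains["D_full_pipeline"]

for name, gain in gains.items():
    print(f"{name}: {gain:+.2f}")
print(f"sum of standalone gains: {sum_of_parts:+.2f}, full pipeline: {full_gain:+.2f}")
```

At this precision the standalone gains of cropping and histogram standardization add up to the gain of the full pipeline, which is consistent with the two components being complementary rather than redundant.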
4.5. Interaction Between Architecture and Normalization
MobileNetV2 exhibited the strongest synergy with adaptive normalization, consistently outperforming or matching the other architectures on datasets with clearer and more uniform intensity characteristics such as ChestX-ray14, CheXpert, and Chest-Xray-Pneumonia. This complementarity arises from several architectural features that are particularly responsive to intensity standardization.
First, depthwise-separable convolutions are highly sensitive to fluctuations in local pixel distribution. Adaptive normalization reduces these fluctuations through region-focused cropping and histogram standardization, leading to more consistent feature activation. Second, the linear bottlenecks in MobileNetV2 benefit from reduced background variation, enabling the model to emphasize diagnostically meaningful structures within the lung fields. The adaptive cropping step further reinforces this effect by focusing the input on anatomically relevant regions.
In datasets where contrast characteristics are relatively stable, these factors collectively support smoother optimization, more stable validation curves, and improved detection of lung opacity patterns. This behavior is also clinically meaningful, as consistent contrast normalization improves the visibility of subtle radiographic abnormalities.
In contrast, for datasets with substantial acquisition variability such as MIMIC-CXR, the advantages of adaptive normalization are less apparent, reflecting the challenges posed by broad cross-institution intensity shifts. Overall, these findings highlight the importance of aligning normalization strategies with model architectures that are particularly receptive to intensity standardization, especially in medical imaging tasks affected by domain variability.
5. Discussion
The experimental results demonstrate that adaptive normalization improves or maintains validation performance relative to scaling and Z-score normalization on three of the four benchmark datasets. This finding is consistent with prior evidence highlighting the importance of intensity normalization and local contrast enhancement in medical image analysis, particularly in multi-center or multi-device settings [7,29,35,39]. By combining percentile-based ROI cropping with histogram standardization, the pipeline reduces non-diagnostic background variation while preserving diagnostically relevant lung structures, leading to more consistent model behavior under moderate domain shifts.
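For orientation, the three preprocessing strategies compared in this study can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 5th/95th percentile cut-offs, the use of cumulative row and column intensity profiles to define the ROI, the target mean of 0.5 and target standard deviation of 0.25, and all function names are hypothetical choices made here.

```python
import numpy as np


def minmax_scale(img):
    """Min-max intensity scaling of a single image to [0, 1]."""
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-8)


def zscore(img, dataset_mean, dataset_std):
    """Z-score normalization with global dataset-level statistics."""
    return (img.astype(np.float32) - dataset_mean) / (dataset_std + 1e-8)


def cdf_roi_crop(img, lower_pct=5.0, upper_pct=95.0):
    """Crop to the rows and columns holding the central share of intensity.

    Assumption: the ROI is bounded where the cumulative row/column
    intensity profile crosses the lower and upper percentiles.
    """
    img = img.astype(np.float32)
    weights = img - img.min()  # non-negative intensity weights

    def bounds(profile):
        total = profile.sum()
        if total <= 0:  # constant image: keep the full extent
            return 0, len(profile)
        cdf = np.cumsum(profile) / total
        start = int(np.searchsorted(cdf, lower_pct / 100.0))
        stop = int(np.searchsorted(cdf, upper_pct / 100.0)) + 1
        return start, min(stop, len(profile))

    r0, r1 = bounds(weights.sum(axis=1))
    c0, c1 = bounds(weights.sum(axis=0))
    return img[r0:r1, c0:c1]


def histogram_standardize(img, target_mean=0.5, target_std=0.25):
    """Map intensities to a fixed target mean and standard deviation."""
    img = img.astype(np.float32)
    return (img - img.mean()) / (img.std() + 1e-8) * target_std + target_mean


def adaptive_normalize(img):
    """ROI cropping followed by histogram standardization."""
    return histogram_standardize(cdf_roi_crop(img))


# Synthetic example: dark border around a brighter central region.
rng = np.random.default_rng(0)
demo = np.zeros((64, 64), dtype=np.float32)
demo[16:48, 16:48] = rng.uniform(100, 200, size=(32, 32))
out = adaptive_normalize(demo)
print(demo.shape, "->", out.shape)
```

In this toy image the crop discards the dark border before standardization, so the target statistics are computed over the bright region only. That is the intuition behind reducing non-diagnostic background variation, although real radiographs would need resizing to the network input size after cropping.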
The contrast between the substantial gains observed on ChestX-ray14, CheXpert, and Chest-Xray-Pneumonia and the limited improvements on MIMIC-CXR highlights an important limitation. MIMIC-CXR exhibited pronounced variability in acquisition settings and device characteristics [17,18], weakening the assumptions required for effective histogram-based intensity standardization. Similar observations have been reported in multi-site MRI and radiomics studies, where no single normalization technique consistently dominates under extreme inter-site heterogeneity [7,29,35]. These findings indicate that normalization strategies should be selected according to the expected degree of domain variability rather than applied uniformly across all deployment scenarios. In addition, because pediatric thoraces are proportionally smaller, fixed percentile thresholds may remove relatively larger apical regions, representing a potential source of anatomical bias when adult-derived cropping parameters are applied to pediatric CXRs. This study did not include a dedicated pediatric subgroup analysis; therefore, future work should explicitly validate and, if necessary, re-tune cropping parameters for pediatric populations.
From an architectural perspective, MobileNetV2 combined with adaptive normalization yielded the most reliable performance on datasets with relatively uniform intensity characteristics. This behavior is consistent with the known sensitivity of depthwise-separable convolutions and linear bottlenecks to input distribution consistency [55]. In lightweight architectures, suppressing irrelevant background variation allows representational capacity to focus on subtle parenchymal abnormalities, which is critical for detecting thoracic diseases such as pneumonia and COVID-19 in resource-limited settings [19,20,34]. These results support the view that preprocessing pipelines and model architectures should be co-designed to achieve optimal generalization.
The evaluation framework employed in this study follows established recommendations for rigorous assessment in medical imaging. Reporting F1-scores alongside complementary metrics provides a more reliable evaluation under class imbalance [60]. Statistical significance was assessed using non-parametric tests, including the Friedman–Nemenyi and Wilcoxon signed-rank tests, which are widely endorsed for classifier comparison across multiple datasets [61]. The resulting p-values indicate that the improvements achieved by adaptive normalization were both numerically and statistically meaningful in most scenarios.
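A comparison of this kind can be sketched with SciPy as shown below. The score lists are placeholder values invented for illustration, one per hypothetical dataset–model combination, and are not results from this study. The Friedman test is used as an omnibus check across the three strategies, followed by a paired Wilcoxon signed-rank test; the Nemenyi post hoc step is omitted from this sketch.

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Placeholder F1-scores (NOT from the paper): one value per
# dataset-model combination, paired across the three strategies.
scaling = [0.801, 0.823, 0.784, 0.852, 0.815, 0.796,
           0.833, 0.847, 0.772, 0.861, 0.808, 0.829]
z_score = [0.812, 0.830, 0.795, 0.858, 0.821, 0.809,
           0.841, 0.850, 0.786, 0.866, 0.817, 0.838]
adaptive = [0.822, 0.851, 0.810, 0.890, 0.839, 0.834,
            0.853, 0.879, 0.822, 0.880, 0.840, 0.865]

# Omnibus test: do the three strategies differ in their rankings?
friedman_stat, friedman_p = friedmanchisquare(scaling, z_score, adaptive)

# Paired follow-up: adaptive versus Z-score on the same combinations.
wilcoxon_stat, wilcoxon_p = wilcoxon(adaptive, z_score)

print(f"Friedman: statistic={friedman_stat:.2f}, p={friedman_p:.2g}")
print(f"Wilcoxon (adaptive vs. Z-score): statistic={wilcoxon_stat:.2f}, p={wilcoxon_p:.2g}")
```

Both tests operate on paired observations, so each list position must refer to the same dataset–model combination. When several pairwise comparisons are run, a multiple-comparison correction would normally be applied to the resulting p-values.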
Two primary clinical implications emerge from these findings. First, improved robustness under moderate domain shift supports the deployment of CXR-based decision-support systems across institutions with heterogeneous acquisition pipelines, complementing recent advances in efficient and multi-label CXR modeling [6,22,26,45]. Second, the computational efficiency of MobileNetV2, even with the additional preprocessing step, enables near-real-time inference suitable for triage or screening workflows in constrained healthcare environments [34,41].
Adaptive normalization introduced a modest computational overhead. Empirically, the pipeline increased training time by approximately 5.2 ms per batch (+6.3%), with 3.5 ms attributed to CDF-based ROI cropping and 1.7 ms to histogram standardization. This overhead is negligible for offline training and does not affect inference latency because normalization is applied once during preprocessing. Although acceptable in the present setting, such overhead may become relevant for large-scale pretraining or continuous-learning pipelines. Moreover, this study evaluated only four publicly available datasets; performance on unseen clinical cohorts from new scanners, populations, or institutions remains to be systematically assessed. No clinician-in-the-loop assessment or external prospective validation was conducted; therefore, conclusions are limited to retrospective evaluations on publicly available benchmarks.
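The overhead figures above can be cross-checked with simple arithmetic. In the sketch below, the per-component timings and the relative increase are taken from the text; the baseline batch time is not reported and is only implied by those two numbers, so it should be read as an estimate.

```python
# Values reported in the text.
crop_ms = 3.5             # CDF-based ROI cropping, per batch
hist_ms = 1.7             # histogram standardization, per batch
relative_increase = 0.063  # +6.3% training time per batch

total_overhead_ms = crop_ms + hist_ms

# Implied (not reported) baseline time per batch without the pipeline.
implied_baseline_ms = total_overhead_ms / relative_increase

# Share of the overhead attributable to each component.
crop_share = crop_ms / total_overhead_ms
hist_share = hist_ms / total_overhead_ms

print(f"total overhead: {total_overhead_ms:.1f} ms per batch")
print(f"implied baseline: {implied_baseline_ms:.1f} ms per batch")
print(f"cropping share: {crop_share:.0%}, standardization share: {hist_share:.0%}")
```

The two component timings sum to the stated 5.2 ms, and roughly two thirds of the overhead comes from the cropping step.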
These limitations motivate several future research directions. Integrating adaptive normalization into federated or source-free domain adaptation frameworks may further improve cross-site robustness while preserving data privacy [14,15]. In addition, combining the pipeline with self-supervised or contrastive pretraining on large-scale unlabeled radiograph repositories could enhance sample efficiency and downstream performance [21,57,58]. Finally, explainability techniques such as Grad-CAM may help elucidate how normalization reshapes salient features, supporting interpretability and clinical adoption [13,56].
Overall, our cross-dataset evaluation reinforces that preprocessing is a critical component of clinically deployable deep learning pipelines. When aligned with dataset characteristics and model architecture, adaptive normalization provides a practical approach for improving generalization in chest X-ray classification while maintaining computational efficiency.
6. Conclusions
This study examined how three preprocessing strategies (min–max scaling, Z-score normalization, and a proposed adaptive normalization pipeline) affect the generalization performance of deep learning models for chest X-ray classification across four benchmark datasets and three convolutional architectures. The findings show that adaptive normalization improves or maintains validation accuracy, loss, and F1-score relative to conventional methods on three of the four datasets, with the strongest performance gains observed when combined with the MobileNetV2 architecture. These improvements reflect more consistent localization of the lung fields and better standardization of intensity distributions, which collectively contribute to more stable optimization and reduced overfitting.
The limited improvements observed on MIMIC-CXR highlight that the effectiveness of normalization depends strongly on the underlying acquisition heterogeneity. In datasets characterized by substantial cross-institution variation, global histogram-based adjustments alone are insufficient, underscoring the need for normalization strategies that explicitly account for broad domain shifts.
The contributions of this study are threefold. First, it provides a cross-dataset, cross-architecture benchmarking framework for systematically evaluating normalization strategies in chest X-ray classification. Second, it offers a statistically grounded comparison using non-parametric tests across multiple metrics, demonstrating that the performance differences are both numerically and statistically meaningful. Third, it demonstrates that an efficient adaptive normalization pipeline integrates well with lightweight architectures such as MobileNetV2, making it suitable for deployment in resource-constrained clinical environments.
Future work will extend this pipeline to more diverse real-world cohorts and explore its integration with federated and source-free domain adaptation frameworks, as well as self-supervised pretraining on large unlabeled CXR collections. Another important direction is to combine adaptive normalization with explainable AI techniques to visualize how preprocessing influences salient regions, thereby facilitating clinical validation and trustworthy deployment of deep learning models in radiographic diagnostics.
References
- 1. Oh Y., Park S., Ye J.C. Deep Learning COVID-19 Features on CXR Using Limited Training Data Sets. IEEE Trans. Med. Imaging 2020, 39, 2688–2700. doi:10.1109/TMI.2020.2993291. PMID: 32396075.
- 2. Padmavathi V., Ganesan K. Metaheuristic Optimizers Integrated with Vision Transformer Model for Severity Detection and Classification via Multimodal COVID-19 Images. Sci. Rep. 2025, 15, 13941. doi:10.1038/s41598-025-98593-w. PMID: 40263404. PMCID: PMC12015488.
- 3. Aksoy B., Salman O.K.M. Detection of COVID-19 Disease in Chest X-Ray Images with Capsul Networks: Application with Cloud Computing. J. Exp. Theor. Artif. Intell. 2021, 33, 527–541. doi:10.1080/0952813X.2021.1908431.
- 4. Khan A., Khan S.H., Saif M., Batool A., Sohail A., Khan M.W. A Survey of Deep Learning Techniques for the Analysis of COVID-19 and Their Usability for Detecting Omicron. J. Exp. Theor. Artif. Intell. 2024, 36, 1779–1821. doi:10.1080/0952813X.2023.2165724.
- 5. Marikkar U., Atito S., Awais M., Mahdi A. LT-ViT: A Vision Transformer for Multi-Label Chest X-Ray Classification. Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 2565–2569.
- 6. Rajpurkar P., Irvin J., Zhu K., Yang B., Mehta H., Duan T., Ding D., Bagul A., Ball R.L., Langlotz C. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. arXiv 2017, arXiv:1711.05225. doi:10.48550/arXiv.1711.05225.
- 7. Demircioğlu A. The Effect of Feature Normalization Methods in Radiomics. Insights Imaging 2024, 15, 2. doi:10.1186/s13244-023-01575-7. PMID: 38185786. PMCID: PMC10772134.
- 8. Rayed M.E., Islam S.M.S., Niha S.I., Jim J.R., Kabir M.M., Mridha M.F. Deep Learning for Medical Image Segmentation: State-of-the-Art Advancements and Challenges. Inform. Med. Unlocked 2024, 47, 101504. doi:10.1016/j.imu.2024.101504.
