A Priori Prediction of Neoadjuvant Chemotherapy Response in Breast Cancer Using Deep Features from Pre-Treatment MRI and CT

Deok Hyun Jang; Laurentius O. Osapoetra; Lakshmanan Sannachi; Belinda Curpen; Ana Pejović-Milić; Gregory J. Czarnota

PMC · DOI:10.3390/cancers17203394·October 21, 2025

A Priori Prediction of Neoadjuvant Chemotherapy Response in Breast Cancer Using Deep Features from Pre-Treatment MRI and CT

Deok Hyun Jang, Laurentius O. Osapoetra, Lakshmanan Sannachi, Belinda Curpen, Ana Pejović-Milić, Gregory J. Czarnota

PDF

Open Access

TL;DR

This study uses deep learning on MRI and CT scans to predict how breast cancer patients will respond to chemotherapy before treatment begins, potentially allowing for more personalized treatment plans.

Contribution

The novel use of deep learning features from pre-treatment MRI and CT scans to predict neoadjuvant chemotherapy response in breast cancer.

Findings

01

ResNet34 deep features outperformed handcrafted radiomic features in predicting pathologic complete response and treatment response.

02

Deep learning-based models achieved balanced accuracies of 81.6% and 73.5% for predicting pCR and responders, respectively.

03

Combining deep features with clinical data improved prediction of chemotherapy response in breast cancer.

Abstract

Early identification of breast cancer patients who are unlikely to respond to neoadjuvant chemotherapy (NAC) is critical for potentially guiding alternative therapeutic strategies. In this study, routinely acquired pre-treatment MRI and CT scans were analyzed using deep learning-based features in combination with clinical information. Deep features were extracted from intratumoral and peritumoral regions using ResNet architectures pre-trained on large-scale medical imaging datasets. Among the models tested, ResNet34 demonstrated the best performance, exceeding both handcrafted radiomic models and other ResNet backbones. These findings suggest that deep features extracted from standard-of-care imaging can complement established clinical predictors and may facilitate more personalized treatment planning in breast cancer. Background: Response to neoadjuvant chemotherapy (NAC) is a key…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Diseases2

breast cancer tumor

Figures5

Click any figure to enlarge with its caption.

Funding2

—Natural Sciences and Engineering Research Council of Canada
—Terry Fox Research Institute

Keywords

radiomicsMRICTbreast cancerneoadjuvant chemotherapyresponse predictionmachine learningdeep learning

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsRadiomics and Machine Learning in Medical Imaging · MRI in cancer diagnosis · AI in cancer detection

Full text

1. Introduction

Breast cancer is a highly heterogeneous disease characterized by significant genetic, phenotypic, and microenvironmental diversity at both the intertumoral and intratumoral levels [1]. This heterogeneity leads to variability in treatment efficacy and patient prognosis. Consequently, breast cancer management employs a multimodal approach that integrates locoregional treatments, such as surgery and radiation therapy, with systemic therapies including chemotherapy, targeted therapy, immunotherapy, and endocrine therapy [2,3]. In the treatment of locally advanced breast cancer (LABC) and high-risk early breast cancer (EBC), chemotherapy is typically administered in the neoadjuvant setting before surgery [4,5]. The aim of neoadjuvant chemotherapy (NAC) is to downstage the tumor, improve operability, and eradicate micrometastases.

Response to NAC refers to tumor reduction following therapy and serves as an indicator of therapeutic efficacy. In particular, pathological complete response (pCR), defined as the complete clearance of invasive carcinoma in the breast and lymph nodes, is associated with improved survival outcomes [6,7]. This effect is more pronounced in human epidermal growth factor receptor 2 (HER2)-enriched and triple-negative breast cancers [8]. Due to its clinical significance, pCR is recognized by the American and European regulatory agencies as a surrogate endpoint for long-term outcomes in randomized clinical trials [9,10]. Nonetheless, the likelihood of achieving pCR is relatively low, as a meta-analysis of 52 studies involving 27,895 patients reported a pCR rate of 21.1% [8]. The majority of patients receiving NAC achieve a partial response, defined as tumor shrinkage with residual invasive disease, which corresponds to intermediate survival outcomes between pCR and stable or progressive disease [11,12].

Given the prognostic significance of NAC response, the early identification of less responsive patients could enable timely treatment modification to improve the likelihood of response and survival outcomes [13]. However, the gold standard for assessing response remains histopathological examination of the surgical specimen, which is only available after the completion of NAC and subsequent surgery. Imaging modalities such as magnetic resonance imaging (MRI) and computed tomography (CT) acquired during NAC may provide insights into tumor response to NAC at an early stage. However, such interim imaging is not widely established as a standard protocol in breast cancer management [2]. As a result, pre-treatment imaging has been increasingly investigated for response prediction using radiomic analysis. Radiomics involves the extraction of high-dimensional quantitative features from medical images, capturing patho-physiological information not discernible by visual assessment [14,15]. In breast cancer, radiomics can effectively quantify intratumoral heterogeneity. The utility of radiomics for predicting breast cancer response has been investigated using various imaging modalities, such as quantitative ultrasound [16,17], CT [18,19,20,21], and MRI [22,23,24].

Conventionally, radiomic analysis has relied on predefined mathematical descriptors, often referred to as handcrafted features, which are subsequently used in machine learning classification. Deep learning offers an alternative that integrates the learning of layered image representations directly from imaging data with classification in a single framework, while eliminating the need for explicit radiomic feature extraction and feature selection [25,26,27]. However, the effectiveness of end-to-end deep learning models is often constrained by the limited size of annotated medical imaging datasets, which can lead to overfitting and poor generalizability [28]. To address this limitation, deep feature extraction has been explored as a hybrid strategy that utilizes intermediate activations from deep networks as input features for traditional machine learning classifiers [27,29]. Transfer learning, which adapts models pretrained on large datasets to new tasks with limited data, can further improve this approach by providing more robust feature representations. In particular, pretrained weights derived from large-scale medical imaging datasets such as MedicalNet [30] have the potential to enhance the stability and generalizability of the extracted deep features. These approaches leverage the ability of convolutional neural network (CNN) to capture complex image patterns that may correspond to underlying biological heterogeneity not represented by handcrafted radiomic features, while avoiding the substantial data requirements and heavy computational burden of fully trainable deep learning models. Therefore, deep feature extraction offers a balanced approach that combines the strengths of both radiomics and deep learning and may provide improved predictive performance in breast cancer response modeling.

This study aimed to predict breast cancer response to NAC using a deep feature extraction framework with transfer learning. Two different binary classifications of response were assessed: pCR versus non-pCR, and clinical response versus non-response. In this context, clinical response was defined to include both complete and partial responders. The analysis utilized routinely acquired pre-treatment scans obtained as part of standard diagnostic work-up, including contrast-enhanced T1-weighted (CE-T1) and T2-weighted (T2) MRI sequences and contrast-enhanced CT images. Intratumoral and peritumoral segmentations were analyzed to capture both tumor-intrinsic and surrounding microenvironmental characteristics relevant to therapy response. Deep feature extraction was implemented using CNNs with transfer learning from pre-trained medical imaging models, and the resulting features were subsequently applied to machine learning classifiers. This approach provides a balance between the interpretability of handcrafted radiomics and the representational capacity of deep learning, while reducing dependency on large, annotated cohorts. Ultimately, incorporating deep features from routine imaging may improve prediction of NAC response and, in turn, support future individualized treatment strategies in breast cancer management.

2. Materials and Methods

2.1. Patient Selection

This study was conducted at Sunnybrook Health Sciences Centre, Toronto, Canada, with institutional research ethics board approval. Patients with biopsy-confirmed breast cancer diagnosed between 2013 and 2019 were reviewed. Eligible cases for imaging data analysis included individuals with LABC or high-risk early-stage disease who underwent standard anthracycline- and taxane-based NAC followed by surgery. Patients were treated with one of two standard NAC regimens. In the first regimen, three cycles of fluorouracil, epirubicin, and cyclophosphamide were administered, followed by three cycles of docetaxel given at three-week intervals. In the alternative dose-dense protocol, four cycles of doxorubicin and cyclophosphamide were administered, followed by four cycles of paclitaxel every two weeks. Patients with HER2-enriched disease were additionally treated with trastuzumab as targeted therapy. Availability of biopsy results, pre-treatment MRI and CT images, and surgical pathology after NAC was required for inclusion in the study. Patients were excluded if images were of insufficient quality, contained artifacts, or had incomplete coverage of the primary tumor. Cases with missing biopsy or surgical pathology reports or with breast implants were also excluded. All patient information was anonymized prior to analysis.

2.2. Clinical Variables

Clinical characteristics were included in the modeling to account for established prognostic and predictive factors in breast cancer. Estrogen receptor (ER), progesterone receptor (PR), and HER2 status were determined from biopsy by immunohistochemistry (IHC) and incorporated as binary variables. Additional variables consisted of patient age at diagnosis, histological grade (categorized as G1, G2, or G3), primary tumor size in millimeters measured on pre-treatment MRI, and clinical nodal status (categorized as N0 to N3 according to AJCC staging). Altogether, seven clinical features were integrated with imaging-derived features for predictive analysis. Age and tumor size were continuous variables, and the remaining features were categorical.

2.3. Response Assessment

Two definitions of response were evaluated as treatment outcomes following NAC. In criterion 1, the response was defined as pCR based on histopathological examination of the surgical specimen, regardless of the presence of ductal carcinoma in situ (DCIS). Therefore, criterion 1 distinguished between pCR and non-pCR, with the latter encompassing both partial response and non-response. In criterion 2, the response was defined as a reduction in tumor size of 30% or greater, based on a modified grading system [31] adapted from the Response Evaluation Criteria in Solid Tumors (RECIST) guidelines [32]. The initial tumor size was measured as the longest diameter on pre-treatment MRI, and post-treatment tumor size was obtained from surgical pathology. This indirect comparison was necessary because post-treatment MRI was not routinely acquired at the institution. In this definition, the response included both complete and partial response, whereas the non-response included stable or progressive disease. Additionally, residual tumor cellularity of less than 1% was also interpreted as an indicator of treatment response.

2.4. Image Acquisition and Segmentation

Pre-treatment MRI and CT scans were acquired as part of the standard of care. MRI scans were obtained from either a 1.5 T Signa system (GE Healthcare, Chicago, IL, USA) or a 1.5 T Aera system (Siemens Healthcare, Erlangen, Germany), both equipped with an 8-channel breast coil. The sequences analyzed included T2 and CE-T1 images. CT scans were performed with BrightSpeed or LightSpeed scanners (GE Healthcare, Chicago, IL, USA) using a chest/abdomen/pelvis protocol with contrast enhancement. Detailed acquisition parameters for MRI and CT are provided in Supplementary Tables S1 and S2. As per study inclusion criteria, all patients had CE-T1, T2, and CT images available for analysis.

Tumor segmentation was carried out manually using 3D Slicer (v5.2.1), an open-source software for medical image analysis. The breast tumor was contoured slice by slice for volumetric intratumoral segmentation in the sagittal plane for CE-T1 images and in the axial plane for CT images. T2 images were co-registered to CE-T1 by default, and the same intratumoral contours were applied. To characterize the surrounding tumor environment, peritumoral segmentation was generated by isotropically expanding the tumor boundary by 5 mm [23,33]. Portions of the peritumoral areas extending into the chest wall or skin were excluded to confine the analysis to breast tissue. All segmentations were reviewed independently by a radiologist and a radiation oncologist for validation. Examples of intratumoral and peritumoral segmentations on CE-T1, T2, and CT are presented in Figure 1.

2.5. Radiomic Feature Extraction

Preprocessing steps were applied prior to radiomic feature extraction to reduce acquisition-related variability, standardize image quality, and enhance feature reproducibility and stability. Different preprocessing strategies were employed for MRI and CT because MRI signal intensities are inherently non-standard and relative, while CT voxel values are quantified in Hounsfield Units (HU), a standardized physical scale. For MRI data (T2 and CE-T1), N4 bias field correction was performed to reduce intensity inhomogeneity, followed by Z-score normalization to minimize inter-scan variability, and gray level discretization with a fixed bin count of 128. For CT images, gray level discretization with a fixed bin width of 2 was applied without normalization, in order to preserve the physical meaning of HU. In addition, all images were resampled to an isotropic voxel spacing of 1 mm × 1 mm × 1 mm.

Radiomic features were extracted using the PyRadiomics package (v3.0.1) implemented in Python (v3.8.8) [34]. From each intratumoral and peritumoral region of each image, 14 shape features, 18 first-order features, and 75 second-order features were extracted. Shape features characterized the structural and volumetric properties of the ROIs [35]. First-order features describe the statistical distribution of voxel intensities within the segmented region [35]. Second-order features, also known as texture features, captured relationships between neighboring voxels and patterns of intensity variation [35]. The texture feature set comprised 24 Gray Level Co-occurrence Matrix (GLCM) features, 16 Gray Level Run Length Matrix (GLRLM) features, 16 Gray Level Size Zone Matrix (GLSZM) features, 14 Gray Level Dependence Matrix (GLDM) features, and 5 Neighboring Gray Tone Difference Matrix (NGTDM) features. From the two MRI sequences, a total of 400 features were extracted, including 28 shape, 72 first-order, and 300 texture features across intratumoral and peritumoral regions. It should be noted that shape features are independent of voxel intensity and were therefore extracted only once per segmentation rather than separately for each MRI sequence. From CT, 214 radiomic features were extracted, comprising 28 shape, 36 first-order, and 150 texture features across intratumoral and peritumoral regions. In total, 614 handcrafted features were extracted per patient. A full list of radiomic features is provided in Supplementary Table S3.

2.6. Deep Feature Extraction

Unsupervised deep feature extraction was implemented using a transfer learning approach with MedicalNet, which provides pre-trained weights for 3D ResNet architectures trained on large-scale medical imaging datasets [30]. For each patient, MR and CT images were pre-processed through isotropic resampling, segmentation-based cropping, and intensity normalization. Cropped volumes were subsequently resized to 64 × 64 × 64 voxels to standardize patch size. Four pre-trained MedicalNet backbones were tested, corresponding to ResNet10, ResNet18, ResNet34, and ResNet50 architectures, where the numerical designation indicates the total number of layers in the network. The processed images were passed through the ResNet architectures initialized with their corresponding MedicalNet pre-trained weights, and the global average pooling output of the final convolutional block was retained as the deep feature vector. For each image and segmentation, ResNet10, ResNet18, and ResNet34 yielded 512 features, whereas ResNet50 yielded 2048 features. The feature numbers correspond to the number of channels in the final convolutional block of each architecture. As features were derived separately for intratumoral and peritumoral segmentations across CE-T1, T2 and CT images, six region-specific feature sets were generated per patient. Therefore, a total of 3072 features were extracted from ResNet10, ResNet18 and ResNet34, while a total of 12,288 features were extracted from ResNet50.

2.7. Machine Learning

Five models were developed using handcrafted features and deep features from four ResNet architectures, each combined with the seven clinical features. As ground-truth labels, non-pCR and non-response were assigned as negative classes, while pCR and response were assigned as positive classes. Model training and performance evaluation were carried out using a repeated partitioning approach consisting of ten independent stratified splits that conserves the proportion between the majority and minority classes. In each partition, 80% of the patients were randomly assigned for training and the remaining 20% were assigned for hold-out testing. All preprocessing, feature selection, and model optimization steps were performed only within the training data to prevent information leakage, whereas performance evaluation was carried out on the corresponding hold-out test set. Performance was quantified using balanced accuracy, sensitivity, specificity, precision, F1-score, and the area under the receiver operating characteristic curve (AUC). The average and standard deviation of these metrics across the ten partitions were reported to provide a robust estimate of generalization performance.

Two preprocessing steps were implemented before feature selection. Firstly, ComBat harmonization was applied independently within each imaging modality to address scanner-related heterogeneity. ComBat harmonization models scanner or protocol effects as additive bias and multiplicative scaling in feature distributions. It estimates these parameters for each scanner using an empirical Bayesian model and then rescales the features so that all scanners share a unified statistical distribution while retaining inter-patient biological variability. This framework has been validated in both phantom [36] and clinical imaging [37,38] studies, demonstrating that ComBat reduces scanner-specific noise, enhances reproducibility, and improves classification performance. Subsequently, feature values were standardized using robust scaling to reduce the influence of outliers. A three-stage feature reduction strategy was then applied to reduce dimensionality and mitigate overfitting. In the first stage, the Minimum Redundancy Maximum Relevancy (mRMR) algorithm was applied separately to MRI features (CE-T1 and T2 combined) and to CT features, with each subset limited to 10% of the cohort size. In the second stage, the selected MRI and CT features were concatenated with clinical variables, and another round of mRMR was conducted, again feature numbers were limited to 10% of the cohort size. Lastly, recursive feature elimination (RFE) with an Extreme Gradient Boosting (XGBoost) classifier (v3.0.4) was employed to further refine the subset of predictors. RFE was implemented with five-fold cross-validation within the training set and performed independently within each partition. The significance of selected features was evaluated by recording the frequency with which they appeared across the ten iterations. The final classification models were trained using XGBoost, a gradient-boosted tree algorithm widely used in medical imaging research due to its accuracy and efficiency. Hyperparameters were optimized through grid search with five-fold cross-validation in the training data. the Synthetic Minority Over-sampling Technique for Nominal and Continuous features (SMOTENC) was applied within each training fold after partitioning, ensuring that synthetic samples were generated exclusively from the training data. This method generates synthetic samples of the minority class by interpolating between existing cases, thereby balancing the class distribution while preserving data characteristics [39]. The hyperparameter settings are detailed in Supplementary Table S4.

2.8. Statistical Analysis

Statistical analyses were conducted to compare the distributions of clinical variables and selected radiomic features between responsive and non-responsive groups under the two response criteria. Continuous variables were analyzed using an independent-samples t-test or a Mann–Whitney U test, based on the normality determined by the Shapiro–Wilk test. Categorical variables were analyzed using Fisher’s exact test. Differences in model performance between the best-performing model and the other models were evaluated using paired two-tailed t-tests. Across all analyses, a p-value less than 0.05 was considered statistically significant.

3. Results

3.1. Patient Characteristics

Based on eligibility, 177 patients were included in the study. For MRI, 153 patients were imaged using a GE Signa scanner, while 24 patients were imaged with a Siemens Aera scanner. For CT imaging, 127 patients were scanned with a GE Lightspeed system and 50 patients with a GE Brightspeed system. According to Fisher’s exact test, scanner type was not significantly associated with treatment response under either criterion. The potential impact of scanner-related variability on the extracted features was mitigated through ComBat harmonization, which standardized feature distributions across scanners while preserving biological variability.

The distributions of clinical variables between responsive and non-responsive groups are summarized in Table 1 and Table 2 for criterion 1 and criterion 2, respectively. For citerion 1, 37 patients (20.9%) and 140 patients (79.1%) were categorized as pCR and non-pCR, respectively. For criterion 2, the cohort consisted of 124 responders (70.0%) and 53 non-responders (30.0%). Statistical analysis demonstrated that the receptor status (ER, PR and HER2) and histological grade differed significantly between response groups in both criteria. In addition, under criterion 1, nodal status was significant, and tumor size demonstrated a trend toward significance (p = 0.073). Under criterion 2, age approached significance (p = 0.064).

3.2. Classification Results

The classification performance of the five models is shown in Figure 2 and Figure 3 for criterion 1 and criterion 2, respectively. Detailed results with standard deviations are provided in Supplementary Tables S5 and S6. Across both criteria, deep features extracted from ResNet34 yielded the strongest overall performance. Under criterion 1, the ResNet34 model achieved a balanced accuracy of 81.6%, precision of 94.3%, sensitivity of 80.7%, specificity of 82.5%, F1-score of 0.868, and an AUC of 0.871. Its balanced accuracy, precision, and specificity were significantly higher than those of the handcrafted, ResNet10, and ResNet50 models, while its AUC was significantly higher than those of the ResNet10, ResNet18, and ResNet50 models. Under criterion 2, the ResNet34 model achieved a balanced accuracy of 73.5%, precision of 57.9%, sensitivity of 70.9%, specificity of 76%, F1-score of 0.630, and an AUC of 0.762. For this criterion, the ResNet34 model achieved significantly higher balanced accuracy and F1-score compared with all other models. Its precision was significantly higher than that of the ResNet50 model, its recall was significantly higher than those of the ResNet10, ResNet18, and ResNet50 models, and its AUC was significantly higher than those of the ResNet18 and ResNet50 models. Results of paired two-tailed tests comparing ResNet34 with the other models are provided in Supplementary Tables S7 and S8.

3.3. Features Selected

As 10 independent data partitions were used during training, this produced 10 distinct sets of selected features. Therefore, the significance of each feature was inferred from its frequency of selection in the ResNet34 model, which provided the best overall classification performance. Table 3 and Table 4 list the features that were selected in at least five partitions for criterion 1 and criterion 2, respectively.

Since 177 patients were included in the study, 18 features, representing approximately 10% of the cohort size, were initially selected by mRMR, and subsequent RFE could further reduce the number of selected features to limit dimensionality and mitigate the risk of overfitting. Under criterion 1, 67 unique features were selected across 10 partitions, with an average of 15.9 per partition. Of these, 37 appeared in only one partition, whereas 11 recurred in at least five partitions. The recurrent set consisted of five clinical features and six deep features, including one derived from CE-T1, three from T2, and two from CT. Among six deep features, three originated from intratumoral regions and three from peritumoral regions. Under criterion 2, 73 unique features were selected across 10 partitions, with an average of 15.6 per partition. Of these, 44 appeared in a single partition, while nine recurred in at least five partitions. This group consisted of four clinical variables and five deep features, including one from CE-T1, two from T2, and two from CT. Among the five deep features, two were derived from intratumoral regions and three from peritumoral regions.

Boxplots illustrating the distribution of frequently selected deep features across response groups for criterion 1 and criterion 2 are shown in Figure 4 and Figure 5, respectively. Depending on the normality of each feature, either an independent samples t-test or a Mann–Whitney U test was applied to compare groups. For criterion 1, MRI_T2_Peri_0271, MRI_T2_Peri_0053, and MRI_T2_Intra_0110 exhibited significant differences between the pCR and non-pCR groups. For criterion 2, MRI_T2_Peri_0271 and MRI_T2_Intra_0271 differed significantly between responders and non-responders. Across both criteria, lower values of MRI_T2_Peri_0271 were consistently associated with response.

4. Discussion

In this study, a deep feature extraction framework was applied to pre-treatment MRI and CT scans, together with clinical information, to model response to neoadjuvant chemotherapy in breast cancer. Rather than relying solely on handcrafted radiomic features, this approach used convolutional neural networks pre-trained on large-scale medical imaging datasets to capture complex representations from both intratumoral and peritumoral regions. Two clinically relevant definitions of response were examined: one distinguishing pCR from non-pCR, and the other differentiating complete or partial response from stable or progressive disease. By incorporating multimodal and multiregional imaging features with established clinical predictors, the analysis demonstrates the potential of deep learning-based features to complement conventional radiomics and provide additional value for treatment response assessment.

Statistical analysis of the clinical variables indicated that histologic grade and receptor status were significantly associated with response in both criteria; tumors with high histologic grade, ER/PR negativity, and HER2 positivity were more likely to respond to NAC. This finding is consistent with the clinical understanding that more aggressive disease, characterized by these variables, is more responsive to anthracycline- and taxane-based regimens that preferentially target highly proliferative cells. Furthermore, the availability of subtype-specific treatments, such as targeted therapy for HER2-enriched breast cancer and immunotherapy for triple-negative breast cancer, enhances tumor response. In addition to these factors, nodal status was significantly associated with pCR, consistent with its definition requiring absence of nodal disease. Although neither age nor initial tumor size reached statistical significance, tumor size (p = 0.073) and age (p = 0.064) showed trend towards statistical significance for criterion 1 and criterion 2, respectively.

In this study, handcrafted radiomic features and deep features extracted from ResNet10, ResNet18, ResNet34, and ResNet50 were evaluated. In addition, clinical features were included in all feature sets for analysis, as previous MRI [40] and CT [41] studies demonstrated the utility of combining clinical and radiomic features to improve classification performance. Among the five models, the ResNet34 model achieved the best overall classification performance in both criteria. Most notably, the ResNet34 model outperformed the handcrafted model with significantly higher balanced accuracy, precision, and specificity under criterion 1. For criterion 2, the ResNet34 model achieved significantly higher balanced accuracy and F1-score. No significant differences were observed in the remaining performance metrics. This performance advantage demonstrated the utility of deep features for predicting treatment response. Deep features are quantitative image representations from convolutional neural networks pre-trained on large imaging datasets, with shallow layers capturing basic textures and deeper layers encoding higher-order structural patterns. Therefore, deep features may capture multi-scale patterns beyond the predefined mathematical definitions of handcrafted features, providing more comprehensive imaging representations that translate into superior classification performance. It is also important to note that other deep feature models involving ResNet10, ResNet18, and ResNet50 did not demonstrate superiority to the handcrafted feature model. ResNet10 and ResNet18 are shallow networks with 10 and 18 layers, each producing 512-dimensional feature vectors. ResNet34 increases depth to 34 layers while retaining 512-dimensional feature vectors, enabling richer hierarchical representations. Even though the number of layers differs among the three architectures, the number of final features remains the same because they share an identical overall design, with the last stage fixed at 512 channels prior to global average pooling. ResNet50 extends to 50 layers and generates 2048-dimensional feature vectors, greatly expanding the feature space. It is postulated that the weaker performance of ResNet10 and ResNet18 likely reflects their limited depth, which restricts extraction of complex multi-scale patterns and yields less discriminative features. In contrast, the high-dimensional representations from ResNet50 may introduce redundancy and overfitting, even after rigorous feature selection, particularly given the segmentation- and image-dependent extraction that expands the total feature dimension to 12,288. ResNet34 provides a balance, offering sufficient depth for higher-order abstractions while maintaining a manageable feature space, leading to more stable and discriminative representations.

The frequency of feature selection across ten partitions of training and test sets was used as an indicator of feature importance, and features selected at least five times were considered more significant. The clinical relevance of receptor status and histologic grade was highlighted again, as these variables were consistently selected in both criteria, with the exception of PR under criterion 2. In criterion 1, PR and HER2 status were both selected ten times, while ER status and histologic grade were selected seven times. In criterion 2, HER2 status and histologic grade were selected nine times while ER status was selected seven times. Notably, age, which was not statistically significant under either criterion, was selected five times and ten times for criterion 1 and criterion 2, respectively. This reflects that predictive relevance in multivariate machine learning models does not always correspond directly to univariate statistical associations [42].

In addition to these clinical predictors, deep features from CE-T1, T2, and CT were frequently selected, underscoring their complementary value for treatment response prediction. For criterion 1, six deep features were selected, including one from CE-T1, three from T2, and two from CT, with equal intratumoral and peritumoral representation. For criterion 2, five deep features were selected, including one from CE-T1, two from T2, and two from CT, with two intratumoral and three peritumoral features. The distribution of frequently selected deep features across CE-T1, T2, and CT suggests that each modality captures distinct biological information. CE-T1 captures contrast uptake, reflecting tumor vascularity, perfusion, and permeability, which are related to angiogenesis and necrosis [43]. T2 represents tissue water content, making it sensitive to edema, cystic or necrotic change, and stromal composition [44]. CT characterizes tissue density and enhancement, thereby reflecting vascularity as well as heterogeneity arising from necrosis, fibrosis, or calcification [45]. Furthermore, intratumoral features capture the internal heterogeneity of the tumor, such as cellularity, necrosis, and vascular patterns, whereas peritumoral features reflect the surrounding microenvironment, including stromal reaction, edema, and vascular remodeling. Together, these intratumoral and peritumoral deep features from multiple modalities integrate vascular, compositional, and density-based information, yielding a more comprehensive characterization of the tumor and its microenvironment relevant to treatment response prediction.

Several studies have investigated the utilization of MRI- or CT-based deep learning and deep feature approaches for response prediction in breast cancer. Li et al. conducted a multicenter investigation across four institutions involving 1048 patients [46]. A predictive model was developed that combined handcrafted radiomic features with ResNet50-derived deep features from pre- and mid-treatment MRI to classify residual cancer burden (RCB) using an SVM classifier. When evaluated on pre-treatment MRI, the model yielded AUCs of 0.817, 0.787, and 0.809 across three independent external validation cohorts for distinguishing RCB 0–I from RCB II–III. For separating RCB 0–II from RCB III, the model achieved AUCs of 0.833, 0.819, and 0.801. Peng et al. utilized pretreatment CE-T1 to compare a ResNeXt50-based deep learning model with handcrafted radiomics for pCR in 356 breast cancer patients [47]. Upon five-fold cross-validation, the integrative deep learning model that combined imaging, kinetic parameters, and molecular information achieved an AUC of 0.83, outperforming the integrative handcrafted radiomic model using a linear discriminant analysis (LDA) classifier, which achieved an AUC of 0.781. Dammu et al. developed ResNet-based deep learning models incorporating longitudinal multiparametric MRI and clinical information to predict pCR in 155 breast cancer patients [48]. With pre-treatment MRI combined with clinical information, an accuracy of 0.81 and an AUC of 0.83 were achieved. Li et al. investigated the extraction of deep features using UCTransNet and handcrafted radiomic features extracted with Pyradiomics from pre- and early-treatment CE-T1, along with clinical information, to predict pCR after NAC [49]. Using a 70% training and 30% hold-out test split, the combined model from the pre-treatment imaging data achieved an AUC of 0.738. Moslemi et al. evaluated 117 patients with LABC using pretreatment CT with multiple deep learning architectures, including transfer learning CNNs (VGG16/19, ResNet50/101/152, InceptionV3, Xception) and a Vision Transformer trained end-to-end, to predict NAC response [50]. With 10 iterations of a 70:30 split for training and testing, the Transformer ViT model achieved a balanced accuracy of 66%, outperforming the other CNN-based models. However, that study utilized CNNs pre-trained on the ImageNet-1k dataset, which consists of generic natural images rather than medical imaging. As a result, the transfer learning applied in this context may not adequately capture the complexity of 3D medical imaging. Falou et al. investigated 174 patients with LABC using pre-treatment quantitative ultrasound (QUS) parametric maps for predicting NAC response [51]. A transfer learning approach with ResNet50V2 was used for deep feature extraction from 2D intratumoral and peritumoral segmentations. Using a support vector machine (SVM) classifier, the spectral slope-based model achieved the best performance on an independent test set with a balanced accuracy of 86% for identifying non-responders.

Collectively, previous studies demonstrate the potential of deep learning and deep feature approaches for predicting NAC response, although most have been limited to single-modality analyses, intratumoral segmentation, or models pre-trained on natural image datasets. The present study introduces a multi-modal and multi-segmental framework that integrates intratumoral and peritumoral deep features from CE-T1- and T2-weighted MRI and contrast-enhanced CT, allowing a more comprehensive characterization of tumor heterogeneity and the surrounding tissue environment. MedicalNet-based three-dimensional architectures pre-trained on large medical imaging datasets were used to extract volumetric features that better reflect the spatial complexity of clinical imaging. In addition, two independent response criteria, pathological complete response and a RECIST-based classification, were examined to evaluate the consistency of predictive performance across distinct clinical endpoints.

The study here has several limitations that should be acknowledged. First, the dataset was derived from a single institution and included a relatively small number of patients, which limits the generalizability of the findings. Although repeated cross-validation with ten independent partitions aimed to provide robust and unbiased performance estimates, it cannot fully account for systematic biases inherent to a single-center dataset. To address this, future studies should prioritize external validation on larger, multi-institutional cohorts collected with heterogeneous imaging protocols, which would allow a more rigorous assessment of model robustness in diverse clinical environments. Expanding the cohort size would also create opportunities for multi-class classification of non-responders, partial responders, and complete responders, while ensuring adequate sample sizes for each subgroup. Furthermore, a larger and multi-institutional cohort would enable stratified modeling by molecular subtype, as therapeutic response and prognosis differ substantially across Luminal A, Luminal B, HER2-enriched, and triple-negative cancers. In addition, such a cohort would allow examination of potential differences in treatment response between patients receiving the standard and dose-dense regimens, as well as potentially other non-standard chemotherapy protocols.

5. Conclusions

In conclusion, this study evaluated a deep learning-based framework for predicting response to neoadjuvant chemotherapy in breast cancer using pre-treatment MRI and CT scans together with clinical information. Among the tested architectures, ResNet34 achieved the best performance for both classifications, distinguishing pCR from non-pCR and responders from non-responders. It outperformed handcrafted radiomics as well as the other ResNet models. Through transfer learning, deep feature extraction provided intratumoral and peritumoral representations that captured greater complexity than handcrafted radiomics. These findings suggest that routinely available imaging can be utilized for multimodal deep feature extraction to complement conventional approaches and support more personalized treatment strategies by identifying patients less likely to benefit from standard NAC regimens.

Bibliography51

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Guo L. Kong D. Liu J. Zhan L. Luo L. Zheng W. Zheng Q. Chen C. Sun S. Breast Cancer Heterogeneity and Its Implication in Personalized Precision Therapy Exp. Hematol. Oncol.202312310.1186/s 40164-022-00363-136624542 PMC 9830930 · doi ↗ · pubmed ↗
2Loibl S. AndréF. Bachelot T. Barrios C.H. Bergh J. Burstein H.J. Cardoso M.J. Carey L.A. Dawood S. Del Mastro L. Early Breast Cancer: ESMO Clinical Practice Guideline for Diagnosis, Treatment and Follow-Up Ann. Oncol.20243515918210.1016/j.annonc.2023.11.01638101773 · doi ↗ · pubmed ↗
3Korde L.A. Somerfield M.R. Carey L.A. Crews J.R. Denduluri N. Hwang E.S. Khan S.A. Loibl S. Morris E.A. Perez A. Neoadjuvant Chemotherapy, Endocrine Therapy, and Targeted Therapy for Breast Cancer: ASCO Guideline J. Clin. Oncol.2021391485150510.1200/JCO.20.0339933507815 PMC 8274745 · doi ↗ · pubmed ↗
4Mieog J.S.D. Van De Velde C.J.H. Neoadjuvant Chemotherapy for Early Breast Cancer Expert. Opin. Pharmacother.2009101423143410.1517/1465656090300210519505212 · doi ↗ · pubmed ↗
5Asaoka M. Gandhi S. Ishikawa T. Takabe K. Neoadjuvant Chemotherapy for Breast Cancer: Past, Present, and Future Breast Cancer 202014117822342098037710.1177/117822342098037733402827 PMC 7747102 · doi ↗ · pubmed ↗
6Cortazar P. Zhang L. Untch M. Mehta K. Costantino J.P. Wolmark N. Bonnefoi H. Cameron D. Gianni L. Valagussa P. Pathological Complete Response and Long-Term Clinical Benefit in Breast Cancer: The CT Neo BC Pooled Analysis Lancet 201438416417210.1016/S 0140-6736(13)62422-824529560 · doi ↗ · pubmed ↗
7Von Minckwitz G. Untch M. Blohmer J.U. Costa S.D. Eidtmann H. Fasching P.A. Gerber B. Eiermann W. Hilfrich J. Huober J. Definition and Impact of Pathologic Complete Response on Prognosis after Neoadjuvant Chemotherapy in Various Intrinsic Breast Cancer Subtypes J. Clin. Oncol.2012301796180410.1200/JCO.2011.38.859522508812 · doi ↗ · pubmed ↗
8Spring L.M. Fell G. Arfe A. Sharma C. Greenup R. Reynolds K.L. Smith B.L. Alexander B. Moy B. Isakoff S.J. Pathologic Complete Response after Neoadjuvant Chemotherapy and Impact on Breast Cancer Recurrence and Survival: A Comprehensive Meta-Analysis Clin. Cancer Res.2020262838284810.1158/1078-0432.CCR-19-349232046998 PMC 7299787 · doi ↗ · pubmed ↗