# Timepoint-Specific Benchmarking of Deep Learning Models for Glioblastoma Follow-Up MRI

**Authors:** Wenhao Guo, Golrokh Mirzaei

PMC · DOI: 10.3390/cancers18010036 · Cancers · 2025-12-22

## TL;DR

This study benchmarks eleven deep learning models on follow-up MRI scans of glioblastoma patients, evaluating two follow-up timepoints separately, to see how well each model distinguishes true tumor progression from treatment-related pseudoprogression.

## Contribution

The paper presents what the authors describe as the first timepoint-specific, cross-sectional benchmark of deep learning models for glioblastoma follow-up MRI.

## Key findings

- Model accuracy was similar at both follow-up timepoints, ranging from about 70% to 74%.
- The second follow-up separated the clinical outcomes more clearly: the best model's F1 score rose from 0.44 at the first follow-up to 0.53 at the second.
- A Mamba+CNN hybrid model offered the best balance of accuracy and efficiency.
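
The first two findings can look contradictory, so a small illustration may help: under class imbalance, two sets of predictions can share the same accuracy while differing in F1. The sketch below is not from the paper. The labels are hypothetical toy data, the class name `other` is a placeholder for the unnamed third outcome, and macro averaging is an assumption, since this summary does not state how F1 was averaged.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (macro averaging, assumed)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never present and never predicted scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, NOT data from the paper: 7 TP, 2 PsP, 1 other.
LABELS = ["TP", "PsP", "other"]
y_true = ["TP"] * 7 + ["PsP"] * 2 + ["other"]

# Predictions A: always the majority class.
pred_a = ["TP"] * 10
# Predictions B: one majority case missed, one minority case recovered.
pred_b = ["TP"] * 6 + ["PsP"] + ["PsP", "TP"] + ["TP"]

acc_a, acc_b = accuracy(y_true, pred_a), accuracy(y_true, pred_b)
f1_a, f1_b = macro_f1(y_true, pred_a, LABELS), macro_f1(y_true, pred_b, LABELS)
print(f"A: accuracy={acc_a:.2f} macro-F1={f1_a:.2f}")
print(f"B: accuracy={acc_b:.2f} macro-F1={f1_b:.2f}")
```

Both prediction sets get 7 of 10 cases right, but B earns a higher macro-F1 because it identifies a minority-class case. This is one plausible reason accuracy alone can hide differences between timepoints.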

## Abstract

Glioblastoma is an aggressive brain cancer, and follow-up MRI scans are used to determine whether changes after treatment represent real tumor growth or temporary treatment effects. This decision is difficult, especially at the first follow-up. We analyzed 180 patients and compared eleven deep learning models across two follow-up timepoints. Overall accuracy was similar at both timepoints, ranging from about 70% to 74%. However, the second follow-up provided clearer separation between the three clinical outcomes, with the best model improving its F1 score from 0.44 at the first follow-up timepoint to 0.53 at the second follow-up timepoint. A model that combines convolutional features with a state-space sequence method consistently gave the best balance of accuracy and efficiency, while some transformer models reached higher AUC values but required much more computation. These findings offer a practical benchmark to guide future research and clinical tool development.

Background: Differentiating true tumor progression (TP) from treatment-related pseudoprogression (PsP) in glioblastoma remains challenging, especially at early follow-up.

Methods: We present the first timepoint-specific, cross-sectional benchmarking of deep learning (DL) models for follow-up MRI using the Burdenko GBM Progression cohort (n = 180). We analyze different post-radiotherapy (post-RT) scans independently to test whether architecture performance depends on timepoint. Eleven representative DL families (CNNs, LSTMs, hybrids, transformers, and selective state-space models) were trained under a unified, quality-control (QC)-driven pipeline with patient-level cross-validation.

Results: Across both timepoints, accuracies were comparable (~0.70–0.74), but discrimination improved at the second follow-up, with F1 and AUC increasing for several models, indicating richer separability later in the care pathway. A Mamba+CNN hybrid consistently offered the best accuracy–efficiency trade-off, while transformer variants delivered competitive AUCs at substantially higher computational cost, and lightweight CNNs were efficient but less reliable. Performance also showed sensitivity to batch size, underscoring the need for standardized training protocols. Notably, absolute discrimination remained modest overall, reflecting the intrinsic difficulty of TP vs. PsP and the dataset’s size and imbalance.

Conclusions: These results establish a timepoint-aware benchmark and motivate future work incorporating longitudinal modeling, multi-sequence MRI, and larger multi-center cohorts.
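
The abstract mentions patient-level cross-validation without giving details. As a minimal sketch of the general idea (not the authors' pipeline), the code below assigns whole patients to folds, so no patient contributes scans to both the training and validation sides of a split. The function name `patient_level_folds`, the fold count, the seed, and the patient IDs are all hypothetical, and no class stratification is attempted.

```python
import random


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) index lists, splitting by patient.

    patient_ids[i] is the patient that sample i belongs to. All samples
    from one patient land in the same fold.
    """
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)  # reproducible shuffle of patients
    # Deal shuffled patients round-robin into folds.
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique)}
    for fold in range(n_folds):
        val_idx = [i for i, pid in enumerate(patient_ids) if fold_of[pid] == fold]
        train_idx = [i for i, pid in enumerate(patient_ids) if fold_of[pid] != fold]
        yield train_idx, val_idx


# Hypothetical example: six patients with two scans each.
scan_patients = ["p1", "p1", "p2", "p2", "p3", "p3",
                 "p4", "p4", "p5", "p5", "p6", "p6"]

folds = list(patient_level_folds(scan_patients, n_folds=3, seed=42))
for k, (train_idx, val_idx) in enumerate(folds):
    val_patients = sorted({scan_patients[i] for i in val_idx})
    print(f"fold {k}: validation patients {val_patients}")
```

Splitting by patient rather than by scan avoids leakage, where a model is validated on images from a patient it has already seen in training. With an imbalanced cohort, a stratified variant that also balances outcome labels across folds would likely be preferable.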

## Linked entities

- **Diseases:** glioblastoma (MONDO:0018177)

## Full-text entities

- **Diseases:** tumor (MESH:D009369), GBM (MESH:D005910), Glioblastoma (MESH:D005909)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12784772/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12784772/full.md

## References

38 references — full list in the complete paper: https://tomesphere.com/paper/PMC12784772/full.md

---
Source: https://tomesphere.com/paper/PMC12784772