Learning What is Worth Learning: Active and Sequential Domain Adaptation for Multi-modal Gross Tumor Volume Segmentation
Jingyun Yang, Guoqing Zhang, Jingge Wang, and Yang Li

TL;DR
This paper introduces a novel active and sequential domain adaptation framework for multi-modal medical image segmentation, reducing annotation effort and improving segmentation accuracy in tumor volume delineation tasks.
Contribution
It proposes a new query strategy for dynamic sample selection in active domain adaptation, addressing negative transfer and multi-modal data challenges.
Findings
Outperforms state-of-the-art ADA methods in segmentation accuracy
Reduces annotation costs through effective sample prioritization
Demonstrates robustness across diverse tumor segmentation tasks

Table 1. Comparison of ADA strategies on BraTS 2022 (ET, NCR, ED) and NPC 2024 (NPC).

| Strategy | Labeled target samples | ET Dice (%) | ET mIoU (%) | NCR Dice (%) | NCR mIoU (%) | ED Dice (%) | ED mIoU (%) | NPC Dice (%) | NPC mIoU (%) |
|---|---|---|---|---|---|---|---|---|---|
| Lower bound | 0 | 77.70 | 64.39 | 61.51 | 49.46 | 69.85 | 56.15 | 67.63 | 51.71 |
| Upper bound | 80 | 94.10 | 89.20 | 78.94 | 69.72 | 92.25 | 87.37 | 76.12 | 61.85 |
| Random Selection | 3 | 78.09 | 67.52 | 60.12 | 51.72 | 73.33 | 59.73 | 66.51 | 51.07 |
| AADA (WACV’20) | 3 | 81.51 | 74.53 | 65.66 | 56.95 | 76.30 | 64.20 | 69.71 | 53.91 |
| MHPL (CVPR’23) | 3 | 88.63 | 79.62 | 62.85 | 51.72 | 77.67 | 68.32 | 68.01 | 52.81 |
| CUP (MICCAI’24) | 3 | 83.60 | 75.12 | 64.68 | 47.80 | 77.58 | 67.74 | 69.04 | 53.79 |
| STDR (TMI’24) | 3 | 88.65 | 81.09 | 62.53 | 50.86 | 78.08 | 68.84 | 71.96 | 56.95 |
| LAMDA (ECCV’22) | 3 | 89.45 | 80.91 | 66.50 | 49.81 | 78.67 | 68.74 | 71.17 | 56.26 |
| Ours | 3 | 91.11 | 84.13 | 72.30 | 64.14 | 82.82 | 74.10 | 74.69 | 59.96 |

Table 2. Single-modality versus multi-modal results.

| Modality | ET Dice (%) | ET mIoU (%) | NCR Dice (%) | NCR mIoU (%) | ED Dice (%) | ED mIoU (%) | NPC Dice (%) | NPC mIoU (%) |
|---|---|---|---|---|---|---|---|---|
| FLAIR | 46.06 | 31.62 | 37.75 | 26.77 | 73.21 | 59.99 | - | - |
| T1 | 13.96 | 8.84 | 35.01 | 25.69 | 51.09 | 38.27 | 64.47 | 50.18 |
| T1c | 87.26 | 79.30 | 64.59 | 55.85 | 59.63 | 47.97 | 69.36 | 53.67 |
| T2 | 22.87 | 14.57 | 39.91 | 28.99 | 63.47 | 51.30 | 66.11 | 51.71 |
| Multiple | 91.11 | 84.13 | 72.30 | 64.14 | 82.82 | 74.10 | 74.69 | 59.96 |

Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua Shenzhen International Graduate School, Tsinghua University
Jingyun Yang and Guoqing Zhang contributed equally to this work.
Abstract
Accurate gross tumor volume segmentation on multi-modal medical data is critical for radiotherapy planning in nasopharyngeal carcinoma and glioblastoma. Recent advances in deep neural networks have brought promising results in medical image segmentation, leading to an increasing demand for labeled data. Since labeling medical images is time-consuming and labor-intensive, active learning has emerged as a solution to reduce annotation costs by selecting the most informative samples to label and adapting high-performance models with as few labeled samples as possible. Previous active domain adaptation (ADA) methods seek to minimize sample redundancy by selecting samples that are farthest from the source domain. However, such one-off selection can easily cause negative transfer, and access to source medical data is often limited. Moreover, the query strategy for multi-modal medical data remains unexplored. In this work, we propose an active and sequential domain adaptation framework for dynamic multi-modal sample selection in ADA. We derive a query strategy to prioritize labeling and training on the most valuable samples based on their informativeness and representativeness. Empirical validation on diverse gross tumor volume segmentation tasks demonstrates that our method achieves favorable segmentation performance, significantly outperforming state-of-the-art ADA methods. Code is available at the git repository: mmActS.
Keywords:
Gross tumor volume segmentation · Active domain adaptation · Sequential selection · Multi-modal learning
1 Introduction
Precise delineation of the Gross Tumor Volume (GTV) plays a pivotal role in ensuring effective radiotherapy for prevalent malignancies such as nasopharyngeal carcinoma, predominantly impacting the head and neck area [21], and glioblastoma, presumably originating from glial cells and posing a severe threat to human health [14]. Magnetic Resonance Imaging (MRI) is widely used for tumor detection due to its high soft tissue contrast and non-invasive nature, while multi-modal MRI data can map various tumor-induced tissue changes [23]. For example, FLAIR highlights tissue water relaxation differences, while post-Gadolinium T1 reveals intratumoral contrast uptake [1].
Manual GTV delineation on multi-modal MRI images is time-consuming and subject to inter-observer variability. Recent advances in deep learning have brought promising results in automatic medical image analysis, yielding successful models for various segmentation tasks [4, 24, 13]. However, the generalization capability of these models is limited by the large variability in training data and the lack of labeled data [25].
One approach to ensuring reliable and robust model adaptation is Active Domain Adaptation (ADA) [20], where the most informative samples are actively selected for labeling and then used to fine-tune high-performance models with as few labeled samples as possible, as shown in Fig. 1. Previous ADA methods seek to minimize sample redundancy either by optimizing for minimal cosine similarity to existing training data [11], or by combining source feature embeddings as the clustering reference [21]. However, these approaches presuppose access to source data, an assumption that often fails in medical imaging due to regulatory constraints and ethical concerns. Source-free methods [20, 18] try to select samples with higher epistemic uncertainty based on prediction probabilities from a single forward pass. Nonetheless, such static one-off selection schemes, which neglect the evolving training dynamics and the inherent domain shift, can easily cause negative transfer. Moreover, none of the aforementioned methods fully explores the characteristics of multi-modal medical images, e.g., multi-sequence MRI. Works like [6] utilize transformer layers to integrate features extracted from each modality, while self-attention-based convolution methods enable weighted fusion of multi-modal MRI data [10]. This raises the question: how can multi-modal data be actively combined to better adapt the model?
To address the above issues, we propose an active and sequential domain adaptation framework for advancing gross tumor volume segmentation on multi-modal data in a source-free manner. To the best of our knowledge, this is the first work to propose a query strategy for multi-modal medical data. Using a novel dynamic sample selection strategy, we prioritize labeling and training on samples that are worth learning. Specifically, in each selection round, we estimate the uncertainty, objective abundance, and density of each sample as indicators of their informativeness and representativeness. Taking into account all these factors, we identify the most valuable sample in the current state. Then, a dominant modality election procedure is introduced to select the modality that exhibits promising performance for annotation, substantially reducing the annotation burden. By optimizing the use of rare medical resources, both multi-modal data and clinician efforts, our method significantly enhances GTV segmentation. Extensive experiments on benchmark 3D MRI datasets with various tumor segmentation tasks validate the effectiveness of our method, outperforming all the other state-of-the-art ADA methods. We believe our proposed approach will better leverage rare medical resources, including multi-modal data and clinician expertise, to adapt a model for the desired target task in a fast and scalable way. In summary, our main contributions are:
- An active and sequential domain adaptation framework: we propose a novel framework that dynamically selects the most valuable samples to learn, enabling an effective adaptation of well-trained source models to target tasks in a source-free manner with as few labeled samples as possible.
- A multi-modal sample query strategy: we derive an effective query strategy for dynamic multi-modal sample selection and significantly reduce labeling costs, optimizing the use of rare medical resources.
2 Methodology
In this section, we present the proposed active and sequential domain adaptation framework in the context of multi-modal medical image segmentation, shown in Fig. 2. First, we clarify the setting of active learning with multi-modal data. Then we introduce the sequential query strategy and define the selection criterion, with the consideration of the informativeness and representativeness of each sample. Finally, we describe the target model fine-tuning procedure.
2.1 Problem Definition
ADA problem setting. In active learning, given an unlabeled target domain dataset, we select the most valuable samples to label and learn from. We assume access to a source model pre-trained on adequate source domain data, although the source data itself is not always accessible in medical scenarios. The goal is to adapt the source model to achieve high performance on the target dataset with as few labeled samples as possible.
Multi-modal Learning. Each sample consists of M modality scans, one per imaging modality. To effectively fuse information from multiple modalities, we train the model in a multi-channel manner. Specifically, we stack the different modalities along the channel dimension, forming a multi-channel input that is fed into the model. The convolutional layers process all channels jointly, enabling the model to capture both modality-specific and cross-modality features for improved representation learning.
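A minimal sketch of the multi-channel input described above, assuming NumPy and a channel-first layout. The helper `stack_modalities`, the array names, and the toy volume shape are illustrative and are not taken from the paper's code.

```python
import numpy as np

def stack_modalities(scans):
    """Stack per-modality 3D volumes into one channel-first array (M, D, H, W)."""
    shapes = {s.shape for s in scans}
    if len(shapes) != 1:
        raise ValueError(f"all modality scans must share one shape, got {shapes}")
    return np.stack(scans, axis=0)

# Toy example: four modalities (e.g. FLAIR, T1, T1c, T2) of one 8x16x16 volume.
rng = np.random.default_rng(0)
scans = [rng.normal(size=(8, 16, 16)).astype(np.float32) for _ in range(4)]
x = stack_modalities(scans)   # shape (4, 8, 16, 16)
batch = x[None]               # add a batch axis: (1, 4, 8, 16, 16)
```

With this layout, the first convolution of a segmentation network would simply take M input channels, so every filter sees all modalities at each voxel.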
2.2 Query Strategy
Instead of a one-off selection, we sequentially select the most valuable samples until the labeling budget is exhausted, since the model evolves and the most informative sample changes at each time step. We derive a selection criterion to assess each sample in the unlabeled pool. In each selection round, the most valuable sample is selected and further undergoes a dominant modality (DM) election procedure over all of its modalities to identify the modality that exhibits more promising performance than the others. Next, we query the label for the elected scan from the oracle, yielding a labeled pair. The model is then trained on the labeled set from this selection round until the next.
Selection criterion. In domain adaptation tasks, the source model has already acquired some fundamental knowledge [7, 25].
To achieve good target performance, we estimate the informativeness and representativeness of each sample to capture the complexities and variabilities in the target data. Given a labeling budget, we prioritize annotating and training on samples that are worth learning, enabling the model to learn robust representation capabilities.
At each time step, we apply the current model to every unlabeled sample to obtain its predicted mask.
In each selection round, we quantify the informativeness of a sample by jointly considering its predictive uncertainty and its objective abundance, estimated via the total predicted volume:
[TABLE]
where the summation denotes the number of voxels predicted as foreground in the segmentation mask. The uncertainty score is computed as the mean voxel-wise entropy across all predictions for the sample. Specifically, for a model output defined over a set of classes and voxels, we define:
[TABLE]
where the entropy function is applied to the class probabilities at each voxel, yielding a voxel-level uncertainty map. Inspired by [22], the two components in the measure are interpreted as 1) an uncertainty cue and 2) a concentration cue. The former identifies data that the model cannot predict confidently, while the latter indicates data with a higher concentration of objectives, necessitating oracle annotation for precise model training and refinement.
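The two cues can be sketched as follows, assuming NumPy, softmax probabilities of shape (C, D, H, W), and class 0 as background. The function names are hypothetical, and the sketch returns the two cues separately rather than guessing how the paper combines them into a single informativeness score.

```python
import numpy as np

def voxel_entropy(probs, eps=1e-12):
    """Voxel-level uncertainty map: Shannon entropy over the class axis (axis 0)."""
    p = np.clip(probs, eps, 1.0)
    return -(p * np.log(p)).sum(axis=0)

def uncertainty_and_volume(probs):
    """Return (mean voxel-wise entropy, number of voxels predicted as foreground)."""
    uncertainty = float(voxel_entropy(probs).mean())
    pred_mask = probs.argmax(axis=0)          # assumed: label 0 is background
    volume = int((pred_mask > 0).sum())
    return uncertainty, volume

# A uniform two-class prediction has entropy log(2) at every voxel.
uniform = np.full((2, 4, 4, 4), 0.5)
u, v = uncertainty_and_volume(uniform)

# A fully confident foreground prediction has near-zero entropy and full volume.
confident = np.zeros((2, 4, 4, 4))
confident[1] = 1.0
u2, v2 = uncertainty_and_volume(confident)
```

The two toy inputs show why both cues matter: the first is maximally uncertain but predicts no foreground, while the second predicts a large object with no uncertainty.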
Meanwhile, we quantify the representativeness of a sample based on its density,
[TABLE]
where the distance between two samples is measured by the Wasserstein distance [16], chosen for its ability to handle shifts in data distributions, and the density is computed with respect to a neighborhood distance threshold. Specifically, given a pair of samples with M modalities, the sample distance is defined as:
[TABLE]
where the compared distributions are those of the images after dimension reduction using principal component analysis. The data-pair Wasserstein distance is defined as:
[TABLE]
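A rough sketch of a density-style representativeness score, under assumptions that the text does not confirm: each modality scan is already reduced (e.g. by PCA) to a 1D array of K values, per-modality distances are averaged, and density is the fraction of other samples within the threshold. The 1D Wasserstein-1 distance is computed from sorted values, which is exact for equal-size empirical samples. All function names are hypothetical.

```python
import numpy as np

def wasserstein_1d(a, b):
    """1D Wasserstein-1 distance between two equal-size empirical samples."""
    a, b = np.sort(np.ravel(a)), np.sort(np.ravel(b))
    if a.size != b.size:
        raise ValueError("this sketch assumes equal-size samples")
    return float(np.abs(a - b).mean())

def sample_distance(xi, xj):
    """Average the per-modality distances; xi and xj have shape (M, K)."""
    return float(np.mean([wasserstein_1d(mi, mj) for mi, mj in zip(xi, xj)]))

def density(features, threshold):
    """Fraction of the other samples lying within `threshold` of each sample."""
    n = len(features)
    dist = np.array([[sample_distance(features[i], features[j]) for j in range(n)]
                     for i in range(n)])
    neighbours = (dist <= threshold).sum(axis=1) - 1   # exclude the sample itself
    return neighbours / max(n - 1, 1)

# Toy pool: three nearby samples and one outlier, M=2 modalities, K=5 features.
base = np.linspace(-1.0, 1.0, 5)
pool = [np.stack([base + s, base + s]) for s in (0.0, 0.5, 1.0)]
pool.append(np.stack([base + 10.0, base + 10.0]))
rho = density(pool, threshold=3.0)   # the outlier gets density 0
```

In this toy pool the three nearby samples each have two neighbours out of three, while the outlier has none, so a density-based criterion would treat it as unrepresentative.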
Finally, the selection criterion for unlabeled target data is written as:
[TABLE]
Accordingly, in each selection round, the most valuable sample under this criterion is selected and further undergoes a DM election procedure to identify the dominant modality scan for annotation.
Label Query. In the dominant modality election procedure, a validation process is applied: for the selected sample, the modality that exhibits the most promising performance is selected to be annotated. Formally, we have:
[TABLE]
where the Dice score of each modality is computed with the current model against an estimated pseudo-label, and the index ranges over the different modalities. We query the label for the dominant modality scan from the oracle (e.g., doctors), yielding a labeled pair that is added to the labeled set.
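The election step can be sketched as an argmax over per-modality Dice scores. This is an illustration under assumptions: masks are binary NumPy arrays, there is one predicted mask per modality, and the pseudo-label is supplied by the caller. How the paper obtains the per-modality predictions and the pseudo-label is not reproduced here, and `elect_dominant_modality` is a hypothetical helper.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice overlap between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return float((2.0 * inter + eps) / (pred.sum() + target.sum() + eps))

def elect_dominant_modality(preds_by_modality, pseudo_label):
    """Return the modality whose mask agrees best with the pseudo-label."""
    scores = {m: dice_score(p, pseudo_label) for m, p in preds_by_modality.items()}
    return max(scores, key=scores.get), scores

# Toy example on a 1D "volume" of 6 voxels.
pseudo = np.array([0, 1, 1, 1, 0, 0])
preds = {
    "T1":  np.array([0, 0, 0, 1, 1, 0]),   # Dice = 2*1 / (2 + 3) = 0.4
    "T1c": np.array([0, 1, 1, 1, 0, 0]),   # Dice = 1.0
    "T2":  np.array([1, 1, 0, 0, 0, 0]),   # Dice = 2*1 / (2 + 3) = 0.4
}
best, scores = elect_dominant_modality(preds, pseudo)   # best == "T1c"
```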
2.3 Active and Sequential Model Training
After each oracle annotation round, we train the model using the labeled set in a supervised manner. It is worth noting that if the model is not properly initialized before the active learning process starts, it may produce meaningless informativeness estimates for the target domain [20]. A one-off selection based on such criteria can easily cause negative transfer. To mitigate this, we sequentially select the most valuable sample to label in each query round, based on the model’s current state, until the labeling budget is reached, and fine-tune the model incrementally. After the final query round, the labeled samples remain significantly fewer than the total target samples. As the extra computational cost for sample assessment is minimal (on the order of seconds), the query strategy adds negligible overhead. The proposed algorithm is shown in Alg. 1.
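The overall procedure can be sketched as the loop below. It is a schematic under stated assumptions, not the authors' Alg. 1: `score`, `elect_modality`, `oracle`, and `fine_tune` are hypothetical callables standing in for the selection criterion, the DM election, expert annotation, and supervised training.

```python
def active_sequential_adaptation(model, pool, budget, score, elect_modality,
                                 oracle, fine_tune):
    """Select, annotate, and train on one sample per round until the budget is spent."""
    unlabeled = list(pool)
    labeled = []
    while len(labeled) < budget and unlabeled:
        # Re-score every remaining sample with the *current* model state.
        best = max(unlabeled, key=lambda s: score(model, s))
        unlabeled.remove(best)
        modality = elect_modality(model, best)
        labeled.append((best, modality, oracle(best, modality)))
        model = fine_tune(model, labeled)
    return model, labeled

# Toy run: samples are integers, the "model" is the set of values seen so far,
# and the score prefers samples far from everything already learned.
def toy_score(model, s):
    return min((abs(s - m) for m in model), default=s)

model, labeled = active_sequential_adaptation(
    model=frozenset(), pool=[1, 2, 9, 10], budget=3,
    score=toy_score,
    elect_modality=lambda model, s: "T1c",
    oracle=lambda s, modality: f"mask-{s}",
    fine_tune=lambda model, labeled: model | {labeled[-1][0]},
)
picked = [s for s, _, _ in labeled]   # [10, 1, 2]
```

In this toy setting a one-off ranking with the initial model would pick 10, 9, and 2, spending two labels on near-duplicates, whereas re-scoring after each update moves the second pick to the other end of the pool.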
3 Experiments and Results
3.1 Datasets and Training Setup
Two multi-modal GTV segmentation datasets are used in our work: BraTS 2022 [14, 2, 3] and NPC 2024 [12, 21]. BraTS 2022 includes 3D MRI volumes across FLAIR, T1, T1c, and T2 modalities, with segmentation targets of enhancing tumor (ET), edema (ED), and necrotic core (NCR). NPC 2024 contains nasopharyngeal carcinoma (NPC) cases whose tumors are manually delineated on each slice of the patient’s T1, T1c, and T2 MRI images. For both datasets, we use 80 cases for training and 20 for evaluation. We implement all methods on a pre-trained nnU-Net [9], following the pre-training settings of MONAI [4]. The models are trained on an A800 80GB GPU for a maximum of 600 epochs with a batch size of 5 and an initial learning rate of 0.01, decayed following the poly learning rate policy [5]. For the query strategy, the selection stride is set to 40 epochs. For the label budget, following the common few-shot active learning setting in [15], all experiments are conducted under a 1-way 3-shot scenario using 5-fold cross-validation. We also experimented with budgets of 5, 3, 2, and 1 labeled target samples, and found that 3 samples strike a good balance, achieving satisfactory performance while keeping annotation costs low.
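For reference, the poly learning rate policy can be written as a one-line function. The initial rate of 0.01, the 600-epoch horizon, and the 40-epoch stride come from the setup above, while the exponent `power=0.9` (the usual DeepLab default) and the per-epoch rather than per-iteration update are assumptions.

```python
def poly_lr(epoch, max_epochs=600, base_lr=0.01, power=0.9):
    """Poly policy: base_lr * (1 - epoch / max_epochs) ** power."""
    return base_lr * (1.0 - epoch / max_epochs) ** power

# Learning rate at the start of training and at each 40-epoch selection boundary.
schedule = {e: poly_lr(e) for e in range(0, 600, 40)}
```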
3.2 Performance evaluation
To investigate the effectiveness and efficiency of ADA, we consider two baselines: direct inference without fine-tuning (lower bound) and fine-tuning the model with all samples labeled (upper bound). Meanwhile, we compare our framework with five state-of-the-art ADA methods, alongside random selection:
- AADA [17] adversarially adapts the model with importance sampling,
- MHPL [19] exploits minimum happy points based on neighbor uncertainty and diversity,
- STDR [21] selects domain-invariant and -specific samples referenced to source domain points,
- CUP [22] uses a cascade sampling strategy based on prediction informativeness,
- LAMDA [8] uses a multi-round selection strategy with label distribution matching and density-aware active sampling.
We keep the label ratio and query stride the same across methods, ensuring a consistent analysis. We evaluate the segmentation performance using the Dice score and the mean IoU. A quantitative analysis of model adaptation performance on the BraTS 2022 and NPC 2024 datasets is detailed in Table 1 and visualized in Fig. 3. The results are averaged over three independent runs with different training data splits to ensure robustness. Our proposed framework significantly outperforms all state-of-the-art ADA methods on both datasets across anatomical regions. Compared to random selection, it achieves an average relative Dice gain of 16.62% on BraTS 2022 and 12.23% on NPC 2024. Furthermore, compared to LAMDA, our method yields an average relative Dice gain of 5.28% on BraTS 2022 and 4.95% on NPC 2024.
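The reported average gains are consistent with relative improvements averaged per region, which can be checked from the Dice values in Table 1. The helper `mean_relative_gain` is only an illustrative name.

```python
# Dice (%) values copied from Table 1 (ET, NCR, ED on BraTS 2022; NPC on NPC 2024).
dice = {
    "Random": {"ET": 78.09, "NCR": 60.12, "ED": 73.33, "NPC": 66.51},
    "LAMDA":  {"ET": 89.45, "NCR": 66.50, "ED": 78.67, "NPC": 71.17},
    "Ours":   {"ET": 91.11, "NCR": 72.30, "ED": 82.82, "NPC": 74.69},
}

def mean_relative_gain(baseline, regions):
    """Average over regions of (ours - baseline) / baseline, in percent."""
    gains = [(dice["Ours"][r] - dice[baseline][r]) / dice[baseline][r] for r in regions]
    return 100.0 * sum(gains) / len(gains)

brats = ["ET", "NCR", "ED"]
print(round(mean_relative_gain("Random", brats), 2))    # 16.62
print(round(mean_relative_gain("LAMDA", brats), 2))     # 5.28
print(round(mean_relative_gain("LAMDA", ["NPC"]), 2))   # 4.95
```

Applied to random selection on NPC 2024, the same formula gives about 12.30% from the rounded table entries, close to the reported 12.23%.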
Effectiveness of sequential selection. Adequate source data for label matching and joint training is usually not available in medical scenes. To further investigate the effectiveness of sequential selection, we conducted one-off selection experiments. For the NPC 2024 dataset, one-off selection with our criterion achieves a Dice score of 0.7103, compared to 0.7117 with LAMDA, 0.7196 with STDR, and 0.7469 with our sequential selection. For the BraTS 2022 dataset, our one-off selection yields an average Dice score of 0.8072, outperforming LAMDA (0.7821) and STDR (0.7642), while our sequential selection achieves the highest score of 0.8208. These results highlight the superiority of our sequential selection scheme without relying on any source domain data while achieving SOTA performance.
Effectiveness of multi-modal learning. The results in Table 2 show that multi-modal learning can indeed enhance tumor segmentation. Clinicians often provide scans with multiple modalities since multi-sequence data is easily collected, making it more efficient to use all available data rather than randomly choosing a single modality, especially given the performance variations observed in the ET results.
4 Conclusion and Future Work
We propose a novel source-free active and sequential domain adaptation framework for advancing GTV segmentation on multi-modal medical data. Experiments on two benchmark medical datasets demonstrate that the proposed method achieves state-of-the-art performance in ADA problems within the realm of medical image processing. One limitation of our method lies in the informativeness criterion, which is adapted from [22] and was originally proposed for vessel segmentation tasks. While it effectively identifies salient regions, it tends to bias the selection toward larger tumors. This can be problematic in datasets with varying tumor sizes, leading to suboptimal performance on early-stage or small-sized tumors. Moreover, our current experiments are limited to U-Net-based architectures. For future research, we plan to incorporate more advanced foundation models such as Med-SAM [13] and explore robust sampling strategies to improve segmentation performance on small but significant lesions and enhance the overall clinical utility of our approach.
Acknowledgements
This work is supported in part by the Natural Science Foundation of China (Grant 62371270).
Disclosure of Interests
The authors have no conflicts of interest to declare that are relevant to the content of this article.
References
- [1] Bai, J.W., Qiu, S.Q., Zhang, G.J.: Molecular and functional imaging in cancer-targeted therapy: current applications and future directions. Signal Transduction and Targeted Therapy 8(1), 89 (2023)
- [2] Baid, U., Ghodasara, S., Mohan, S., Bilello, M., Calabrese, E., Colak, E., Farahani, K., Kalpathy-Cramer, J., Kitamura, F.C., Pati, S., et al.: The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314 (2021)
- [3] Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J.S., Freymann, J.B., Farahani, K., Davatzikos, C.: Advancing the Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Scientific Data 4(1), 1–13 (2017)
- [4] Cardoso, M.J., Li, W., Brown, R., Ma, N., Kerfoot, E., Wang, Y., Murrey, B., Myronenko, A., Zhao, C., Yang, D., et al.: MONAI: An open-source framework for deep learning in healthcare. arXiv preprint arXiv:2211.02701 (2022)
- [5] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4), 834–848 (2017)
- [6] Cho, J., Park, J.: Hybrid-fusion transformer for multisequence MRI. In: International Conference on Medical Imaging and Computer-Aided Diagnosis. pp. 477–487. Springer (2022)
- [7] Guan, H., Liu, M.: Domain adaptation for medical image analysis: a survey. IEEE Transactions on Biomedical Engineering 69(3), 1173–1185 (2021)
- [8] Hwang, S., Lee, S., Kim, S., Ok, J., Kwak, S.: Combating label distribution shift for active domain adaptation. In: European Conference on Computer Vision. pp. 549–566. Springer (2022)
