Disease Severity Regression with Continuous Data Augmentation
Shumpei Takezaki, Kiyohito Tanaka, Seiichi Uchida, Takeaki Kadota

TL;DR
This paper introduces a continuous data augmentation method using a novel GAN to generate medical images at real-valued severity levels, improving disease severity regression accuracy.
Contribution
It proposes a continuous severity GAN and dataset-disjoint multi-objective optimization to enhance medical image data augmentation for severity estimation.
Findings
Achieved higher classification performance than conventional methods.
Effectively generated images at real-valued severity levels.
Improved disease severity regression accuracy.
Abstract
Disease severity regression by a convolutional neural network (CNN) for medical images requires a sufficient number of image samples labeled with severity levels. Conditional generative adversarial network (cGAN)-based data augmentation (DA) is a possible solution, but it encounters two issues. The first issue is that existing cGANs cannot deal with real-valued severity levels as their conditions, and the second is that the severity of the generated images is not fully reliable. We propose continuous DA as a solution to the two issues. Our method uses continuous severity GAN to generate images at real-valued severity levels and dataset-disjoint multi-objective optimization to deal with the second issue. Our method was evaluated for estimating ulcerative colitis (UC) severity of endoscopic images and achieved higher classification performance than conventional DA methods.
| Method | Precision | Recall | F1-score |
|---|---|---|---|
| Regression (Baseline) | 0.782 | 0.631 | 0.652 |
| + Classic DA | 0.731 | 0.657 | 0.668 |
| + GAN-based DA | 0.697∗ | 0.651 | 0.663 |
| C-DA w/o GAN | 0.744 | 0.629 | 0.648 |
| C-DA ($\Delta = 1$) | 0.743 | 0.624 | 0.638 |
| C-DA ($\Delta = 0.5$) | 0.717∗ | 0.690∗ | 0.696∗ |
| C-DA ($\Delta = 0.25$) | 0.688∗ | 0.672∗ | 0.675 |
**Index Terms:** Data augmentation, generative adversarial network, endoscopic images
1 Introduction
Disease severity regression is the task of determining a function $f$ that satisfies $f(x) \approx y$ for a given dataset $\Omega = \{(x, y)\}$, where $x$ is a medical image, such as an endoscopic image, and $y$ is its severity level. Nowadays, it is common to use a convolutional neural network (CNN) as the model of $f$ because a CNN has a powerful representation ability to deal with the nonlinear relationship between image appearance and severity. It is also common to use discrete severity levels as $y \in \{1, \ldots, K\}$. For example, Mayo scores of endoscopic images with ulcerative colitis (UC) have $K = 4$ levels.
If the labeled dataset $\Omega$ is too small to train the CNN, data augmentation (DA) is often employed to generate synthetic data from $\Omega$. A possible DA technique is a conditional generative adversarial network (cGAN). Given a discrete severity level $k \in \{1, \ldots, K\}$ as the condition, a cGAN generates various images at the severity level $k$. The generated images are then used to train the CNN together with the original dataset $\Omega$.
This paper focuses on two issues of the above cGAN-based DA for disease severity regression. The first issue is that disease severity is inherently continuous, so we do not need to adhere to the discrete conditions $k \in \{1, \ldots, K\}$. In other words, generating images at real-valued severity levels will help train the CNN appropriately. The second issue is that the severity of the generated image is not very reliable. Even if we generate an image $\tilde{x}$ with the condition $r$, there is a risk that the visual severity of $\tilde{x}$ is not precisely equal to $r$.
We propose a continuous DA scheme, where a new technique tackles each issue. For the first issue, we propose a continuous severity GAN (csGAN). Fig. 1 (a) shows the overview of csGAN. Our csGAN is trained with images with discrete levels ($k \in \{1, \ldots, K\}$) but can generate images at real-valued severity levels ($r \in [1, K]$).
For the second issue, we use a dataset-disjoint multi-objective optimization, where the original dataset $\Omega$ (with discrete levels) and the augmented dataset $\tilde{\Omega}$ (with real-valued levels) are used in different ways according to their different reliability. Specifically, as shown in Fig. 1 (b), we train a CNN $f$ with a regression loss for $\Omega$ and a ranking loss for $\tilde{\Omega}$. The former works to satisfy $f(x) \approx y$ and the latter $f(\tilde{x}_r) < f(\tilde{x}_{r'})$ when $r < r'$. This means that the levels of the augmented data are not used as absolute ground truth but as relative conditions for training $f$.
The proposed techniques are evaluated on a UC image dataset. As a qualitative evaluation, we inspect the images generated by csGAN and confirm that we can continuously control the visual severity level of the generated images. As a quantitative evaluation, we confirm that our continuous DA helps to improve the severity regression performance.
Our main contributions are summarized as follows:
- We propose csGAN, which can generate images at real-valued severity levels.
- We also propose to use dataset-disjoint multi-objective optimization for the disease severity regression task with an augmented dataset.
- Experimental evaluations with a UC image dataset show the performance superiority of our continuous DA scheme using the above two techniques over a baseline and other cGAN-based DA.
2 Related Work
Conditional GANs: Various cGANs have been proposed so far [1, 2, 3, 4], and they assume various types of conditions. For example, in the pix2pix [5] framework, an image is given as a condition. The most common condition is a class label, given as a one-hot vector or a discrete number. In other words, it is not common to specify the target type of generated images by a real-valued number (such as 1.33 or 0.28). As an exception, CcGAN [6] accepts real-valued conditions; however, it relies on the strong assumption that a real-valued annotation is already attached to each training sample. In contrast, our csGAN can be trained with discrete conditions but can still generate images at real-valued conditions.
DA for medical images: Due to the high cost of annotation, medical image analysis tasks often suffer from a limited number of labeled data and thus employ DA methods. According to a 2021 survey [7], basic augmentation techniques, such as linear and nonlinear geometric transformations and intensity-level perturbations, still account for the majority of medical image DA. However, the survey also shows that GAN-based DA methods have increased in recent papers (such as [8, 9, 10]). As with the cGANs reviewed above, GAN-based DA for medical images has not dealt with real-valued conditions. Moreover, to the best of the authors’ knowledge, the augmented dataset is simply merged with the original dataset without any special treatment.
3 Continuous Data Augmentation
This section describes the two techniques in continuous DA, i.e., csGAN and dataset-disjoint multi-objective optimization. The former is a new conditional GAN trained to generate images at real-valued severity levels. The latter is a technique to train the regression model by using the original dataset and the augmented dataset in different manners, considering the reliability of the severity levels of the augmented data.
3.1 Continuous Severity GAN (csGAN)
As noted in Section 1, we propose csGAN to generate images at real-valued severity levels. Inspired by StarGAN v2 [11], csGAN comprises four modules: a mapping network $M$, a generator $G$, a style encoder $E$, and a discriminator $D$, as shown in Fig. 1 (a). csGAN uses a style vector $s_k$ as a condition to generate images at the level $k$, where $k \in \{1, \ldots, K\}$. A different random vector $z$ results in a different style vector $s_k$ and thus a different generated image at the level $k$. Hereafter, we often denote as for simplicity.
Due to the page limitation, we only briefly summarize the roles of the four modules $M$, $G$, $E$, and $D$:
- $M$ accepts a random vector $z$ and then outputs the $K$ style vectors $s_1, \ldots, s_K$ at once.
- $G$ accepts a real or generated (i.e., fake) image and the condition $s_k$ and then generates an image at the level $k$.
- $E$ accepts a real or fake image with its level $k$ and then estimates its style vector, expecting the estimated vector to be similar to the $s_k$ input to $G$.
- $D$ is a standard discriminator for real/fake decisions on images.
Those modules are trained to achieve cycle consistency; a generated image $\tilde{x}_{k'} = G(x_k, s_{k'})$ for the level $k'$ needs to satisfy the condition $G(\tilde{x}_{k'}, s_k) \approx x_k$. (Note that $x_k$ is a real image at the level $k$.) By this cycle consistency, we have a level-$k'$ version of $x_k$ and a level-$k$ version of $\tilde{x}_{k'}$. Consequently, we have images at all $K$ levels, even from a single image at a certain level $k$.
For generating images at real-valued levels, csGAN introduces an additional loss function, called an order loss, for $M$:
[Equation (1): order loss]
where . With the order loss, we expect that the style vectors from the same $z$ will have a linear property, that is,
[Equation (2): linear property of the style vectors]
This linear property will allow us to consider a real-valued severity level $r$, where $k \le r \le k+1$. More specifically, we can derive the style vector $s_r$ for the real-valued level $r$ by the linear interpolation,
$$s_r = (k + 1 - r)\, s_k + (r - k)\, s_{k+1}. \tag{3}$$
As shown in Fig. 1 (a), at the test phase, we use $M$ and $G$ to generate an image at a real-valued severity level $r$. First, the style vectors $s_1, \ldots, s_K$ are obtained by the mapping network $M$ with a random vector $z$. Then, for a certain $r$, the style vector $s_r$ is determined by Eq. (3). Finally, a level-$r$ version of an image $x$ is generated by $G(x, s_r)$.
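The interpolation step of this test-time procedure can be illustrated with a minimal sketch. It assumes that levels are indexed $1, \ldots, K$ and that the interpolation is taken between the two integer levels adjacent to $r$. The function name `interpolate_style` and the toy style vectors are hypothetical stand-ins for the output of the mapping network, not the authors' implementation.

```python
import math
from typing import List, Sequence


def interpolate_style(styles: Sequence[Sequence[float]], r: float) -> List[float]:
    """Linearly interpolate a style vector for a real-valued level r.

    styles[k - 1] is the style vector of discrete level k (levels are
    assumed to be indexed 1..K); r must lie in [1, K].
    """
    num_levels = len(styles)
    if not 1.0 <= r <= num_levels:
        raise ValueError(f"r must lie in [1, {num_levels}], got {r}")
    k = min(math.floor(r), num_levels - 1)  # lower neighboring level (1-based)
    w = r - k                               # weight of the upper neighbor
    lower, upper = styles[k - 1], styles[k]
    return [(1.0 - w) * a + w * b for a, b in zip(lower, upper)]


# Toy 2-D style vectors for K = 4 levels (hypothetical values).
toy_styles = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
style_at_1_5 = interpolate_style(toy_styles, 1.5)
```

At integer levels the function returns the corresponding discrete style vector unchanged, so the real images' levels remain a special case of the continuous ones.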
3.2 Learning by Dataset-Disjoint Multi-Objective Optimization
As noted in Section 1, the severity level of the generated data is not very reliable. In particular, since we use the simple linear style vector interpolation of Eq. (3), we cannot guarantee that the generated data of the level $r$ has exactly the visual characteristics of the level $r$. In other words, the level $r$ is not fully reliable as an absolute level.
However, $r$ is still reliable as a relative level; for a pair of real-valued levels $r$ and $r'$ (where $r < r'$), the generated images $\tilde{x}_r$ and $\tilde{x}_{r'}$ are expected to show the same relative order in their severity levels, that is, $f(\tilde{x}_r) < f(\tilde{x}_{r'})$. By training the model $f$ to satisfy this relative condition (instead of training it to satisfy $f(\tilde{x}_r) \approx r$), we can utilize the augmented data by csGAN in an appropriate manner.
Considering the above property of the generated data, we use a dataset-disjoint multi-objective optimization scheme to train the regression model $f$, as shown in Fig. 1 (b). Assume we have an original dataset $\Omega$ with manually annotated discrete severity levels $y \in \{1, \ldots, K\}$ and a generated image dataset $\tilde{\Omega}$ at various real-valued levels $r \in [1, K]$. Then, the CNN-based regression model $f$ is trained with both datasets $\Omega$ and $\tilde{\Omega}$, which are used in different ways. Since $y$ is reliable as an absolute level, each image $x$ in $\Omega$ is used to train $f$ to satisfy $f(x) \approx y$. Here, we use a mean squared error loss $\mathcal{L}_{\mathrm{MSE}}$. On the other hand, since $r$ is reliable as a relative level, images $\tilde{x}_r$ and $\tilde{x}_{r'}$ in $\tilde{\Omega}$ with the relative relationship $r < r'$ are used to train $f$ to satisfy $f(\tilde{x}_r) < f(\tilde{x}_{r'})$. Here, we use the loss function of ListNet [12], which is one of the most popular methods for learning-to-rank. These two loss functions are balanced by a hyperparameter, which is optimized on a validation set.
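The two objectives can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: the ranking term is taken to be the top-one variant of the ListNet loss (cross entropy between the softmax of the assigned levels and the softmax of the model outputs), and the balancing hyperparameter, here called `lam`, is assumed to multiply the ranking term. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def mse_loss(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between model outputs and absolute levels."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def _softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_top1_loss(preds: Sequence[float], levels: Sequence[float]) -> float:
    """Top-one ListNet loss: cross entropy between the top-one
    probabilities induced by the levels and by the model outputs."""
    p_true = _softmax(levels)
    p_pred = _softmax(preds)
    return -sum(t * math.log(q) for t, q in zip(p_true, p_pred))


def total_loss(real_preds: Sequence[float], real_levels: Sequence[float],
               gen_preds: Sequence[float], gen_levels: Sequence[float],
               lam: float = 1.0) -> float:
    """Regression loss on real data plus weighted ranking loss on generated data.

    `lam` is an assumed name for the balancing hyperparameter.
    """
    return mse_loss(real_preds, real_levels) + lam * listnet_top1_loss(gen_preds, gen_levels)
```

Because the ranking term depends only on the relative ordering induced by the generated levels, an imprecise absolute level on a generated image does not pull the regression output toward a wrong target.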
4 Experimental Results
4.1 Experimental Setup
Dataset: To evaluate the proposed method (continuous DA, C-DA for short), we used a dataset of UC endoscopic images collected from the Kyoto Second Red Cross Hospital. The dataset contains 10,265 images from 388 patients. All images are annotated with discrete Mayo scores by multiple experts and resized to 256 × 256 pixels. The distribution of Mayo scores is 6,678, 1,995, 1,395, and 197 images for Mayo 0, 1, 2, and 3, respectively. Note that Mayo 0 corresponds to the level $k = 1$ and Mayo 3 to $k = 4$.
Fig. 2 shows several examples of endoscopic images for each Mayo score. Schroeder et al. [13] categorized the endoscopic findings of UC as follows: Mayo 0 is normal or inactive disease, Mayo 1 is mild disease (erythema, decreased vascular pattern, etc.), Mayo 2 is moderate disease (marked erythema, erosions, etc.), and Mayo 3 is severe disease (spontaneous bleeding, ulceration, etc.).
We performed five-fold cross-validation. The dataset was divided into training, validation, and test sets at a ratio of 60%, 20%, and 20%, respectively. The splits were made by random patient-disjoint sampling, and the class ratios were maintained in each set. Moreover, random oversampling was used to mitigate class imbalance in the training set.
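A patient-disjoint 60/20/20 rotation of this kind could be set up roughly as below. This is only a sketch: it splits by patient count, it does not enforce the class ratios that the experiments maintain, and the function name, the fold rotation, and the patient IDs are all hypothetical.

```python
import random
from typing import Dict, List, Sequence


def patient_disjoint_rounds(patient_ids: Sequence[str], n_folds: int = 5,
                            seed: int = 0) -> List[Dict[str, List[str]]]:
    """Return train/val/test patient lists for each cross-validation round.

    Patients (not images) are shuffled and dealt into n_folds chunks, so no
    patient appears in more than one subset of a round. With five folds,
    one chunk is the test set, the next one the validation set, and the
    remaining three the training set (roughly 60/20/20 by patient count).
    """
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    chunks = [patients[i::n_folds] for i in range(n_folds)]
    rounds = []
    for i in range(n_folds):
        val_idx = (i + 1) % n_folds
        train = [p for j, chunk in enumerate(chunks)
                 if j not in (i, val_idx) for p in chunk]
        rounds.append({"train": train, "val": chunks[val_idx], "test": chunks[i]})
    return rounds


# Hypothetical image-level list of patient IDs (three images per patient).
image_patient_ids = [f"p{i:02d}" for i in range(20)] * 3
cv_rounds = patient_disjoint_rounds(image_patient_ids)
```

Images are then assigned to a subset through their patient ID, which prevents near-duplicate frames of one patient from leaking between training and test data.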
Implementation: For csGAN, we used the same network structure and hyperparameter values (except that the number of iterations was 50,000) as the official implementation of StarGAN v2 [11]. For the regression model $f$, we used DenseNet [14] pretrained on ImageNet [15] and Adam as the optimizer with the initial learning rate set to . The batch size was set to 64. Training was terminated by early stopping (no decrease in validation loss for 20 epochs).
Evaluation Metric: We quantitatively evaluated the effect of C-DA by the prediction performance of the Mayo score severity classification by $f$. The predicted class (i.e., discrete Mayo level) of an image is determined by quantizing the model output to the nearest discrete level (e.g., 1.3 → 1). Since the dataset is substantially imbalanced in the number of images in each class, we mainly used the F1 score for the performance evaluation.
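The quantization and scoring steps can be sketched as follows. The averaging scheme for the F1 score is not stated in the text, so macro-averaging over classes is an assumption here, as are the function names and the choice to clip outputs that fall outside the valid range.

```python
import math
from typing import List, Sequence


def quantize(outputs: Sequence[float], lowest: int, highest: int) -> List[int]:
    """Round each regression output to the nearest discrete level,
    clipping to the valid range [lowest, highest]."""
    return [min(max(math.floor(o + 0.5), lowest), highest) for o in outputs]


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             classes: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Example with Mayo scores 0..3 as the discrete levels.
predicted = quantize([1.3, 2.6, -0.2, 3.4], lowest=0, highest=3)
```

Macro-averaging gives each class the same weight regardless of its size, which is why it is a common choice for imbalanced label distributions such as this one.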
Comparative Methods: We compared the performance of the proposed DA method (C-DA) with three comparative methods: 1) Baseline, which is conventional regression, 2) Classical DA, which is the baseline with DA by a random combination of horizontal/vertical flipping and rotation, and 3) GAN-based DA, which generates images with the original implementation of a cGAN, StyleGAN2-ADA [16]. For 2) and 3), we show the results with 5,000 generated images per class (i.e., 20,000 in total) because their validation F1 scores did not improve further when more generated images were used.
In addition, as an ablation study of C-DA, we evaluated the classification performance of a method that uses the original images as $\tilde{\Omega}$ (C-DA w/o GAN). We also performed C-DA under different severity intervals $\Delta$. Specifically, we examined $\Delta = 1$, $0.5$, and $0.25$ to generate 4, 7, and 13 images from a single $z$, respectively. We used 250 randomly selected $z$s and thus generated 1,000, 1,750, and 3,250 images for the respective values of $\Delta$. Note that the validation F1 scores were almost saturated at 250 $z$s; this means that our method saturates faster than the above conventional methods, which need 20,000 images ($= 5{,}000 \times 4$) to saturate.
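The image counts follow from simple arithmetic over the level grid, as the short check below illustrates. It assumes intervals of 1, 0.5, and 0.25 on a level range of width 3 (written here as levels 1 to 4), which is consistent with the 4, 7, and 13 images per random vector quoted above; the function name `level_grid` is hypothetical.

```python
from typing import Dict, List


def level_grid(delta: float, lowest: float = 1.0, highest: float = 4.0) -> List[float]:
    """Evenly spaced severity levels from lowest to highest with step delta."""
    n_steps = round((highest - lowest) / delta)
    return [lowest + i * delta for i in range(n_steps + 1)]


NUM_RANDOM_VECTORS = 250  # number of random vectors used per setting

# Total number of generated images for each interval.
images_per_delta: Dict[float, int] = {
    delta: NUM_RANDOM_VECTORS * len(level_grid(delta))
    for delta in (1.0, 0.5, 0.25)
}
```

Halving the interval roughly doubles the number of generated images per random vector, so the smallest interval yields the largest augmented set even though it is not the most effective one.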
4.2 Qualitative Evaluation of Generated Images
Fig. 3 shows the images generated by csGAN with and without the order loss. Each image was generated with from an original image at Mayo 0. With the order loss, the severity shifts smoothly between the generated images: the erythema becomes more intense and the semilunar folds gradually disappear as the severity increases. In contrast, without the order loss, the image generated at Mayo 1.5 shows substantial noise, and its severity is unclear. This observation confirms that the order loss stabilizes the generation of images at real-valued levels.
4.3 Classification Performance
Table 1 shows the classification performance of each method. Baseline and conventional DA methods had similar F1 scores, while C-DA () had a higher F1 score than the three comparison methods. These results indicate that the generated images with real-valued severity levels are more effective than conventional DAs. The following facts also confirm this effect. First, the F1 score of C-DA () was higher than that of C-DA w/o GAN. Second, F1 scores of C-DA () were even higher than that of C-DA ().
On the other hand, the results also show that image generation at $\Delta = 0.25$ is not very effective. As we noted before, the real-valued levels of the generated images are not completely reliable. Therefore, when $\Delta$ becomes smaller, the difference between neighboring levels (e.g., 0.25 and 0.5) becomes unreliable even as relative levels. This fact indicates a limitation in generating images at real-valued levels and, at the same time, supports the validity of our dataset-disjoint optimization strategy.
Fig. 4 shows box plots of the model output for test images of each Mayo score. Here, (a) is the regression (Baseline) and (b) the proposed DA (C-DA ()). The horizontal and vertical axes correspond to the correct Mayo score and the model output, respectively. Overall, the model outputs of C-DA have a narrower interquartile range for each Mayo score than those of Baseline. In particular, the overlap between the interquartile ranges of Mayo 2 and Mayo 3 is decreased. Consequently, C-DA had better classification performance, even for minority classes with fewer images.
5 Conclusion
We proposed a continuous data augmentation (DA) scheme comprising two techniques: continuous severity GAN (csGAN) to generate medical images with real-valued severities and dataset-disjoint multi-objective optimization to utilize the generated images. Through qualitative and quantitative evaluations on an endoscopic ulcerative colitis (UC) image dataset, we confirmed that our DA scheme achieves higher F1 scores by utilizing appropriately generated images.
The current limitations of this work are as follows. First, although our method is in principle applicable to various tasks with real-valued conditions, it has been evaluated on a single dataset, and we need to examine it with other datasets. Second, our UC dataset has only discrete levels, so our quantitative evaluation could be made only in a discrete manner. We will examine different performance evaluations if we find a medical dataset with reliable real-valued annotations.
6 Compliance with Ethical Standards
This study was performed in line with the principles of the Declaration of Helsinki. Ethical approval for this study was granted by the Ethics Committee of the Kyoto Second Red Cross Hospital.
7 Acknowledgments
This work was supported by JSPS KAKENHI, JP21K18312, and JST SPRING, Grant Number JPMJSP2136.
References
- [1] Andrew Brock, Jeff Donahue, and Karen Simonyan, “Large Scale GAN Training for High Fidelity Natural Image Synthesis,” in International Conference on Learning Representations, 2019.
- [2] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena, “Self-Attention Generative Adversarial Networks,” in Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 7354–7363.
- [3] Augustus Odena, Christopher Olah, and Jonathon Shlens, “Conditional Image Synthesis with Auxiliary Classifier GANs,” in International Conference on Machine Learning, 2017, pp. 2642–2651.
- [4] Jie Gui, Zhenan Sun, Yonggang Wen, Dacheng Tao, and Jieping Ye, “A Review on Generative Adversarial Networks: Algorithms, Theory, and Applications,” IEEE Transactions on Knowledge and Data Engineering, pp. 1–1, 2021.
- [5] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros, “Image-to-Image Translation with Conditional Adversarial Networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
- [6] Xin Ding, Yongwei Wang, Zuheng Xu, William J Welch, and Z Jane Wang, “CcGAN: Continuous Conditional Generative Adversarial Networks for Image Generation,” in International Conference on Learning Representations, 2021.
- [7] Phillip Chlap, Hang Min, Nym Vandenberg, Jason Dowling, Lois Holloway, and Annette Haworth, “A Review of Medical Image Data Augmentation Techniques for Deep Learning Applications,” Journal of Medical Imaging and Radiation Oncology, vol. 65, no. 5, pp. 545–563, 2021.
- [8] Yunpeng Wang, Lingxiao Zhou, Mingming Wang, Cheng Shao, Lili Shi, Shuyi Yang, Zhiyong Zhang, Mingxiang Feng, Fei Shan, and Lei Liu, “Combination of Generative Adversarial Network and Convolutional Neural Network for Automatic Subcentimeter Pulmonary Adenocarcinoma Classification,” Quantitative Imaging in Medicine and Surgery, vol. 10, no. 6, pp. 1249–1264, 2020.
