Joint Learning of Blind Super-Resolution and Crack Segmentation for Realistic Degraded Images
Yuki Kondo, Norimichi Ukita

TL;DR
This paper introduces a joint deep learning framework that simultaneously enhances low-resolution, degraded images and segments cracks, improving accuracy in realistic scenarios with unknown image degradations.
Contribution
It presents a novel end-to-end joint learning approach for blind super-resolution and crack segmentation, including new network paths for mutual optimization.
Findings
Outperforms state-of-the-art segmentation methods
Effective in real-world degraded image scenarios
Ablation studies confirm the benefits of joint training
Abstract
This paper proposes crack segmentation augmented by super resolution (SR) with deep neural networks. In the proposed method, a SR network is jointly trained with a binary segmentation network in an end-to-end manner. This joint learning allows the SR network to be optimized for improving segmentation results. For realistic scenarios, the SR network is extended from non-blind to blind for processing a low-resolution image degraded by unknown blurs. The joint network is improved by our proposed two extra paths that further encourage the mutual optimization between SR and segmentation. Comparative experiments with State of The Art (SoTA) segmentation methods demonstrate the superiority of our joint learning, and various ablation studies prove the effects of our contributions.
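Table 1: Problems in crack segmentation (columns) and our contributions (rows).
| Contribution | A. Class imbalance | B. Fine cracks | C. LR cracks | D. Blur |
| 1. CSBSR | | | | |
| 2. BC loss | | | | |
| 3. Segmentation-aware SR-loss weights | | | | |
| 4. Blur-reflected task learning | | | | |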
Table 2: Quantitative comparison on the degraded Khanhha test images.
| | Segmentation metrics | | | | SR metrics | |
| Model | IoU | AIU | HD95 | AHD95 | PSNR | SSIM |
| (a) Segmentation in HR | 0.616 | 0.559 | 6.20 | 11.89 | - | - |
| (b) SrcNet [12] | 0.368 | 0.320 | 95.16 | 130.47 | 27.82 | 0.639 |
| (c) DSRL [102] | 0.391 | 0.285 | 44.23 | 148.97 | 20.16 | 0.501 |
| (d) KBPN + PSPNet [124] | 0.548 | 0.524 | 28.45 | 31.62 | 28.62 | 0.706 |
| (e) CSBSR w/ PSPNet [124] | 0.573 | 0.552 | 20.92 | 22.52 | 28.75 | 0.703 |
| (f) KBPN + (HRNet+OCR [112]) | 0.522 | 0.501 | 26.45 | 28.74 | 28.68 | 0.706 |
| (g) CSBSR w/ HRNet+OCR [112] | 0.553 | 0.534 | 17.54 | 20.29 | 27.66 | 0.668 |
| (h) KBPN + CrackFormer [69] | 0.447 | 0.424 | 46.86 | 58.91 | 28.68 | 0.706 |
| (i) CSBSR w/ CrackFormer [69] | 0.469 | 0.443 | 39.37 | 56.59 | 25.93 | 0.571 |
| (j) KBPN + U-Net [87] | 0.470 | 0.455 | 45.26 | 45.94 | 28.68 | 0.706 |
| (k) CSBSR w/ U-Net [87] | 0.530 | 0.506 | 26.33 | 27.24 | 28.68 | 0.702 |
| | | Segmentation metrics | | | | SR metrics | |
| Model | Task weight | IoU | AIU | HD95 | AHD95 | PSNR | SSIM |
| CSBSR w/ PSPNet | w/o joint learning | 0.548 | 0.524 | 28.45 | 31.62 | 28.62 | 0.706 |
| | 0.1 | 0.563 | 0.541 | 19.16 | 21.96 | 28.73 | 0.705 |
| | 0.3 | 0.573 | 0.552 | 20.92 | 22.52 | 28.75 | 0.703 |
| | 0.5 | 0.572 | 0.550 | 18.80 | 21.18 | 28.69 | 0.701 |
| | 0.7 | 0.551 | 0.528 | 23.31 | 28.66 | 28.07 | 0.687 |
| | 0.9 | 0.554 | 0.533 | 26.03 | 27.29 | 27.72 | 0.669 |
| | 1.0 | 0.565 | 0.544 | 19.27 | 22.32 | 22.78 | 0.472 |
| | Increasing | 0.568 | 0.549 | 16.24 | 19.02 | 27.12 | 0.662 |
| CSSR w/ PSPNet | w/o joint learning | 0.531 | 0.512 | 36.01 | 38.33 | 27.85 | 0.667 |
| | 0.1 | 0.547 | 0.529 | 24.45 | 28.27 | 28.42 | 0.653 |
| | 0.3 | 0.475 | 0.446 | 53.75 | 55.96 | 28.47 | 0.663 |
| | 0.5 | 0.546 | 0.523 | 22.12 | 24.61 | 28.39 | 0.657 |
| | 0.7 | 0.557 | 0.539 | 21.20 | 24.74 | 28.35 | 0.656 |
| | 0.9 | 0.552 | 0.534 | 20.88 | 22.48 | 28.01 | 0.653 |
| | 1.0 | 0.539 | 0.515 | 21.82 | 26.04 | 20.29 | 0.436 |
| | Increasing | 0.544 | 0.512 | 28.28 | 35.30 | 27.02 | 0.635 |
| | Segmentation metrics | | | | SR metrics | |
| Model | IoU | AIU | HD95 | AHD95 | PSNR | SSIM |
| BC loss (Ours) | 0.573 | 0.552 | 20.92 | 22.52 | 28.75 | 0.703 |
| GBC loss (Ours) | 0.551 | 0.534 | 23.34 | 33.46 | 28.70 | 0.705 |
| WCE [26] | 0.569 | 0.459 | 16.91 | 26.29 | 28.60 | 0.704 |
| Dice [77] | 0.466 | 0.465 | 59.21 | 59.65 | 28.66 | 0.704 |
| Combo [98] | 0.483 | 0.436 | 39.48 | 62.27 | 28.51 | 0.697 |
| Boundary [55] + GDice [97] | 0.469 | 0.425 | 65.13 | 68.90 | 28.31 | 0.692 |
| | Segmentation metrics | | | | SR metrics | |
| Model | IoU | AIU | HD95 | AHD95 | PSNR | SSIM |
| CSBSR | 0.573 | 0.552 | 20.92 | 22.52 | 28.75 | 0.703 |
| w/ | 0.558 | 0.535 | 19.72 | 22.90 | 27.32 | 0.649 |
| w/ () | 0.553 | 0.531 | 19.21 | 26.02 | 28.70 | 0.703 |
| w/ () | 0.573 | 0.551 | 18.73 | 21.70 | 28.73 | 0.702 |
| w/ () for | 0.556 | 0.531 | 22.26 | 25.94 | 28.70 | 0.706 |
| | Segmentation metrics | | | | SR metrics | | |
| Model | IoU | AIU | HD95 | AHD95 | PSNR | SSIM | Kernel PSNR |
| CSBSR | 0.573 | 0.552 | 20.92 | 22.52 | 28.75 | 0.703 | 50.65 |
| CSBSR w/ KS | 0.544 | 0.523 | 28.86 | 32.02 | 28.52 | 0.696 | 50.82 |
| CSBSR w/ KS and | 0.550 | 0.528 | 18.06 | 19.10 | 28.65 | 0.702 | 50.91 |
Affiliation: Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya, Aichi 468-8511, Japan
ORCID: 0000-0002-5263-8722, 0000-0002-0240-1065
Keywords: Crack segmentation, Image processing, Blind super resolution, Joint learning, Multi-task learning
1 Introduction
While many structures and infrastructures such as buildings, pavements, bridges, and tunnels around the world are deteriorating, it is difficult to inspect all of them manually at all times. Automatic inspection is a promising alternative for efficiently diagnosing these structures. While such inspection can be achieved by several types of sensors such as the Falling Weight Deflectometer, the Pavement Density Profiler, and the Ground Penetrating Radar, this paper focuses on crack segmentation on images captured by generic cameras for visual inspection.
Crack segmentation [31] is defined as binary semantic segmentation in the field of computer vision. While the number of classes in crack segmentation (i.e., two classes) is much smaller than that of recent generic segmentation tasks, such as scene segmentation [53, 70] and aerial image segmentation [63], real-world crack segmentation is not an easy problem even with recent powerful deep neural networks, for the following reasons:
A. High class-imbalance: The number of crack pixels is much less than the number of non-crack pixels (i.e., background pixels), as shown in Fig. 1 (b). In such a problem, all pixels tend to be classified to background.
B. Fine cracks: Cracks can be hairline, which are difficult to be segmented, as shown in Fig. 1 (b).
C. Low-Resolution (LR): For inspection of various structures such as tunnels [96, 11], pavements [32, 5], and bridges [84, 25], an inspection camera captures cracks in LR (as shown in Fig. 1 (a)) because it cannot get close to the structures for safety reasons.
D. Cracks in blurred images: Since inspection images are usually captured from moving vehicles such as cars and drones for efficient inspection, those images can be blurred, as shown in Fig. 1 (a).
While even each of the aforementioned problems is not an easy problem, crack segmentation is more challenging due to the combination of all of these problems, even with SoTA methods, as shown in Fig. 1 (c) and (d). To cope with these problems, this paper proposes a unified framework consisting of the following novel contributions (Table 1):
1. Crack Segmentation with Blind Super-Resolution (CSBSR): As with Crack Segmentation with Super Resolution (CSSR) proposed in our earlier conference paper [60], CSBSR proposed in this paper connects “a network for Super Resolution (SR) accepting an input LR image” in series to “a segmentation network” for end-to-end joint learning. We extend CSSR to CSBSR with blind SR to handle realistically-blurred images. Our joint learning of blind SR and segmentation allows us to optimize SR for improving segmentation (Fig. 1 (e)) more than similar methods [12, 102] using both SR and segmentation (Fig. 1 (c) and (d)).
2. Boundary Combo (BC) loss for segmentation: In addition to super-resolving tiny cracks as mentioned above, fine boundaries are locally evaluated with global constraints in the whole image for detecting fine cracks robustly to the class-imbalance problem.
3. Segmentation-aware SR-loss weights: While CSSR and CSBSR use BC loss to train not only the segmentation network but also the SR network in an end-to-end manner, the SR network is less optimized due to gradient vanishing through the segmentation network. To train the SR network more for segmentation, BC loss directly weights a loss for SR. For further improvement, the SR loss is also weighted by additional weights based on fine-crack and hard-negative pixels.
4. Blur skip for blur-reflected task learning: Since an SR image is imperfect, blur effects remaining in the SR image give a negative impact on segmentation. To make segmentation more robust to these blur effects, the blur estimated in SR is provided to the segmentation network via a skip connection.
Our code is available at https://github.com/Yuki-11/CSBSR.
2 Related Work
2.1 Image Segmentation
Image segmentation techniques [78] are broadly divided into three categories, namely semantic segmentation [90], instance segmentation [47], and panoptic segmentation [59]. Crack segmentation is categorized into semantic segmentation because it classifies all pixels into crack and background pixels with no instance. That is, these crack pixels are not divided into crack instances.
Class-imbalance Segmentation: As in various other computer vision problems, class imbalance is a critical problem in image segmentation. Many approaches for class imbalance are applicable to class-imbalance segmentation tasks. Examples include weighted losses such as the Weighted Cross Entropy (WCE) loss [26] and the focal loss [67] for segmentation [64, 49, 109], re-sampling [126] for segmentation [18], and hard mining [29] for segmentation [35].
Among all segmentation tasks, medical image segmentation has to cope with highly-imbalanced classes (e.g., tiny tumors and background). Such difficult medical image segmentation is tackled by a variety of loss functions such as the Dice loss [77], the Generalized Dice loss [97], the Combo loss [98], the Hausdorff loss [54], and the Boundary loss [55].
Crack Segmentation: Since the class-imbalance issue is important also for crack segmentation as presented as Problem <A> in Table 1, the aforementioned schemes proposed against class imbalance are useful for crack segmentation. For example, in order to balance the number of samples between classes, Crack GAN [119] oversamples crack images by using DC-GAN [85]. The Dice, Combo, and WCE losses are employed for crack segmentation in [86], in [19], and in [71, 128, 69], respectively.
In addition to the class-imbalance issue, the fine boundaries of cracks are not easy to be extracted and make crack segmentation difficult, as presented as Problem <B> in Table 1. For such difficult fine crack segmentation, the aforementioned schemes proposed against class imbalance (e.g., weighted loss, re-sampling, class-imbalance-oriented loss) are also useful. Previous methods for such fine cracks are divided into the following two approaches, namely boundary-based and coarse-to-fine weighting.
In the boundary-based approach, the distance between the boundaries of ground-truth and predicted cracks is minimized. In [54], the Hausdorff distance is evaluated by using the distance transform. While its computational cost for the exact solution is high, the sum of L2 distances is approximated by the sum of regional integrals for efficiency in the Boundary loss [55].
Various coarse-to-fine weighting approaches such as [20] employ pyramid and U-net like networks for weighting a fine but unreliable representation by more reliable results in a coarse representation. The effectiveness of this approach is validated also in crack segmentation [107, 128, 71, 69, 22, 66].
While the effectiveness of both approaches is validated, the coarse-to-fine weighting approach is applicable only to pyramid and U-Net-like architectures. On the other hand, the boundary-based approach can in general be combined with any other loss function in any network architecture.
2.2 Super Resolution (SR)
Non-blind SR: SR reconstructs a High-Resolution (HR) image, $I^{HR}$, from its LR image, $I^{LR}$. The image degradation process from HR to LR is modeled as follows:
$$I^{LR} = (I^{HR} \otimes k)\downarrow_{s}, \tag{1}$$
where $k$, $\otimes$, $\downarrow$, and $s$ denote a blur kernel, a convolution operator, a downsampling process, and a scaling factor, respectively. By downscaling HR training images to their LR images by a known downsampling process, such as bicubic interpolation, we can have a set of $I^{HR}$ and $I^{LR}$ for training a non-blind SR model.
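The degradation model above (blur, then downsample) can be illustrated with a minimal pure-Python sketch. It is not the authors' data pipeline: the function names are hypothetical, the kernel parameters are arbitrary, and plain decimation stands in for the bicubic downsampling used in the paper.

```python
import math

def gaussian_kernel(size, sigma_x, sigma_y, theta):
    """Anisotropic 2D Gaussian kernel (odd size), rotated by theta, summing to 1."""
    half = size // 2
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates into the kernel's principal axes.
            u = cos_t * x + sin_t * y
            v = -sin_t * x + cos_t * y
            row.append(math.exp(-0.5 * ((u / sigma_x) ** 2 + (v / sigma_y) ** 2)))
        kernel.append(row)
    total = sum(map(sum, kernel))
    return [[w / total for w in row] for row in kernel]

def degrade(hr, kernel, scale):
    """Blur `hr` (2D list) with `kernel`, then keep every `scale`-th pixel.

    Borders are handled by edge replication. The kernel is point-symmetric,
    so correlation and convolution give the same result here.
    Assumption: decimation replaces bicubic downsampling for brevity.
    """
    h, w = len(hr), len(hr[0])
    half = len(kernel) // 2
    blurred = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky, krow in enumerate(kernel):
                for kx, kw in enumerate(krow):
                    yy = min(max(y + ky - half, 0), h - 1)
                    xx = min(max(x + kx - half, 0), w - 1)
                    acc += kw * hr[yy][xx]
            blurred[y][x] = acc
    return [row[::scale] for row in blurred[::scale]]

# An 8x8 "HR" patch with a one-pixel-wide dark crack on a bright background.
hr_patch = [[0.0 if x == 3 else 1.0 for x in range(8)] for _ in range(8)]
lr_patch = degrade(hr_patch, gaussian_kernel(5, 2.0, 0.5, math.pi / 4), scale=4)
```

After degradation the 8x8 patch becomes 2x2 and the hairline crack is smeared into its neighbors, which is the situation Problems C and D describe.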
Such non-blind SR is developed with various aspects [101, 17, 37, 116, 105] such as arbitrary image degradations [117, 118, 115], attention mechanisms [24, 81, 76], recurrent/iterative networks [65, 61, 43], stochastic generation [89, 13, 73], and reference-based SR [122, 92, 72]. However, since the image degradation process is assumed to be known in all of these non-blind SR methods, their performance is decreased in real-world images with arbitrary unknown degradations.
Blind SR: To apply SR to arbitrarily-degraded images, the blur kernel $k$ is employed in blind SR. Even without modeling $k$ in a SR network, blind SR can be done by blurring training images [95, 125, 83, 48, 114] or by deblurring input images [52] by $k$. In the kernel conditioning approach [23, 36, 74, 51, 103], a blur representation estimated from an input LR image is fed into a SR network for conditioning the SR process by the blur. While this kernel conditioning employs low-dimensional blur representations for efficiency and stability in general, the original blur kernel, $k$, is modeled within a SR network for further accuracy in [38, 57].
Since the blur kernel is more informative than its low-dimensional representation, the blur kernel can be useful for additional tasks using a SR image. As such an additional task, image segmentation is done in our work.
2.3 Joint Learning of SR and Other Tasks
With upscaled SR images, a variety of applications can be realized. Examples include distant-object detection [40, 50, 33] and segmentation [39], remote sensing [10, 93], wide-angle image analysis [21, 8], and cell image analysis [68, 79]. As with these examples, crack segmentation can also be supported by SR [12] for detecting blurred LR cracks presented in Problems <C> and <D> in Table 1.
While these methods have models for SR and another task (e.g., segmentation) separately (Fig. 2 (a)), these tasks can be jointly trained in a single model for supporting the additional task more explicitly. Such joint end-to-end learning is also applicable in a variety of tasks such as classification [99, 94] and detection (e.g., faces [14], pedestrians [82, 121], vehicles [7], and generic objects [62, 15, 44]).
Image segmentation can be also improved by combining with SR. As shown in Fig. 2 (b), DSRL [102] applies multi-task learning to the non-blind SR and segmentation tasks so that a single feature extractor is shared by the parallel SR and segmentation branches following the extractor. While multi-task learning may improve both tasks, the SR and segmentation branches are independently trained.
3 Joint Learning of Blind SR and Crack Segmentation
While methods using joint end-to-end learning with SR [99, 94, 14, 82, 121, 62, 15, 44, 102] mentioned in Sec. 2.3 are close to our work, it is difficult to apply them to crack segmentation in realistic scenarios. This is because these methods using non-blind SR cannot cope with unknown blurs observed in images degraded by out-of-focus and motion blurs. Our CSBSR resolves this problem by employing blind SR in joint learning (Sec. 3.1). For further coping with the class-imbalance issue, this paper also proposes a new combination of loss functions for class-imbalance fine segmentation (Sec. 3.2). In addition to joint learning, we propose loss weighting for optimizing SR more for segmentation (Sec. 3.3) and an extra skip-connection path that makes segmentation more robust to the blur remaining in SR images (Sec. 3.4).
3.1 Joint Learning
CSBSR consists of blind SR and segmentation networks, as shown in Fig. 2 (c). Its detail is shown in Fig. 3. The blind SR network $S(I^{LR}; \theta_S)$, where $\theta_S$ denotes all the parameters of this SR network, maps the LR image $I^{LR}$ to its SR image, $I^{SR}$. The crack segmentation network, $C(I^{SR}; \theta_C)$, takes $I^{SR}$ and outputs a crack segmentation image $I^{C}$. Any differentiable SR and crack segmentation networks can be employed as $S$ and $C$, respectively. Let $\mathcal{L}_S$ and $\mathcal{L}_C$ denote loss functions for $S$ and $C$, respectively. While $\mathcal{L}_S$ is back-propagated through $S$, $\mathcal{L}_C$ is back-propagated through $C$ and $S$ in an end-to-end manner. The whole network is trained by the following loss with the task weight $\beta$ as a hyper-parameter:
$$\mathcal{L} = (1-\beta)\,\mathcal{L}_S + \beta\,\mathcal{L}_C. \tag{2}$$
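As a minimal sketch of how a single task weight can balance the two objectives, the function below assumes that the joint loss is a convex combination of the SR loss and the segmentation loss. This form is an assumption here, although it is consistent with the ablation table, where the SR scores collapse when the task weight reaches 1.0. The name `joint_loss` and the scalar inputs are illustrative only.

```python
def joint_loss(loss_sr, loss_seg, beta):
    """Convex combination of the SR and segmentation losses.

    Assumed form: (1 - beta) * loss_sr + beta * loss_seg, with beta in [0, 1].
    beta = 0 trains for SR only; beta = 1 ignores the SR loss entirely.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return (1.0 - beta) * loss_sr + beta * loss_seg

# Sweep the task weight values that appear in the ablation table.
for beta in (0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(beta, joint_loss(loss_sr=2.0, loss_seg=4.0, beta=beta))
```

In an actual training loop the two arguments would be differentiable tensors, so that the segmentation term also sends gradients back through the SR network.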
Implementation details
In our experiments, DBPN [45, 43] and its extension to blind SR [111], which is called KBPN, are employed as the SR network $S$ for fair comparison between our proposed methods with non-blind SR and blind SR (i.e., comparison between CSSR and CSBSR). Different from DBPN as non-blind SR, KBPN also outputs its estimated blur kernel. Loss functions used in DBPN [45, 43] and KBPN [111] are used as the SR loss $\mathcal{L}_S$ in our joint learning with no change.
The segmentation network $C$ is implemented with each of U-Net [87], PSPNet [124], CrackFormer [69], and HRNet+OCR [112] for validating the wide applicability of our method. (Footnote 1: The implementations of these SR and segmentation networks are publicly available [41, 110, 1, 123, 2, 113].) Section 3.2 proposes a new general-purpose segmentation loss, which is applicable to all of these networks as the segmentation loss $\mathcal{L}_C$.
3.2 Boundary Combo Loss
For suppressing class-imbalance difficulty in crack segmentation, we propose the Boundary Combo (BC) loss that simultaneously achieves locally-fine and globally-robust segmentation. Fine segmentation can be achieved by the boundary-based approach such as the Boundary loss [55]. However, if only the boundary-based approach is employed, the segmentation network easily falls into local minima, as validated in [55]. This problem can be resolved by employing the boundary-based approach simultaneously with a loss that evaluates the whole image region. In [55], the Generalized Dice (GDice) loss [97] is empirically demonstrated to be a good choice. However, it is reported that the Sigmoid function included in the GDice loss and its original Dice loss tends to cause the vanishing gradient problem [98].
This paper explores more appropriate losses combined with the Boundary loss for stable learning as well as fine segmentation. We improve learning stability by combining the GDice loss with the WCE loss, which is expressed without the derivative of the Sigmoid function that tends to cause gradient vanishing. Since the Dice loss and the WCE loss have different properties (i.e., they are categorized as region-based and distribution-based losses, respectively, as introduced in [75]), it is also validated that a pair of the Dice and WCE losses, which is called the Combo loss [98], work complementarily for better segmentation. Finally, we propose the following Boundary Combo (BC) loss, $\mathcal{L}_{BC}$, as the segmentation loss $\mathcal{L}_C$ in our joint learning:
[TABLE]
where $\mathcal{L}_B$, $\mathcal{L}_D$, and $\mathcal{L}_{WCE}$ denote the Boundary, Dice, and WCE losses, respectively, and the two weighting coefficients are hyper-parameters. $\mathcal{L}_{BC}$ consists of region-based, distribution-based, and boundary-based losses. According to the survey [75], a combination of these three loss categories has not been evaluated before. As a variant of $\mathcal{L}_{BC}$, we also propose the GBC loss, $\mathcal{L}_{GBC}$, in which the GDice loss $\mathcal{L}_{GD}$ is used instead of $\mathcal{L}_D$.
While one may refer to the original papers of the Boundary, Dice, GDice, and WCE losses for the details, these losses are briefly explained in the following three paragraphs.
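Boundary loss ($\mathcal{L}_B$)
The Boundary loss [55] computes the distance-weighted 2D area between the ground-truth crack and its estimated one, which becomes zero in the ideal estimation, as follows:
$$\mathrm{Dist}(\partial G, \partial \hat{G}) = \int_{\partial G} \lVert v_{\partial \hat{G}}(u) - u \rVert^{2}\, du \simeq 2\int_{\Delta \hat{G}} D_G(q)\, dq \tag{4}$$
$$= 2\left(\int_{\Omega} \phi_G(q)\,\hat{g}(q)\, dq - \int_{\Omega} \phi_G(q)\, g(q)\, dq\right), \tag{5}$$
where $G$ and $\hat{G}$ denote the pixel sets of the ground-truth crack and its estimated one, respectively. $u$ and $v_{\partial \hat{G}}(u)$ denote a point on boundary $\partial G$ and its corresponding point on boundary $\partial \hat{G}$, respectively. $v_{\partial \hat{G}}(u)$ is an intersection between $\partial \hat{G}$ and a normal of $\partial G$ at $u$. $\Delta \hat{G}$ is the mismatch part between $G$ and $\hat{G}$. $D_G(q)$ is the distance map from $\partial G$. $\hat{g}(q)$ and $g(q)$ are binary indicator functions, where $\hat{g}(q) = 1$ and $g(q) = 1$ if $q \in \hat{G}$ and $q \in G$, respectively. $\phi_G$ is the level set representation of boundary $\partial G$: $\phi_G(q) = -D_G(q)$ if $q \in G$, and $\phi_G(q) = D_G(q)$ otherwise. $\Omega$ denotes a pixel set in the image. The second term in Eq. (5) is omitted as it is independent of the network parameters. By replacing $\hat{g}(q)$ by the network softmax outputs $\hat{g}_{\theta}(q)$, we obtain the Boundary loss function below:
$$\mathcal{L}_B(\theta) = \int_{\Omega} \phi_G(q)\, \hat{g}_{\theta}(q)\, dq.$$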
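The Boundary loss of [55] weights each predicted crack probability by a signed distance to the ground-truth boundary. The sketch below is a brute-force illustration on tiny 2D lists, not the implementation used in the paper. It assumes that the distance to the nearest pixel of the opposite class approximates the distance to the boundary, and it averages over pixels instead of integrating. The function names are hypothetical.

```python
import math

def signed_distance_map(gt):
    """Level-set map of a binary mask: negative inside the crack, positive outside.

    Assumption: the distance to the nearest pixel of the opposite class
    stands in for the distance to the boundary.
    """
    h, w = len(gt), len(gt[0])
    fg = [(y, x) for y in range(h) for x in range(w) if gt[y][x] == 1]
    bg = [(y, x) for y in range(h) for x in range(w) if gt[y][x] == 0]
    phi = [[0.0] * w for _ in range(h)]
    if not fg or not bg:
        return phi  # no boundary exists, so the map is left at zero
    for y in range(h):
        for x in range(w):
            targets = bg if gt[y][x] == 1 else fg
            d = min(math.hypot(y - ty, x - tx) for ty, tx in targets)
            phi[y][x] = -d if gt[y][x] == 1 else d
    return phi

def boundary_loss(prob, gt):
    """Mean over all pixels of phi(q) * predicted crack probability at q."""
    phi = signed_distance_map(gt)
    n = len(gt) * len(gt[0])
    return sum(phi[y][x] * prob[y][x]
               for y in range(len(gt)) for x in range(len(gt[0]))) / n

# A vertical one-pixel crack in column 2 of a 5x5 mask.
gt_mask = [[1 if x == 2 else 0 for x in range(5)] for _ in range(5)]
perfect = [[float(v) for v in row] for row in gt_mask]
shifted = [[1.0 if x == 4 else 0.0 for x in range(5)] for _ in range(5)]
print(boundary_loss(perfect, gt_mask), boundary_loss(shifted, gt_mask))
```

A prediction that sits on the crack yields a negative value, while a prediction displaced by two pixels is penalized in proportion to its distance from the true boundary. The loss therefore reacts to small displacements even when the crack occupies very few pixels.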
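Dice and GDice losses ($\mathcal{L}_D$ and $\mathcal{L}_{GD}$)
The Dice loss [77] is based on the Dice coefficient, which is a harmonic mean of precision and recall, and is expressed as follows:
$$\mathcal{L}_D = 1 - \frac{2\sum_{c=1}^{K}\sum_{i=1}^{N} p_{ci}\, g_{ci}}{\sum_{c=1}^{K}\sum_{i=1}^{N} \left(p_{ci} + g_{ci}\right)},$$
where $K$ and $N$ denote the number of classes (i.e., $K = 2$ in our problem) and the number of all pixels in each image, respectively. $p_{ci}$ and $g_{ci}$ are the classification probability ($p_{ci} \in [0, 1]$) and its ground truth ($g_{ci} \in \{0, 1\}$).
Different from the Dice loss, the GDice loss [97] is weighted by the number of pixels in each class as follows:
$$\mathcal{L}_{GD} = 1 - \frac{2\sum_{c=1}^{K} w_c \sum_{i=1}^{N} p_{ci}\, g_{ci}}{\sum_{c=1}^{K} w_c \sum_{i=1}^{N} \left(p_{ci} + g_{ci}\right)},$$
where $w_c = 1 / \left(\sum_{i=1}^{N} g_{ci}\right)^{2}$.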
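To make the difference between the two region-based losses concrete, the following sketch evaluates both on a toy image with one crack pixel among ten. It assumes the common soft-Dice form with a non-squared denominator and the inverse squared class-volume weights of the GDice loss [97]; the exact form used in the paper may differ. The data layout (`prob[c][i]` for class `c` and pixel `i`) and the `eps` smoothing term are choices made only for this illustration.

```python
def dice_loss(prob, gt, eps=1e-7):
    """Soft Dice loss summed over classes and pixels (assumed form)."""
    inter = sum(p * g for pc, gc in zip(prob, gt) for p, g in zip(pc, gc))
    total = sum(p + g for pc, gc in zip(prob, gt) for p, g in zip(pc, gc))
    return 1.0 - 2.0 * inter / (total + eps)

def generalized_dice_loss(prob, gt, eps=1e-7):
    """GDice loss: each class is weighted by 1 / (number of its pixels)^2."""
    num = den = 0.0
    for pc, gc in zip(prob, gt):
        w = 1.0 / (sum(gc) ** 2 + eps)
        num += w * sum(p * g for p, g in zip(pc, gc))
        den += w * sum(p + g for p, g in zip(pc, gc))
    return 1.0 - 2.0 * num / (den + eps)

# Ten pixels, one of which is crack. Class 0 = crack, class 1 = background.
gt_onehot = [[1] + [0] * 9, [0] + [1] * 9]
# A degenerate prediction that labels every pixel as background.
all_background = [[0.0] * 10, [1.0] * 10]

print(dice_loss(all_background, gt_onehot))              # about 0.10
print(generalized_dice_loss(all_background, gt_onehot))  # about 0.82
```

The all-background prediction is barely penalized by the plain Dice loss summed over both classes, while the class-volume weights of GDice make the missed crack dominate the score. This is the class-imbalance behavior the section is concerned with.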
WCE loss ($\mathcal{L}_{WCE}$)
The WCE loss [26] is the Cross Entropy loss weighted by a hyper parameter, , which is determined based on the class imbalance (e.g., where and is the number of all training images):
[TABLE]
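A minimal sketch of a weighted cross entropy for the binary crack/background case is given below. It assumes the standard form in which only the crack term is multiplied by a class weight; the way the paper derives that weight from the training set is not reproduced here, and `crack_weight` is simply a free parameter of this illustration.

```python
import math

def wce_loss(prob_crack, gt, crack_weight, eps=1e-7):
    """Binary cross entropy whose crack term is scaled by `crack_weight`.

    prob_crack: predicted crack probabilities per pixel.
    gt: ground-truth labels per pixel (1 = crack, 0 = background).
    Assumed form: -mean(w * g * log(p) + (1 - g) * log(1 - p)).
    """
    total = 0.0
    for p, g in zip(prob_crack, gt):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total -= crack_weight * g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total / len(gt)

# Missing a crack pixel costs more as the crack weight grows.
print(wce_loss([0.1], [1], crack_weight=1.0))
print(wce_loss([0.1], [1], crack_weight=5.0))
```

Because the weight only multiplies the crack term, errors on background pixels are scored exactly as in the unweighted cross entropy.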
3.3 Segmentation-aware Weights for SR
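In addition to end-to-end learning with the segmentation loss $\mathcal{L}_C$ (i.e., the segmentation loss in Eq. (2)), we propose to weight the SR loss $\mathcal{L}_S$ by $\mathcal{L}_C$ for further optimizing the SR network for segmentation. This weighting is achieved by pixelwise multiplying $\mathcal{L}_S$ by $\mathcal{L}_C$.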
It is not yet easy to discriminate between crack and background pixels for precisely detecting fine cracks. This difficulty arises especially around crack pixels. For such difficult pixelwise segmentation, our method employs the following two difficulty-aware weights:
- For detecting all fine thin cracks, a segmentation loss function is weighted so that pixels inside and around cracks are weighted higher. The weight given to each pixel is expressed as follows:
[TABLE]
where the two quantities in the equation denote a weight constant and the distance between the pixel and its nearest crack pixel, respectively. This weight is called the Crack-Oriented (CO) weight.
- For hard pixel mining, a segmentation loss function is weighted so that pixels inside and around false-positive and false-negative pixels are weighted higher. For such difficulty-aware segmentation, in our method, the weight given to each pixel is expressed as follows:
[TABLE]
where the two pixel values in the equation are those of the $i$-th pixel in the predicted and ground-truth segmentation images, respectively, and the remaining parameter is a weight constant. This weight is applicable to any loss function such as our BC loss, Eq. (3), consisting of multiple loss functions, while the focal loss [67] and the anchor loss [88], both of which also penalize hard samples, are based on a weighted cross entropy loss. This weight is called the Fail-Oriented (FO) weight.
These two weights, Eqs. (9) and (10), are multiplied pixelwise by the SR loss.
3.4 Blur Skip for Blur-reflected Task Learning
It is not easy for the blind SR network to perfectly predict the ground-truth blur kernel $k$ and the ground-truth HR image $I^{HR}$ that satisfy the degradation model in Eq. (1). Let $\hat{k}$ and $k_r$ denote the predicted kernel and the blur kernel that remains in the SR image $I^{SR}$, respectively. We assume that $k_r$ correlates with $\hat{k}$.
Based on this assumption, this paper proposes blur-reflected segmentation learning via a skip connection, which is called the blur skip, from the SR network $S$ to the segmentation network $C$. This skip connection forwards $\hat{k}$ to the end of $C$ in order to condition features extracted by $C$ with $\hat{k}$. While this conditioning is achieved by the Spatial Feature Transform (SFT) [104], SFT is marginally modified for CSBSR as follows. The detail of the modified SFT layer is shown in Fig. 4. In the original SFT layer, conditions are directly fed into conv layers for producing conditioning features (which are depicted by red and yellow 3D boxes, respectively, in Fig. 4) for scaling and shifting. Different from this original SFT layer, target features (“Segmentation features” in Fig. 4) are concatenated to the conditions. It is empirically validated that this concatenation process slightly improves the segmentation quality.
3.5 Training Strategy
Our joint learning has several loss functions, weights, and hyper-parameters. They should be properly used for training our complex network consisting of the SR network $S$ and the segmentation network $C$.
Step 1:
As with most tasks each of which has a limited amount of training data, the SR network is pre-trained with general huge datasets for blind SR.
Step 2:
With a dataset for crack segmentation, only the SR network is initially finetuned with the SR loss in Eq. (2).
Step 3:
The whole network is finetuned with the task weight in Eq. (2) fixed to a constant.
4 Experimental Results
4.1 Pre-training and Training Details
For pre-training the SR network, 3,450 images in the DIV2K dataset [6] (800 images) and the Flickr2K dataset [100] (2,650 images) were used. The whole network for crack segmentation was not pre-trained, but its feature extractor was pre-trained on ImageNet [28].
For pre-training (i.e., Step 1 in Sec. 3.5) and finetuning and (i.e., Steps 2 and 3), an image patch fed into each network is randomly cropped with vertical and horizontal flips from each training image for data augmentation. This patch is regarded as a HR image (). From , its LR images () are generated with various blur kernels () and bicubic downsampling (), as expressed in Eq. (1). is randomly sampled from the anisotropic 2D Gaussian blurs with variance and angle . The kernel size is pixels. The HR-LR downscaling factor is . The feature extractor of is pre-trained depending on the segmentation network as follows. For U-Net and PSPNet, VGG-16 is provided by torchvision [4]. For HRNet+OCR, the authors’ model [113] is used.
For pre-training of in Step 1, the number of iterations is 200,000. The minibatch size is six. Adam [58] is used as an optimizer with . The learning rate is .
The number of iterations is 30,000 and 150,000 in Steps 2 and 3, respectively. The minibatch size and the optimizer are equal to those in the aforementioned pre-training. The learning rate is .
4.2 Synthetically-degraded Crack Images
4.2.1 Training
For experiments shown in Secs. 4.2 and 4.3, the Khanhha dataset [56] was used to finetune the whole network for CSBSR (i.e., the SR and segmentation networks). This dataset consists of the CRACK500 [120], GAPs [30], CrackForest [91], AEL [9], cracktree200 [127], DeepCrack [71], and CSSC [108] datasets. As shown in the sample images of these datasets (Fig. 5), the Khanhha dataset is challenging because a variety of structures are observed and the properties of annotated cracks differ between the elemental datasets [120, 30, 91, 9, 127, 71, 108]. In the Khanhha dataset, the image size is 448 × 448 pixels, which is regarded as a HR image in our experiments. The dataset has 9,122, 481, and 1,695 training, validation, and test images, respectively. These training and test sets were used as training images for all experiments and test images in experiments shown in Sec. 4.2, respectively.
4.2.2 Evaluation Metrics
Each SR image is evaluated with PSNR and SSIM [106]. Each segmentation image is evaluated with Intersection over Union (IoU). While IoU is computed in a binarized image, the output of CSBSR is a segmentation image in which each pixel has a probability of being a crack or not a crack. Since IoU differs depending on a threshold for binarization, the threshold for each method is determined so that the mean IoU over all test images is maximized. This maximized IoU is called IoUmax. For evaluation independently of thresholding, IoUs are averaged over thresholds (AIU [107]). While IoU is a major metric for segmentation, it is inappropriate for evaluating fine thin cracks because a slight displacement makes IoU significantly small even if the structures of ground-truth and estimated cracks are almost the same. For appropriately evaluating such similar cracks, 95% Hausdorff Distance (HD95) [27] is employed. As with IoU, the HD95 threshold for each method is also determined so that the mean HD95 over all test images is minimized. This minimized HD95 is called HD95min. For evaluation independently of thresholding, HD95s are also averaged over thresholds. This averaged HD95 is called AHD95.
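The threshold-dependent evaluation described above can be sketched as follows: the mean IoU over all test images is computed for each candidate threshold, and both its maximum (IoUmax) and its average over thresholds (AIU) are reported. The set of thresholds, the function names, and the convention that an empty union scores 1.0 are assumptions of this illustration, not details taken from the paper.

```python
def iou(prob, gt, threshold):
    """IoU of a flattened probability map binarized at `threshold`."""
    pred = [p >= threshold for p in prob]
    inter = sum(1 for a, g in zip(pred, gt) if a and g)
    union = sum(1 for a, g in zip(pred, gt) if a or g)
    return inter / union if union else 1.0  # assumed convention for empty masks

def iou_max_and_aiu(prob_maps, gt_maps, thresholds):
    """Return (IoUmax, AIU) computed from the per-threshold mean IoU."""
    mean_ious = []
    for t in thresholds:
        scores = [iou(p, g, t) for p, g in zip(prob_maps, gt_maps)]
        mean_ious.append(sum(scores) / len(scores))
    return max(mean_ious), sum(mean_ious) / len(mean_ious)

# One four-pixel test image whose first two pixels are crack.
probs = [[0.9, 0.6, 0.2, 0.1]]
gts = [[1, 1, 0, 0]]
iou_max, aiu = iou_max_and_aiu(probs, gts, thresholds=[0.25, 0.5, 0.75])
print(iou_max, aiu)
```

A method whose AIU stays close to its IoUmax is insensitive to the choice of threshold. HD95min and AHD95 follow the same pattern with a minimum in place of the maximum.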
4.2.3 Comparison with SOTA segmentation methods
For comparative experiments, 1,695 HR test images in the Khanhha dataset are degraded to their LR images in the same manner as training image generation.
For validating the wide applicability of CSBSR, four SoTA segmentation networks (i.e., PSPNet [124] for Table 2 (e), HRNet+OCR [112] for Table 2 (g), CrackFormer [69] for Table 2 (i), and U-Net [87] for Table 2 (k)) are used as a segmentation network in CSBSR, as described in Sec. 3.1. While CSBSR is trained in a joint end-to-end manner (i.e., (e), (g), (i), (k) in Table 2), the results of independent blind SR and segmentation networks (i.e., (d), (f), (h), (j) in Table 2) are also shown for comparison. To focus on the difference between the network architectures for segmentation, all of these segmentation networks are trained with our BC loss in Eq. (3). The two hyper-parameters of BC loss were determined empirically. The task weight in Eq. (2) is determined empirically for each method and fixed during Step 3 in the training strategy (Sec. 3.5).
In addition, CSBSR is compared with SOTA methods in which non-blind SR and segmentation are used (i.e., Table 2 (b) SrcNet [12] in which SR and segmentation are trained independently and Table 2 (c) DSRL [102] in which SR and segmentation are trained in a multi-task learning manner). The segmentation network of SrcNet and DSRL is trained with the BCE loss. While SrcNet is implemented by ourselves because its code is not available, we used the publicly-available implementation of DSRL [3].
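Quantitative Results: Table 2 shows quantitative results. In all segmentation metrics, all variants of CSBSR are better than their original segmentation methods. That is, (e), (g), (i), and (k) are better than (d), (f), (h), and (j), respectively, in Table 2. As a result, among the methods that accept LR images, CSBSR is the best in all segmentation metrics (i.e., IoU, AIU, HD95, and AHD95). In the SR metrics, on the other hand, joint learning does not always improve PSNR and SSIM (e.g., (g) and (i) are lower than (f) and (h), respectively).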
Our proposed methods are also compared with SoTA segmentation methods using SR (i.e., (b) and (c) in Table 2). The performance improvement of CSBSR compared to SrcNet might be acquired by BC loss, joint learning, and/or blind SR. In comparison between CSBSR and DSRL, we can see the effectiveness of serial joint learning, as well as BC loss and blind SR.
Even in comparison with (a) segmentation in HR images (implemented by PSPNet with BC loss), the segmentation scores of CSBSR get close to those of segmentation in HR. For example, IoU and AIU of CSBSR with PSPNet are 93.0% and 98.7% of those of segmentation in HR. In terms of HD95, on the other hand, CSBSR is much inferior to segmentation in HR (20.92 vs. 6.20). This reveals that CSBSR should be improved further in order to extract fine crack structures.
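The ratios quoted in this paragraph can be checked directly against rows (a) and (e) of Table 2. The short script below only repeats that arithmetic; the dictionary layout and the helper name are illustrative.

```python
# Scores copied from Table 2: (a) segmentation in HR and (e) CSBSR w/ PSPNet.
hr_scores = {"IoU": 0.616, "AIU": 0.559, "HD95": 6.20}
csbsr_scores = {"IoU": 0.573, "AIU": 0.552, "HD95": 20.92}

def percent_of_hr(metric):
    """CSBSR score as a percentage of the HR score for the given metric."""
    return 100.0 * csbsr_scores[metric] / hr_scores[metric]

for metric in ("IoU", "AIU"):
    print(f"{metric}: {percent_of_hr(metric):.1f}% of the HR score")

# HD95 is a distance (lower is better), so the ratio is reported as a factor.
print(f"HD95: {csbsr_scores['HD95'] / hr_scores['HD95']:.1f}x the HR value")
```

The overlap metrics reach 93.0% and 98.7% of the HR scores, while HD95 is roughly 3.4 times larger than in HR. This gap is why the boundary-level quality is singled out as the remaining weakness.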
The IoU and HD95 scores of our proposed method with CSBSR are shown in Fig. 7. For comparison, our method with non-blind SR (i.e., CSSR) and SOTA segmentation methods using SR are compared with CSBSR. As an upper bound, the scores of segmentation on ground-truth HR images are also shown as (a) in Fig. 7, while LR images are fed into all other methods (b), (c), (d), (e), and (f) in Fig. 7.
It can be seen that (b) SrcNet and (c) DSRL are clearly inferior to others in both IoU and HD95. In particular, the scores of DSRL are significantly changed depending on a change in the threshold. This reveals that DSRL is sensitive to a change in the threshold. The scores of all other methods accepting LR images are close to those of (a) segmentation in HR images. In particular, (f) CSBSR can get higher scores in a wide range of the thresholds. This stability against a change in the threshold is crucial in applying CSBSR to a variety of segmentation tasks.
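The threshold analysis above can be sketched as a sweep over binarization thresholds. The sketch assumes that IoUmax is the maximum IoU over the tested thresholds and that AIU is the mean IoU over them; these readings, the function names, and the convention of returning 1.0 for an empty union are assumptions, not definitions taken from this section.

```python
def iou_at_threshold(prob, gt, t):
    """IoU of the crack class after binarizing probabilities with threshold t."""
    inter = union = 0
    for p_row, g_row in zip(prob, gt):
        for p, g in zip(p_row, g_row):
            is_crack = p >= t
            inter += int(is_crack and g == 1)
            union += int(is_crack or g == 1)
    # Assumed convention: two empty masks agree perfectly.
    return inter / union if union else 1.0

def summarize_iou(prob, gt, thresholds):
    """Return the IoU curve with its maximum and its mean over thresholds."""
    curve = [iou_at_threshold(prob, gt, t) for t in thresholds]
    return {"curve": curve, "IoU_max": max(curve), "AIU": sum(curve) / len(curve)}
```

A method whose curve stays flat across thresholds has an AIU close to its IoUmax, which is one way to quantify the stability discussed above.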
Visual Results: Figure 7 shows visual results. In the upper row, from left to right, the first and second images are an input LR image (enlarged by nearest neighbor interpolation) and its ground-truth HR image. The remaining three images are SR images of SrcNet, DSRL, and CSBSR. It can be seen that the SR image of CSBSR is much sharper than those of SrcNet and DSRL. In terms of the crack segmentation image also, CSBSR outperforms SrcNet and DSRL.
Figure 8 shows examples of more complex cracks. Because such complex cracks are difficult to detect correctly, even segmentation methods using SR reconstruction (i.e., SrcNet [12] and DSRL [102]) miss many crack pixels, as shown in Fig. 8 (d) and (e). As shown in Fig. 8 (f), on the other hand, our CSBSR can obtain crack segmentation images that are similar to the corresponding segmentation images obtained from the original HR images shown in Fig. 8 (c). It can also be seen that CSBSR can reconstruct thin, fine cracks in the SR image and detect them in the segmentation image. As a result, our results are similar to the ground-truth segmentation images shown in Fig. 8 (b).
Figure 9 shows examples where (f) the SR segmentation image obtained by CSBSR is better even than (c) the HR segmentation image obtained in the ground-truth HR image. These images are characterized by low image-contrast around crack pixels, thin cracks, and/or local illumination change around crack pixels.
We interpret that, while it is difficult for SR to reconstruct and for segmentation to detect such high-frequency structures and low-contrast structures shown in Figs. 8 and 9, our joint learning of SR and segmentation with the segmentation-aware SR loss and the blur skip for blur-reflected segmentation learning can achieve these difficult tasks.
Figure 10 shows sample test images where no crack pixels are observed. While there are no crack pixels in these images, observed masonry joints tend to be false-positives. For real applications using automatic image inspection, it is important to successfully suppress such false-positives for avoiding false alarms because most images have no crack pixels in real buildings. In Fig. 10, it can be seen that (d) SrcNet and (e) DSRL detect false-positives around the masonry joints, while (f) CSBSR successfully neglects all of these masonry joint pixels.
4.2.4 Effects of the Task Weight
Table 3 shows the evaluation results obtained in accordance with changes in the task weight. In all metrics of both SR and segmentation tasks, CSBSR outperforms CSSR. Furthermore, in both CSSR and CSBSR, our proposed joint learning acquires better results in all segmentation metrics.
More specifically, in terms of the segmentation results, IoUmax and AIU change little with the task weight. On the other hand, the best HD95min and AHD95 scores are obtained by the training strategy that increases the task weight during training (i.e., “Increasing” in the table), with a larger margin over the scores obtained with any fixed task weight. Intuitively, the segmentation score should be best when the segmentation loss in Eq. (2) is fully weighted. We interpret that the segmentation scores are not best in this setting because it is difficult to fully optimize the whole network directly from the pre-trained SR and segmentation networks. That is why the training strategy with an increasing task weight is better than this fixed setting.
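The difference between a fixed and an increasing task weight can be illustrated as follows. The sketch assumes that Eq. (2) is a convex combination of the SR loss and the segmentation loss controlled by a single weight, here called `beta`, and that the "Increasing" strategy raises it linearly; both the form of the combination and the linear schedule are assumptions, and all names are hypothetical.

```python
def joint_loss(sr_loss, seg_loss, beta):
    """Assumed form of the joint objective: (1 - beta) * L_SR + beta * L_seg."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return (1.0 - beta) * sr_loss + beta * seg_loss

def increasing_beta(step, total_steps, beta_start=0.0, beta_end=1.0):
    """Hypothetical linear schedule from beta_start to beta_end over training."""
    frac = step / max(total_steps - 1, 1)
    frac = min(max(frac, 0.0), 1.0)  # clamp to [0, 1]
    return beta_start + (beta_end - beta_start) * frac

# Early steps are dominated by the SR loss, late steps by the segmentation loss.
for step in (0, 5, 10):
    beta = increasing_beta(step, total_steps=11)
    print(step, beta, joint_loss(sr_loss=2.0, seg_loss=4.0, beta=beta))
```

With such a schedule the network starts close to the pre-trained SR objective and moves gradually toward the segmentation objective, which is consistent with the interpretation given above.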
In terms of the SR image quality, while the best SSIM is acquired without joint learning, the best PSNR is acquired with joint learning. Since the SR network without joint learning is trained just to improve SR, one would expect the best SR results to be obtained without joint learning. This expectation is not met, probably because the feature extractor is augmented through the training of the segmentation task. As in multi-task learning, the features can be marginally augmented also for SR if the task weight is smaller, while the features are optimized for the segmentation task if the task weight is larger.
4.2.5 Effects of Segmentation losses
To verify the effectiveness of our BC and GBC losses, CSBSR is also trained with other losses for class-imbalanced segmentation (i.e., WCE [26], Dice [77], Combo [98], and GDice [97]). As shown in Table 4, BC loss gets the best scores in four metrics (i.e., IoU, AIU, AHD95, and PSNR) and the second-best in HD95. Although it ranks third in SSIM, the gap from the best is tiny (0.705 vs 0.703).
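For reference, two of the compared loss families can be sketched in their generic textbook forms: a soft Dice loss and a class-weighted binary cross-entropy. These are not necessarily the exact variants of [77] and [26], and the BC loss of Eq. (3) is not reproduced here; the parameter names `eps`, `w_pos`, and `w_neg` are assumptions.

```python
import math

def dice_loss(prob, gt, eps=1e-6):
    """Soft Dice loss on flattened crack probabilities and binary labels."""
    inter = sum(p * g for p, g in zip(prob, gt))
    denom = sum(prob) + sum(gt)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def weighted_bce(prob, gt, w_pos=1.0, w_neg=1.0, eps=1e-7):
    """Binary cross-entropy with separate weights for crack and background pixels."""
    total = 0.0
    for p, g in zip(prob, gt):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= w_pos * g * math.log(p) + w_neg * (1 - g) * math.log(1.0 - p)
    return total / len(prob)
```

Raising `w_pos` penalizes missed crack pixels more strongly, which is the usual way a weighted cross-entropy counters the scarcity of crack pixels, whereas Dice is driven by the overlap ratio and is less affected by the large background area.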
Figure 11 shows the IoU and HD95 scores as the threshold for binarizing the segmentation image changes. As shown in Table 4, GBC is inferior to BC. However, GBC gets higher scores over a large range of thresholds in both IoU and HD95. This property might come from GDice, which is included in GBC and is robust to class imbalance. On the other hand, while WCE gets better results in a few metrics in Table 4, its performance drops significantly depending on the threshold. This performance drop makes it difficult to apply WCE loss to a variety of scenarios. As with GBC, the curves of BC also decrease only slightly.
Based on the aforementioned observations, we conclude that our BC and GBC losses are superior to other SOTA losses in terms of the max performance (as shown in Table 4) and stability (as shown in Fig. 11).
4.2.6 Effects of Segmentation-aware SR-loss Weights
The effects of the additional weights given to the SR loss, which are proposed in Sec. 3.3, are evaluated in Table 5. Since some of these weights have hyperparameters, the best results among the tested values are shown in Table 5. We make the following observations:
- All weights given to the SR loss improve HD95.
- Conversely, all weights given to the SR loss decrease IoU and AIU, although the performance drops are not significant. In particular, the IoU and AIU provided by one of these weights are almost equal to those of the baseline CSBSR (i.e., 0.573 vs 0.573 in IoU and 0.551 vs 0.552 in AIU).
- When a weight is instead applied to the segmentation loss, the results are inferior to the baseline in most metrics, as shown in the bottom row of Table 5.
In addition to the quantitative comparison shown in Table 5, Fig. 12 visually shows the effect of the FO weight. All images are the results obtained with . In the left part of Fig. 12, we can see that the FO weight allows CSBSR to detect thin crack pixels in segmentation images. To show the SR image enhancement by the FO weight, zoom-in images of several regions in the SR images are shown in the right part of Fig. 12.
In (c) images obtained without the FO weight, detected crack pixels are broken. In (d) images obtained with the FO weight, on the other hand, cracks are detected more continuously, though it is difficult to see any significant difference between the zoom-in SR images shown in (c’) and (d’). Conversely, background textures enclosed by the purple dashed ellipse are falsely detected by CSBSR without the FO weight, as shown in (c) of the lower example. However, these background pixels reconstructed by CSBSR without and with the FO weight (enclosed by the purple dashed ellipses in (c’) and (d’)) are also almost the same as each other. These results demonstrate the effectiveness of the FO weight for discriminating between remarkably similar crack and background pixels in the segmentation network of CSBSR.
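To make the idea of a segmentation-aware SR-loss weight concrete, the following is a purely illustrative sketch of a per-pixel weighted L1 SR loss in which pixels with larger segmentation error receive larger weights. It is not the weight definition of Sec. 3.3; the weight form `1 + gamma * |seg_pred - seg_gt|` and the parameter `gamma` are hypothetical choices made for this example.

```python
def weighted_l1_sr_loss(sr, hr, seg_pred, seg_gt, gamma=1.0):
    """Mean L1 SR error with hypothetical segmentation-error-based pixel weights.

    All inputs are flattened lists of equal length with values in [0, 1].
    """
    total = 0.0
    for s, h, p, g in zip(sr, hr, seg_pred, seg_gt):
        weight = 1.0 + gamma * abs(p - g)  # emphasize pixels segmented wrongly
        total += weight * abs(s - h)
    return total / len(sr)
```

With `gamma=0` the loss reduces to a plain mean L1 error, so the parameter controls how strongly segmentation failures steer the SR network.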
4.2.7 Effects of Blur Skip
The effects of the proposed blur skip process are shown in Table 6. Since the quality of the estimated kernel is high enough (e.g., above 50 dB in PSNR), our kernel skip should have the potential to support the segmentation task. While the single usage of the blur skip cannot work well for all metrics, the blur skip used with improves HD95 and AHD95. The typical examples are shown in Fig. 13. While the results without the blur skip are much inferior to their ground truths, the blur skip can improve the performance, as shown in the rightmost image in Fig. 13.
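The kernel quality quoted above is measured in PSNR, which can be sketched as follows. The peak value of 1.0 and the flattened toy kernels are assumptions for illustration; how the peak is chosen for blur kernels in the actual evaluation is not stated in this section.

```python
import math

def psnr(estimate, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB between two flattened arrays."""
    mse = sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)
    if mse == 0.0:
        return float("inf")  # identical inputs
    return 10.0 * math.log10(peak ** 2 / mse)

# Toy 1D kernels: the estimate deviates from the reference by 0.001 in two taps.
reference_kernel = [0.0, 1.0, 0.0]
estimated_kernel = [0.0, 0.999, 0.001]
print(f"{psnr(estimated_kernel, reference_kernel):.2f} dB")
```

Because PSNR is logarithmic, a score above 50 dB with a peak of 1.0 corresponds to a mean squared error below 1e-5.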
4.3 Crack Images with Real Degradations
For experiments with real images, we captured 809 wall images ( pixels) with a flying drone (DJI MAVIC MINI). This dataset includes out-of-focus images as well as motion-blurred images. By using all the images in this dataset as test images, we visually verify the effectiveness of CSBSR for realistically-blurred images. Since it is essentially difficult to annotate severely-blurred cracks correctly, only qualitative comparison is done with this dataset.
In the first row of Fig. 14, cracks are very thin. DSRL and SrcNet cannot detect any crack pixels. In addition, false-positive cracks (enclosed by yellow ellipses) are detected. CSBSR, on the other hand, can detect most crack pixels, as depicted by superimposed red pixels.
The second row of Fig. 14 shows the segmentation results detected on the image of complex cracks observed on a building wall. While DSRL detects no crack pixels, SrcNet and CSBSR successfully detect several crack pixels. CSBSR can detect more true-positive crack pixels, in particular, along a crack located in the upper part of the image (enclosed by blue ellipses). However, there are also many false-negative crack pixels (enclosed by green ellipses) even in the segmentation image of CSBSR.
In the input image shown in the third row of Fig. 14, there are thin electrical wires as well as thin cracks (enclosed by blue and green ellipses). A crack segmentation method is required to detect only real cracks without being disturbed by the wires. DSRL detects several wire pixels (enclosed by the yellow ellipse) and crack pixels, while SrcNet detects nothing. While CSBSR detects only crack pixels, even CSBSR fails to detect blurry cracks observed in the lower part of the image (enclosed by green ellipses).
As mentioned above, while our CSBSR outperforms SOTA segmentation methods using SR, it also fails to detect severely-degraded cracks. Improving crack segmentation in such severely-degraded images is important for future work.
5 Concluding Remarks
This paper proposes an end-to-end joint learning network consisting of blind SR and segmentation networks. Blind SR allows us to apply the proposed method to realistically-blurred images. The information exchange between the SR and segmentation networks (i.e., segmentation-aware SR-loss weights and blur skip for blur-reflected task learning) enables further improvement. For better segmentation in class-imbalance fine crack images, BC loss is proposed.
Future work includes quantitative evaluation on real-image datasets in which ground-truth segmentation pixels are manually given. It is also interesting to apply CSBSR to other segmentation tasks such as medical imaging. An essential difficulty in SR is that SR is an ill-posed problem in which a larger number of pixels are reconstructed from a smaller number of pixels. In order to relieve this difficulty, multiple LR images are used as a set of input images in video SR [34, 80, 46, 42] and burst SR [16]. Our proposed method can also be extended to the one with time-series images.
5.1 Acknowledgments
This work was partly supported by JSPS KAKENHI Grant Numbers 19K12129 and 22H03618.
6 Biography Section
Yuki Kondo received the bachelor's degree in engineering from Toyota Technological Institute in 2022. Currently, he is a researcher with Toyota Technological Institute. His research interests include low-level vision, including image and video super-resolution, and its application to tiny image analysis such as crack detection. His awards include the best practical paper award in MVA2021.
Norimichi Ukita received the B.E. and M.E. degrees in information engineering from Okayama University, Japan, in 1996 and 1998, respectively, and the Ph.D. degree in Informatics from Kyoto University, Japan, in 2001. From 2001 to 2016, he was an assistant professor (2001–2007) and an associate professor (2007–2016) with the graduate school of information science, Nara Institute of Science and Technology, Japan. In 2016, he became a professor with Toyota Technological Institute, Japan. He was a research scientist of Precursory Research for Embryonic Science and Technology, Japan Science and Technology Agency, during 2002–2006, and a visiting research scientist at Carnegie Mellon University during 2007–2009. Currently, he is also an adjunct professor at Toyota Technological Institute at Chicago. Prof. Ukita’s awards include the excellent paper award of IEICE (1999), the winner award in NTIRE 2018 challenge on image super-resolution, the 1st place in PIRM 2018 perceptual SR challenge, the best poster award in MVA2019, and the best practical paper award in MVA2021.
- [1] Crack segmentation. https://github.com/khanhha/crack_segmentation.
- [2] Crackformer-ii. https://github.com/LouisNUST/CrackFormer-II.
- [3] Dual super-resolution learning for semantic segmentation. https://github.com/Dootmaan/DSRL.
- [4] Torchvision.models. https://pytorch.org/vision/stable/models.html.
- [5] Allen Zhang, Kelvin C. P. Wang, Yue Fei, Yang Liu, Siyu Tao, Cheng Chen, Joshua Q. Li, and Baoxian Li. Deep learning–based fully automated pavement crack detection on 3D asphalt surfaces with an improved CrackNet. Journal of Computing in Civil Engineering, 32(5), 2018.
- [6] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In CVPRW, 2017.
- [7] Kazutoshi Akita, Muhammad Haris, and Norimichi Ukita. Region-dependent scale proposals for super-resolution in object detection. In IPAS, 2020.
- [8] Kazutoshi Akita, Masayoshi Hayama, Haruya Kyutoku, and Norimichi Ukita. AVM image quality enhancement by synthetic image learning for supervised deblurring. In MVA, 2021.
