Deep Learning-Based Semantic Segmentation of Microscale Objects

Ekta U. Samani, Wei Guo, and Ashis G. Banerjee

arXiv:1907.03576·eess.IV·July 9, 2019

TL;DR

This paper introduces a deep learning model for semantic segmentation of microscale objects in crowded environments, achieving high accuracy and improving automated manipulation techniques like optical tweezers.

Contribution

The paper presents a novel deep learning approach that significantly enhances segmentation accuracy in complex microscale environments compared to traditional methods.

Findings

01

Achieved a mean Intersection Over Union score of 0.91.

02

Successfully segmented crowded microscale environments.

03

Improved accuracy over traditional computer vision algorithms.

Abstract

Accurate estimation of the positions and shapes of microscale objects is crucial for automated imaging-guided manipulation using a non-contact technique such as optical tweezers. Perception methods that use traditional computer vision algorithms tend to fail when the manipulation environments are crowded. In this paper, we present a deep learning model for semantic segmentation of the images representing such environments. Our model successfully performs segmentation with a high mean Intersection Over Union score of 0.91.

Code & Models

Repositories

ektas0330/cell-segmentation
tf

Taxonomy

Topics: Medical Image Segmentation Techniques · Advanced Neural Network Applications · Cell Image Analysis Techniques

Full text

Ekta U. Samani¹, Wei Guo², and Ashis G. Banerjee³

¹ E. U. Samani is with the Department of Mechanical Engineering, University of Washington, Seattle, WA 98195, USA.
² W. Guo is with the Department of Industrial & Systems Engineering, University of Washington, Seattle, WA 98195, USA.
³ A. G. Banerjee is with the Department of Industrial & Systems Engineering and the Department of Mechanical Engineering, University of Washington, Seattle, WA 98195, USA.

I INTRODUCTION

Optical tweezers are widely used for non-contact manipulation of objects at the microscale by grasping them using tightly focused laser beams. Accurate, real-time estimation of the states (locations, sizes, etc.) of all the objects in the environment is necessary to automate the manipulation process. These states are typically estimated from low-contrast, bright-field images obtained using a charge-coupled device camera. A perception method that combines contrast enhancement, edge detection, and convolutional neural networks is proposed in [1]. The method performs reasonably well in environments where the number of objects is limited, and successfully estimates the individual positions of the clustered objects. However, it encounters challenges when a large number of objects are present in close proximity to each other. This paper provides a step toward addressing this challenge by presenting a deep learning-based semantic segmentation method that is capable of estimating not just the positions but also the shapes of all the objects in crowded environments. Such a capability would be useful in developing a complete situational awareness of the manipulation environments, thereby paving the way for robust motion planning and control methods.

II METHODOLOGY

We use the images from [1] that contain multiple silica microspheres (beads) of $5\,\mu\mathrm{m}$ diameter and human endothelial cells dispersed in Matrigel and Thrombin. Eighty such images of resolution $640\times 480$ are used for our analysis. We define three different classes in these images, namely, the background, cells, and beads. The images are labeled using LabelMe [2], a polygonal annotation tool for images. Each pixel of an image is labeled such that it belongs to one of the three defined classes. Seventy-two images are used for training and validation purposes, and eight are set aside for testing. From the training and validation images, we generate 75,000 sub-images of size $256\times 256$. Each sub-image is obtained by randomly sampling a region from a larger image and applying a randomly chosen rotation from the dihedral group. Pre-processing steps of image normalization, histogram equalization, and gamma correction are performed on all the sub-images. The deep learning model is then trained on 60,000 sub-images, and the remaining 15,000 are used for validation.
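
The sub-image sampling described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the eight dihedral elements are assumed to be the four 90-degree rotations with and without a horizontal flip, and the image and its label mask are assumed to be arrays with matching height and width.

```python
import numpy as np

def dihedral_transform(patch, k):
    """Apply the k-th element (0..7) of the dihedral group of the square."""
    rotated = np.rot90(patch, k % 4, axes=(0, 1))
    # Assumption: elements 4..7 are the four rotations followed by a horizontal flip.
    return np.flip(rotated, axis=1) if k >= 4 else rotated

def sample_sub_image(image, mask, size=256, rng=None):
    """Crop a random size x size window and apply one random dihedral element.

    The same crop and the same transform are applied to the image and its
    label mask so that pixel labels stay aligned.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    top = int(rng.integers(0, height - size + 1))
    left = int(rng.integers(0, width - size + 1))
    k = int(rng.integers(0, 8))
    img_patch = image[top:top + size, left:left + size]
    mask_patch = mask[top:top + size, left:left + size]
    return dihedral_transform(img_patch, k), dihedral_transform(mask_patch, k)
```

Calling `sample_sub_image` repeatedly over the 72 training and validation images would yield a pool of sub-images of the kind described; how the 75,000 samples are distributed across images is not stated in the text.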

We use an encoder-decoder architecture proposed in the TGS Salt Identification challenge hosted by Kaggle [3]. Our model consists of Xception [4], pre-trained on the ImageNet database, as the encoder and a ResNet-based decoder, as shown in Fig. 1. (The last decoder block does not have a concatenate layer, unlike the other decoder blocks.) Intersection Over Union (IOU) is the most commonly used metric to quantify the overlap between the ground truth labels and the predicted mask. Therefore, we use the IOU score as our performance measure, which is calculated separately for each of the three classes and averaged to give a final IOU score. Lovász-softmax loss [5] is shown to be more suitable than the more generally used cross-entropy loss for optimizing the IOU metric. Hence, we choose the multi-class Lovász-softmax loss as our loss function. Since normalized gradient methods with constant step sizes and occasional decay perform better in deep convolutional neural networks than optimizers with adaptive step sizes such as Adam [6], we use normalized stochastic gradient descent with cosine annealing and momentum as the optimizer. We choose an initial learning rate of 0.001 and the leaky rectified linear unit (ReLU) as the activation function. We do not apply any activation function to the output of our last convolutional layer, because the implementation of the Lovász-softmax loss applies a softmax activation internally to the output logits from the last layer before the loss is calculated.
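
As an illustration of the performance measure, the class-averaged IOU can be computed from two label maps as sketched below. The function name is hypothetical, the class indices 0, 1, and 2 are an assumed encoding of background, cells, and beads, and skipping a class that is absent from both maps is an assumption not stated in the text.

```python
import numpy as np

def mean_iou(pred, target, num_classes=3):
    """Average the per-class IOU between two integer label maps."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            # Assumption: a class absent from both maps is left out of the average.
            continue
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy 2x2 example: one background pixel is mislabeled as a cell.
target = np.array([[0, 0], [1, 2]])
pred = np.array([[0, 1], [1, 2]])
print(mean_iou(pred, target))  # per-class IOUs are 0.5, 0.5, and 1.0
```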

For testing, the original image is padded with a reflection-based border padding to obtain an image of size $768\times 512$. The padded image is divided into six non-overlapping sub-images of size $256\times 256$ that span the full image. Predictions are obtained for the six sub-images and are combined according to their respective positions to get a prediction mask for the padded image. Predictions corresponding to the border padding are discarded to obtain a final prediction mask for the original test image. We use the OpenCV library to detect the contours of the cells and beads by tracing the boundaries of the segmented regions in these final prediction masks.
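
The test-time tiling can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: `predict_fn` is a hypothetical stand-in for the trained model that maps a $256\times 256$ tile to a $256\times 256$ map of class indices, and the padding is assumed to be split evenly between opposite borders, which the text does not specify.

```python
import numpy as np

TILE = 256

def predict_tiled(image, predict_fn, padded_shape=(512, 768)):
    """Reflect-pad, predict on non-overlapping tiles, stitch, and crop back."""
    height, width = image.shape[:2]
    pad_h = padded_shape[0] - height
    pad_w = padded_shape[1] - width
    # Assumption: padding is split evenly between opposite borders.
    top, left = pad_h // 2, pad_w // 2
    pad_spec = [(top, pad_h - top), (left, pad_w - left)]
    pad_spec += [(0, 0)] * (image.ndim - 2)  # leave any channel axis unpadded
    padded = np.pad(image, pad_spec, mode="reflect")

    mask = np.zeros(padded_shape, dtype=np.int64)
    for r in range(0, padded_shape[0], TILE):
        for c in range(0, padded_shape[1], TILE):
            tile = padded[r:r + TILE, c:c + TILE]
            mask[r:r + TILE, c:c + TILE] = predict_fn(tile)

    # Discard the predictions that correspond to the border padding.
    return mask[top:top + height, left:left + width]
```

For a $640\times 480$ image stored as a 480-row by 640-column array, the loop visits a 2-by-3 grid of tiles, which matches the six sub-images mentioned above.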

We also compare the performance of our model with a fully residual convolutional neural network proposed in [7]. The network consists of a contracting path that encodes the input to high-level features and an expanding path that decodes the features to the output mask. The contracting path consists of repeated stacks of a $3\times 3$ convolution layer, a residual block, and a $2\times 2$ down-sampling layer. Before down-sampling, the feature map channels are doubled using a $1\times 1$ convolution layer. The expanding path is similar to the contracting path, but it has up-sampling layers instead of the down-sampling layers. The down-sampling layers use mean-pooling, while the up-sampling layers use bilinear interpolation. The higher resolution feature maps from the contracting path are concatenated with the corresponding up-sampled feature maps in the expanding path. We use the ELU activation function and residual blocks consisting of ELU-Convolution-Dropout-ELU-Convolution-Scaling, as described in [7]. We use the Adadelta optimizer with a learning rate of 0.0001. The network was originally designed for structured regression and therefore uses a generalized version of the weighted square error loss function. To perform semantic segmentation using this network, we replace this loss function with the Lovász-softmax loss function, and we use the IOU score as the performance measure. We also change the last convolutional layer of the network to obtain three feature map channels, one for each class, in the output.
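
To make the $2\times 2$ mean-pooling down-sampling step concrete, a minimal NumPy sketch is given below. The function name is hypothetical, the feature map is assumed to be laid out as height by width with optional trailing channel axes, and odd trailing rows or columns are simply dropped, which is an assumption; the actual network would use a framework layer instead.

```python
import numpy as np

def mean_pool_2x2(feature_map):
    """Halve the spatial resolution by averaging each 2x2 block."""
    h, w = feature_map.shape[:2]
    # Assumption: drop a trailing row or column when a dimension is odd.
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2, *feature_map.shape[2:])
    return blocks.mean(axis=(1, 3))
```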

III IMPLEMENTATION AND RESULTS

All training and testing are done on a workstation running the Windows 10 operating system, equipped with a 3.7 GHz 8-core Intel Xeon W-2145 CPU, a ZOTAC GeForce GTX 1080 Ti GPU, and 64 GB of RAM. Our model converges to a validation IOU score of 0.97 after 16 epochs with a batch size of 8 in 48.41 hours. We obtain an IOU score of $0.91\pm 0.02$ on the eight test images. Fig. 2 shows a typical test image with multiple cells and beads. Fig. 3 shows the corresponding predicted mask, where violet corresponds to the background, red to the cells, and green to the beads. Fig. 3 also shows the detected contours overlaid on the original test image, with cell boundaries in red and bead boundaries in green. The model accurately segments all the cells and beads from the background. It also successfully differentiates the beads that are stuck to the cell. Most of the mislabeled pixels belong to barely visible objects on the border.

The fully residual convolutional network achieves a validation IOU score of 0.73 after 19 epochs with a batch size of 8 in 57.48 hours. We obtain an IOU score of $0.39\pm 0.15$ on the eight test images. We observe high bias in the performance on the training set, indicating that training for a longer duration may improve the predictions. Moreover, it points towards the need for a deeper network or a different architecture. We also observe high variance in the performance on the validation set despite extensive use of dropout in the network. Use of a pre-trained encoder, as in our proposed model, can help alleviate this problem to some extent. Fig. 4 shows the predicted mask and the detected contours corresponding to the test image in Fig. 2. We observe from the segmentation mask in Fig. 4 that the model is somewhat able to identify the shapes of the objects, but it is unable to differentiate between the cells and the beads. Therefore, all the detected contours in Fig. 4 are incorrectly identified as bead boundaries.

Fig. 5 shows another test image with multiple cells close together in a polymerizing Matrigel medium. Fig. 6 shows the contours detected from the segmentation output of the proposed model. We observe that the model accurately detects all the cell boundaries in the continuously changing medium. Fig. 7 shows the performance of the fully residual convolutional neural network on the same test image. The network fails to recognize whether the segmented regions are cells or beads. This results in incorrect identification of the detected boundaries.

IV CONCLUSIONS

We present a deep learning model for accurate pixel-based semantic segmentation of micro-scale objects (cells and beads) in crowded environments. We also compare the performance of our model with a fully residual convolutional network. Our model accurately segments all the objects present in different environments. It also successfully distinguishes objects belonging to different classes that are clustered together. In the future, we plan to employ another deep learning model to segment the individual instances of the bead and cell classes.

Bibliography

[1] K. Rajasekaran, E. Samani, M. Bollavaram, J. Stewart, and A. Banerjee, “An accurate perception method for low contrast bright field microscopy in heterogeneous microenvironments,” Appl. Sci., vol. 7, no. 12, p. 1327, 2017.
[2] Wkentaro, “GitHub - wkentaro/labelme,” Apr 2019. [Online]. Available: https://github.com/wkentaro/labelme
[3] “TGS salt identification challenge.” [Online]. Available: https://www.kaggle.com/c/tgs-salt-identification-challenge
[4] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1251–1258.
[5] M. Berman, A. Rannen Triki, and M. B. Blaschko, “The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4413–4421.
[6] A. W. Yu, L. Huang, Q. Lin, R. Salakhutdinov, and J. Carbonell, “Block-normalized gradient method: An empirical study for training deep neural network,” arXiv preprint arXiv:1707.04822, 2017.
[7] Y. Xie, F. Xing, X. Shi, X. Kong, H. Su, and L. Yang, “Efficient and robust cell detection: A structured regression approach,” Med. Image Anal., vol. 44, pp. 245–254, 2018.