Data Selection for training Semantic Segmentation CNNs with cross-dataset weak supervision

Panagiotis Meletis; Rob Romijnders; Gijs Dubbelman

arXiv:1907.07023·cs.CV·July 17, 2019

TL;DR

This paper introduces two data selection methods for training semantic segmentation CNNs with weak supervision, significantly reducing the amount of data needed while maintaining performance, especially in automated driving datasets.

Contribution

It presents novel data selection techniques based on image similarity and object diversity, improving training efficiency for weakly supervised semantic segmentation.

Findings

01

Performance gains by reducing weakly labeled data up to 100 times for Open Images.

02

Effective data selection improves segmentation accuracy in automated driving datasets.

03

Insights into data distribution characterization through GMM modeling.

Tables

Table I: Performance (mIoU) on Cityscapes Dense classes that receive extra supervision from Open Images.

	# of selected images
Method of selection	1k (0.1%)	10k (1%)	20k (2%)	100k (10%)
random	67.05	67.68	68.51	67.88
heuristics	65.97	67.45	68.01	68.88
GMM	68.67	68.52	68.88	69.00
heuristics + GMM	68.03	68.92	69.15	69.23

Table II: Detailed per-class IoU for the GMM selection method using weakly labeled data from Open Images. Only the 8 of the 19 Cityscapes classes that receive extra supervision are shown.

# of images	car	truck	bus	train	motorcycle	bicycle	person	rider
1k	92.2	68.2	76.9	71.2	50.7	67.5	71.7	51.0
10k	92.2	69.7	79.3	65.6	48.2	67.7	71.6	51.2
20k	92.5	73.1	79.9	60.9	53.3	67.7	71.8	52.0
100k	92.4	69.6	78.8	67.8	51.1	68.0	71.9	52.6

Table III: Performance (mIoU) on Cityscapes Dense classes that receive extra supervision from Cityscapes Coarse.

	# of selected images
Method of selection	1k (5%)	5k (25%)	10k (50%)	20k (100%)
random	66.68	66.82	69.38	69.87
heuristics	66.61	67.58	68.77
heuristics + GMM	68.37	68.29	67.41

Table IV: Ablation on the number of GMM components for the mIoU performance, using Open Images as the weak dataset.

K components	5	20	50
BIC ($\cdot 10^{6}$) $\downarrow$	3.2	50.7	124.7
mIoU $\uparrow$	68.67	65.86	64.18

Table V: Common images selected by the two techniques.

# of selected images	1k	10k	20k	50k	100k	200k
# of common images	3	117	385	1492	4303	10704
percentage	0.3%	1.17%	1.93%	2.98%	4.3%	5.35%

Equations

$$\Phi_{i}=\left\{f_{h,w,:}(x_{i}),\ \forall h\in H,\ w\in W\right\} \tag{1}$$

$$\bar{\Phi}_{i}=\frac{1}{|\Phi_{i}|}\sum_{e\in\Phi_{i}}e \tag{2}$$

$$\log L(\Psi)=\sum_{i=1}^{N}\log\sum_{j=1}^{K}\pi_{j}\,\mathcal{N}\left(\bar{\Phi}_{i};\mu_{j},\sigma_{j}^{2}\right) \tag{3}$$

$$sim_{citys}(x_{i})=\max\log p\left(\Phi_{i};\Psi_{citys}\right) \tag{4}$$

Full text

Panagiotis Meletis, Rob Romijnders, and Gijs Dubbelman are with the Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands. This research has received funding from ECSEL JU in collaboration with the European Union’s H2020 Framework Programme and National Authorities, under grant agreement no. 783190.

Abstract

Training convolutional networks for semantic segmentation with strong (per-pixel) and weak (per-bounding-box) supervision requires a large amount of weakly labeled data. We propose two methods for selecting the most relevant data with weak supervision. The first method is designed for finding visually similar images without the need of labels and is based on modeling image representations with a Gaussian Mixture Model (GMM). As a byproduct of GMM modeling, we present useful insights on characterizing the data generating distribution. The second method aims at finding images with high object diversity and requires only the bounding box labels. Both methods are developed in the context of automated driving and experimentation is conducted on Cityscapes and Open Images datasets. We demonstrate performance gains by reducing the amount of employed weakly labeled images up to 100 times for Open Images and up to 20 times for Cityscapes.

I Introduction and Related Work

Recently, training convolutional networks on multiple datasets has been gaining attention [1, 2, 3], since it offers improved performance and better generalization than single dataset training. Multiple dataset training is especially advantageous for semantic segmentation networks, which require large amounts of training examples [4]. However, factors such as different dataset sizes, repetitive examples (low informative value), and high annotation costs hamper the effectiveness of multiple dataset training. These factors especially influence methods that employ weaker forms of supervision [5, 6, 7, 2].

The current trend to deal with the aforementioned challenges is model selection, i.e. designing, training, and tuning a convolutional network for robustness and performance. A less studied research branch, data selection [8, 9, 10], appears more appealing. In this work, we propose two data selection methods, which indicate how data should be chosen to maximize visual similarity and object diversity among the used datasets, and which are well suited for multiple dataset training. The first method employs a Gaussian Mixture Model (GMM) to model the image representations of a dataset, and the second uses predefined scoring heuristics to rank images.

Our data selection methods can be employed in cases where: 1) less data needs to be used, by selecting the most informative images, 2) less data needs to be annotated, by selecting the most similar images between labeled and unlabeled datasets, and 3) a balanced number of examples between datasets is preferred (e.g. for multiple dataset training).

In this work we focus on the problem of training a semantic segmentation model using strong (per-pixel) and weak (per-bounding-box) supervision from different datasets, and we show the benefits of the proposed data selection schemes both in performance and in decreasing the number of needed examples. Specifically, data selection allows a model to reach the same levels of performance using 10 to 100 times less weakly annotated data. Furthermore, we present results towards modeling the visual domain of a dataset and quantifying visual similarity and object diversity.

To summarize, in this work we:

•

propose a selection method, based on modeling image representations with a GMM, for finding visually similar images to a given dataset,

•

propose a selection method, based on class scoring heuristics, for finding rich labeled images,

•

apply both methods independently and jointly in weak supervision selection for semantic segmentation to reduce the amount of required training examples while increasing performance, and

•

present results towards characterizing the image domain of a dataset through GMM modeling.

Our trained convolutional networks, the GMM models, and the algorithms of the two selection methods will be made available to the research community [11].

II Problem Definition

Semantic segmentation is a pixel-level task and as such requires a large number of per-pixel labeled images, which are hard to obtain. This is especially costly in the automated driving field, where only small per-pixel labeled datasets are available. Although much larger datasets exist for tasks complementary to semantic segmentation, like object detection or classification, they are not specialized in street scenes but contain generic complex scenes. Thus, apart from the different type of annotations, we also have to deal with the domain gap between the chosen datasets, since the trained model should also generalize well. The methods proposed in the literature attempt to take advantage of these datasets with a weakly or semi-supervised learning approach.

In this paper, we assume that we have

1. a labeled dataset $C=\{(x,y)_{i},~i=1,...,N\}$ with $N$ pairs of images $x$ and per-pixel labels $y$ for semantic segmentation,

2. a dataset $O=\{(x[,z])_{i},~i=1,...,M\}$ with $M$ images $x$ and optionally $M$ weak labels $z$ from the perspective of semantic segmentation (e.g. bounding boxes or image-level labels), where $M\gg N$, and

3. a convolutional network model that can be trained on strong and weakly labeled datasets,

and we seek a methodology for selecting images from $O$ that are visually similar to $C$ and informative enough for $y$.

The data selection problem requires selecting images from $O$ that are as visually similar to the images in $C$ as possible. This is desired since the domain gap, which can be seen as the dissimilarity in image content and appearance, may hinder training in the multiple dataset setting. At the same time, it is also preferred that the images have high informative value for the classes that we want to train for, or in other words, high object diversity.

In this work we experiment with the Cityscapes Dense subset [12] as $C$, and the Cityscapes Coarse subset and the Open Images bounding boxes subset [13] as $O$. We employ the hierarchical convolutional networks of [14], which can be trained on multiple datasets with strong and weak labels and which require the labels $z$ of dataset $O$.

III Method

In this Section we describe our two proposed data selection methods, how they can be combined, and their connection with the notions of visual similarity and object diversity. In this work, we aim at discovering visual similarity using only the images and not the associated labels, and object diversity using only the labels and not the images. We experiment on three datasets, namely Open Images, Cityscapes Dense, and Cityscapes Coarse (see Section IV-A for more details).

The goal of our methods is to select images from the weakly labeled datasets (Open Images, Cityscapes Coarse), that are visually similar and have high object diversity compared to the strongly labeled dataset (Cityscapes Dense).

III-A Gaussian Mixture Model: visual similarity

Inspired by [8, 15] our method consists of three distinct phases that are described in the following Sections. First, we use a pre-trained convolutional network to extract a low dimensional representation for each Cityscapes Dense image, then we fit a GMM to those representations, and finally we use that model to rank the images of the weakly labeled datasets. We hypothesize that images that are visually similar to Cityscapes Dense, i.e. depict street scenes, will have high probability density under the GMM and images from generic scenes, i.e. the majority of Open Images images not containing street scenes, will have low probability density.

Extracting image representations

We aim to capture the distribution of Cityscapes Dense image domain. Unfortunately, it is hard to fit probabilistic models to images in general [16], [17], and current state of the art in generative modelling of images does not assign calibrated density [18]. As such, we extract representations from a fully convolutional network trained for semantic segmentation on Cityscapes Dense. The first layers of the trained neural network will serve for the extraction of representations, as we know that initial layers of a neural network maintain information about the input images [19].

Neural networks are known to learn internal representations irrespective of the task they are trained on [20], [21]. Thus, training could involve any computer vision task, such as classification, segmentation, or detection, but we leave this for investigation in future research. In this work, we use a convolutional network to extract the image representations and we choose to train it for semantic segmentation, as this is the task where the selected images will be eventually used.

The backbone consists of a ResNet-50 [22], which we modify for semantic segmentation as in [2]. We choose to extract features from the penultimate convolutional layer, which has shape $\left(H,W,C\right)$ , and we call this subnetwork $f$ . If $x_{i}$ is the input image, the convolutional representation can be denoted as the set

$$\Phi_{i}=\left\{f_{h,w,:}(x_{i}),\ \forall h\in H,\ w\in W\right\} \tag{1}$$

where $h,w$ index all the receptive fields of the penultimate layer, corresponding to different regions of the input image $x_{i}$. In other words, we slice the output of $f$ into the set $\Phi_{i}$ containing $H\cdot W$ elements with $C$ features (depth) each.

Modeling image representations

We want to fit a probabilistic model to the low dimensional representations $\Phi_{i}$ for all images $x_{i}$ of dataset $C$ (see Section II). Such a model would assign large probability density to the representations of the modeled domain (Cityscapes Dense), and low density to images outside this domain.

Here we choose a Gaussian Mixture Model (GMM) [23] for its simplicity and explicitness. Since assigning probability densities to entire Cityscapes images would be too costly a task, we assume that for every image the set $\Phi_{i}$ contains independent and identically distributed representations and we average its elements:

$$\bar{\Phi}_{i}=\frac{1}{|\Phi_{i}|}\sum_{e\in\Phi_{i}}e \tag{2}$$
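The slicing and averaging steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the feature map is random data standing in for the output of $f$, and the function names and the toy shape (4, 8, 16) are hypothetical and not taken from the paper.

```python
import numpy as np

def slice_representations(feature_map):
    """Flatten an (H, W, C) feature map into H*W vectors of C features each."""
    h, w, c = feature_map.shape
    return feature_map.reshape(h * w, c)

def average_representation(phi):
    """Average the per-receptive-field vectors into one C-dimensional vector."""
    return phi.mean(axis=0)

# Toy stand-in for the output of f on a single image (sizes are illustrative).
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4, 8, 16))

phi_i = slice_representations(feature_map)   # the set Phi_i, one row per receptive field
phi_bar_i = average_representation(phi_i)    # the averaged representation of the image
```

With the dimensions reported later in the paper, the same code would operate on a map of 256 by 512 receptive fields with depth 256.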

The next step is to model the average representations $\bar{\Phi}_{i}$ of all images with a GMM. A GMM is a mixture of $K$ Gaussian distributions, with variable mixture coefficients $\pi_{j}$, means $\mu_{j}$, and variances $\sigma_{j}^{2}$. We group those parameters into $\Psi=\left\{\pi_{1},...,\pi_{K},\mu_{1},...,\mu_{K},\sigma_{1},...,\sigma_{K}\right\}$. The log likelihood function for $\Psi$, given the independent average representations $\bar{\Phi}_{i}$ for all $N$ images of Cityscapes Dense, can be expressed as

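$$\log L(\Psi)=\sum_{i=1}^{N}\log\sum_{j=1}^{K}\pi_{j}\,\mathcal{N}\left(\bar{\Phi}_{i};\mu_{j},\sigma_{j}^{2}\right) \tag{3}$$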

The Maximum Likelihood estimate $\Psi_{citys}$ is found using Eq. 3 and the Expectation Maximization algorithm [23].
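A minimal sketch of the fitting step with scikit-learn's `GaussianMixture` is given below. The data is synthetic, and several settings are assumptions rather than the authors' configuration: `covariance_type="diag"` is our reading of the per-component variances $\sigma_{j}^{2}$, and scikit-learn's `tol` applies to the average lower-bound gain per EM iteration, which may not match the stopping rule used in the paper exactly.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for the averaged representations: N images, C features.
phi_bar = rng.normal(size=(500, 16))

gmm = GaussianMixture(
    n_components=5,          # K, the number of mixture components
    covariance_type="diag",  # assumption: one variance per feature dimension
    tol=1e-3,                # EM stops when the lower-bound gain falls below this
    max_iter=200,
    random_state=0,
)
gmm.fit(phi_bar)             # Expectation Maximization

# Per-sample log density under the fitted mixture, and the total log likelihood.
log_density = gmm.score_samples(phi_bar)
total_log_likelihood = log_density.sum()
```

The fitted `weights_`, `means_`, and `covariances_` attributes correspond to $\pi_{j}$, $\mu_{j}$, and $\sigma_{j}^{2}$ respectively.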

Image to dataset visual similarity

We define a measure of similarity to the domain modeled by the GMM, i.e. Cityscapes Dense, so that we can rank images from other datasets. The measure is the maximum log probability under the model over all receptive fields of an image $x_{i}$:

$$sim_{citys}(x_{i})=\max\log p\left(\Phi_{i};\Psi_{citys}\right) \tag{4}$$

According to our hypothesis, the larger $sim_{citys}$ is, the more visually similar the image is to the modeled Cityscapes Dense image domain. We rank the weakly labeled images by $sim_{citys}$ in descending order and select various top portions for the experiments of Sections V-B, V-C.
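The ranking by similarity can be illustrated with a small NumPy sketch. It assumes a diagonal-covariance mixture whose parameters are already known; the function names and the toy single-component mixture are hypothetical and serve only to make the computation concrete.

```python
import numpy as np

def gmm_log_density(x, weights, means, variances):
    """Log density of each row of x under a diagonal-covariance Gaussian mixture.

    x: (n, d), weights: (K,), means and variances: (K, d).
    """
    diff = x[:, None, :] - means[None, :, :]                        # (n, K, d)
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)   # (K,)
    log_comp = log_norm[None, :] - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    # Log-sum-exp over the K components, weighted by the mixture coefficients.
    return np.logaddexp.reduce(np.log(weights)[None, :] + log_comp, axis=1)

def similarity(phi, weights, means, variances):
    """Maximum log density over all receptive-field vectors of one image."""
    return gmm_log_density(phi, weights, means, variances).max()

def rank_by_similarity(phis, weights, means, variances):
    """Return image indices sorted by descending similarity, plus the scores."""
    scores = np.array([similarity(p, weights, means, variances) for p in phis])
    return np.argsort(-scores), scores

# Toy mixture: a single standard Gaussian in two dimensions.
weights = np.array([1.0])
means = np.zeros((1, 2))
variances = np.ones((1, 2))

far = np.array([[5.0, 5.0], [6.0, 6.0]])    # image whose features lie far from the model
near = np.array([[0.0, 0.0], [1.0, 1.0]])   # image whose features lie close to the model
order, scores = rank_by_similarity([far, near], weights, means, variances)
```

In this toy case `order` places the second image first, because one of its receptive-field vectors sits exactly at the mean of the modeled distribution.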

III-B Class score heuristics: object diversity

A training image has high object diversity when it contains a large variety and number of objects of interest. In the context of automated driving we define three categories of objects, namely traffic objects (traffic signs and traffic lights), vehicles (car, truck, bus, motorcycle, bicycle, train), and humans (pedestrian, rider), and we assign a score to each category. These scores are defined by empirical tests and manual inspection of the images, and they depend on each dataset. The general intuition behind scoring is that traffic objects are most likely to appear only in street scenes, while vehicles and humans can appear in a variety of other scenes.

For Open Images, all the above categories are labeled instance-wise and we assign 100, 10, and 1 points to them respectively. For Cityscapes, traffic objects are not labeled instance-wise, so we assign weights only to the last two categories, 10 and 1 respectively. For each image, the scores of all labeled objects are accumulated into a total score. The images are ranked according to this score and different top portions are selected for the experiments of Sections V-B, V-C.
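The scoring scheme can be sketched as follows. The point values and category membership follow the text above; the label strings, the image names, and the function names are hypothetical, and real Open Images class names would need to be mapped onto these categories.

```python
# Category membership as listed in the text; label strings are illustrative.
CATEGORY_OF = {
    "traffic sign": "traffic", "traffic light": "traffic",
    "car": "vehicle", "truck": "vehicle", "bus": "vehicle",
    "motorcycle": "vehicle", "bicycle": "vehicle", "train": "vehicle",
    "pedestrian": "human", "rider": "human",
}
POINTS_OPEN_IMAGES = {"traffic": 100, "vehicle": 10, "human": 1}
POINTS_CITYSCAPES = {"vehicle": 10, "human": 1}  # traffic objects are not scored

def image_score(box_labels, points):
    """Sum the points of every labeled object in one image."""
    total = 0
    for label in box_labels:
        category = CATEGORY_OF.get(label)    # None for classes outside the three categories
        total += points.get(category, 0)
    return total

def rank_by_score(images, points):
    """Return image names sorted by descending total score."""
    return sorted(images, key=lambda name: image_score(images[name], points), reverse=True)

# Toy annotations: one list of bounding-box labels per image.
images = {
    "a": ["car", "car", "pedestrian"],
    "b": ["traffic light"],
    "c": ["dog"],
}
ranking = rank_by_score(images, POINTS_OPEN_IMAGES)
```

Here a single traffic light outranks two cars and a pedestrian, which reflects the intuition that traffic objects are the strongest indicator of a street scene.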

III-C Combine the two selection methods

In the previous two Sections we described the two selection schemes and how they result in two rankings of the images of a dataset. In general the two rankings order the images differently, so aggregating them into one collection is not a trivial task. Since we equally prefer visual similarity and object diversity, we opt for interleaving the rankings by alternately choosing images from the two initial rankings for the final selection. If an image is already in the final selection, it is not inserted a second time.
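One possible reading of the interleaving procedure is sketched below. How a duplicate affects the turn order is not specified in the text; this sketch assumes that each ranking contributes its next not-yet-selected image on its turn. The function name and the image identifiers are hypothetical.

```python
def interleave_rankings(ranking_a, ranking_b, n_select):
    """Alternately take the next unseen image from each ranking until n_select are chosen."""
    selected, seen = [], set()
    queues = [iter(ranking_a), iter(ranking_b)]
    exhausted = [False, False]
    turn = 0
    while len(selected) < n_select and not all(exhausted):
        if not exhausted[turn]:
            for image in queues[turn]:
                if image not in seen:        # skip images already in the final selection
                    seen.add(image)
                    selected.append(image)
                    break
            else:
                exhausted[turn] = True       # this ranking has no images left
        turn = 1 - turn                      # hand the turn to the other ranking
    return selected

by_similarity = ["x1", "x2", "x3", "x4"]
by_diversity = ["x3", "x1", "x5", "x6"]
final_selection = interleave_rankings(by_similarity, by_diversity, 4)
```

In the toy example the diversity ranking skips `x1` on its second turn, because the similarity ranking has already contributed it.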

IV Implementation details

In this Section, we describe the chosen convolutional model for training simultaneously on datasets with strong and weak supervision, we present the used datasets, and we provide the hyperparameters employed in training to enable reproducibility of our experiments.

IV-A Datasets

Cityscapes Dense: Cityscapes dataset [12] contains street scene images from German cities taken from a 2 Mpixel camera mounted on a car. We used the training subset with 2975 densely (per-pixel) labeled images and the bigger subset of 20000 coarsely (per-pixel) labeled images.

Open Images v4: This dataset [13] contains 9 million images from everyday, complex scenes collected from the internet and has multiple resolutions, shooting angles, and several objects that are not relevant for automated driving. The official subset labeled with 14.6 million bounding boxes contains 1.74 million images.

Cityscapes Coarse bboxes: This is a dataset with bounding boxes that was created for the purpose of this paper from the coarse, per-pixel, instance labels of the Cityscapes Coarse subset. Specifically, for each labeled object in an image we define a bounding box using the minimum and maximum coordinates of its per-pixel labels along each axis.
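The conversion from per-pixel instance labels to bounding boxes can be sketched with NumPy. The instance map layout (one integer id per pixel, with 0 as background) and the function name are assumptions made for the example and do not describe the actual Cityscapes file format.

```python
import numpy as np

def boxes_from_instance_map(instance_map, background_id=0):
    """Return {instance_id: (x_min, y_min, x_max, y_max)} in inclusive pixel coordinates."""
    boxes = {}
    for instance_id in np.unique(instance_map):
        if instance_id == background_id:
            continue
        ys, xs = np.nonzero(instance_map == instance_id)
        boxes[int(instance_id)] = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return boxes

# Toy instance map with two labeled objects.
instance_map = np.zeros((6, 8), dtype=np.int32)
instance_map[1:3, 2:5] = 1    # object 1 covers rows 1-2 and columns 2-4
instance_map[4, 6] = 2        # object 2 is a single pixel
boxes = boxes_from_instance_map(instance_map)
```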

IV-B Convolutional model for training on strong and weak supervision

We use our published hierarchical convolutional network for training simultaneously on weak and strong supervision for semantic segmentation [2, 14]. The network consists of a conventional ResNet-50 feature extractor, that is modified to have semantic segmentation output with dilated convolutions and an upsampling module. Moreover, instead of one per-pixel classifier it consists of a hierarchy of classifiers arranged in a tree structure according to the classes hierarchy. Each of the classifiers is fed with the same convolutional features of the feature extractor. During inference, the results from all classifiers are aggregated in a per-pixel manner to output the final decisions.

IV-C Training details

For fair comparisons we train all networks in Section V with the same hyperparameters and for the same number of epochs. For the image representation extraction of Section III-A we use input image dimensions of 1024 by 2048, which are reduced to a grid of 256 by 512 receptive fields, each observing an area of approximately 200 by 200 pixels on the full image. The representation’s depth is 256.

In training the GMM, we sample only 24k of the $256\cdot 512\cdot 2975\approx 390\cdot 10^{6}$ 256-dimensional representations from all the images in the Cityscapes Dense training set. We fit the parameters of the mixture using Expectation Maximization and continue the updates until the likelihood changes by no more than 0.001 nat from one E step to the next. We use the open source implementation of the scikit-learn library [24].
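The count of available representations and one way to subsample them can be sketched as follows. The paper does not describe how the 24k representations are drawn; uniform sampling with replacement over (image, row, column) locations is an assumption of this sketch, as are the function name and the seed.

```python
import numpy as np

grid_h, grid_w, n_images = 256, 512, 2975
total_vectors = grid_h * grid_w * n_images   # 389,939,200, roughly 390 million

def sample_locations(n_samples, seed=0):
    """Draw uniformly random (image, row, column) locations of representation vectors."""
    rng = np.random.default_rng(seed)
    flat = rng.integers(0, total_vectors, size=n_samples)   # sampling with replacement
    image_idx, cell = np.divmod(flat, grid_h * grid_w)
    row, col = np.divmod(cell, grid_w)
    return image_idx, row, col

image_idx, row, col = sample_locations(24_000)
```

Sampling locations first avoids materializing all 390 million 256-dimensional vectors, since only the selected ones need to be extracted.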

V Experiments

First, we present the overall results in Section V-A for our two selection methods applied on two diverse datasets and we analyze them in Sections V-C, V-B. Then in Section V-D, we perform ablation experiments for the parameters of the models and we present an analysis and intuitions behind our methods. All IoU results refer to training a hierarchical segmentation network, as described in Section IV, on Cityscapes Dense (per-pixel labels) and on the weakly labeled (per-bounding-box) dataset (Cityscapes Coarse or Open Images). We evaluate the proposed data selection methods on the per-pixel labeled Cityscapes validation set unless otherwise noted. We use the Intersection over Union metric [4]. The IoU results are averaged over the last 5 (Cityscapes) epochs, when the model converges, since the variance is high. The mIoU results are the mean IoU over the classes that receive extra supervision from the weakly labeled dataset.
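For reference, the Intersection over Union metric and its mean over a subset of classes can be computed from a confusion matrix as sketched below. This is the common definition of the metric, not the authors' evaluation code; the function names and the toy label maps are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count pixels per (true class, predicted class) pair."""
    idx = y_true.ravel() * n_classes + y_pred.ravel()
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def per_class_iou(conf):
    """IoU = TP / (TP + FP + FN) per class; NaN for classes that never occur."""
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(union > 0, tp / union, np.nan)

def mean_iou(conf, class_ids):
    """Mean IoU restricted to the given classes, e.g. those receiving extra supervision."""
    return float(np.nanmean(per_class_iou(conf)[class_ids]))

# Toy ground truth and prediction with three classes.
y_true = np.array([[0, 0, 1], [1, 2, 2]])
y_pred = np.array([[0, 1, 1], [1, 2, 0]])
conf = confusion_matrix(y_true, y_pred, n_classes=3)
iou = per_class_iou(conf)
miou_subset = mean_iou(conf, [1, 2])
```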

V-A Overall results

Figure 3 shows the mIoU performance on the Cityscapes validation set for various combinations of our selection methods and datasets. We experiment on two different datasets. Cityscapes Coarse is a subset of Cityscapes and as such contains images of street scenes, all shot from a specific point of view. Open Images is a generic scene dataset, collected from various image sources and points of view, in which street scenes are rare. We demonstrate that data selection from both datasets is beneficial, improving performance while reducing the amount of required data.

From Figure 3 we observe that class scoring heuristics are advantageous for both datasets. As anticipated, for the same amount of selected data the mIoU is always higher for Cityscapes Coarse than for Open Images, since the former contains images that are more visually similar to the Cityscapes validation set. Interestingly, when the selection is limited to 1000 images, scoring heuristics perform even lower than the baseline. This is unintuitive since we add data over the baseline, but it can be explained by the fact that we use exactly the same hyperparameters (number of epochs, etc.) for all training rounds. Finally, GMM selection performs very well with as few as 1000 selected images, but the increase is marginal when more images are selected.

Furthermore, it is worth noting that in the case of selection from Open Images, GMM selection attains almost the same performance as scoring heuristics but requires 100 times fewer images. Finally, combining the two selection methods yields better results than using the methods separately for every amount of selected images, except for the 1k case.

V-B Training with Cityscapes Dense and Open Images

Open Images is a completely different dataset than Cityscapes. It contains images from a variety of generic natural scenes, and the street scene images are very limited [13], thus the domain gap between them is large, as can be seen also from Figure 5. Open Images is labeled with 600 classes, the majority of which are not relevant for automated driving. We experiment with both of our selection methods, since both diversity for street scene classes and visual similarity with Cityscapes Dense is needed.

Table I shows the detailed mIoU performance on Cityscapes Dense, when the hierarchical model is trained on per-pixel labels of Cityscapes Dense and on various amounts of selected images from per-bounding-box labels of Open Images. In the first row, the mIoU for random selection is shown, which represents a strong baseline. In the second and third rows, the proposed techniques of Section III are studied. We observe that selection through the GMM yields a higher gain for small amounts of weakly labeled images, while selection with scoring heuristics is better when using more than 20000 weakly labeled images.

Moreover, we investigate the option of combining both selection methods, so that we have both high visual similarity and high object diversity. The final collection of images is obtained by selecting the same amount from each of the two rankings, so that each method contributes half of the selected images after removing duplicate images. Interestingly, the two selection methods have dissimilar rankings, as can be seen in the analysis of Section V-D.

Finally, we observe that for different number of available images with weak labels, a different selection method is more suitable. Knowing that Cityscapes Dense has 2975 training images, we observe that, if only 1000 weakly labeled images are available, then selecting similarity (GMM) over diversity (heuristics) works better, and the model does not overfit. If weakly labeled images are 100 times more, then selecting object diversity gives better results.

In Table II we present the detailed per class IoU for the GMM selection method. Three classes (car, bicycle, and person) have little gain in performance from the increase in the number of selected images. Four classes (truck, bus, motorcycle, and rider) have significant gain in performance. A potential reason for both groups can be the different point of view that these classes are depicted in the images of Cityscapes and Open Images.

The most interesting case is the train class. We observe that as we include more images, up to 20k, the IoU drops dramatically, and it rises back to a satisfactory level only when using 100k images. This clearly signifies that although the images including trains may appear visually similar as a whole, the trains in Cityscapes and Open Images have completely different appearance. This highlights the need to investigate visual similarity per class instead of per image, but we leave this for future research.

V-C Training with Cityscapes Dense and Cityscapes Coarse

Cityscapes Coarse is a subset of Cityscapes and is thus visually similar to Cityscapes Dense, on which performance is evaluated. Through this experiment we can therefore examine the selection method aiming for object diversity in isolation; for completeness, however, we also present results from combining both selection methods. Table III illustrates that our selection methods are useful when using few images from the weakly labeled dataset, but bring no benefit when using more than 10k images. The significant performance drop in the third column, with 10k images, is expected and is due to our chosen scheme of scoring heuristics. Specifically, by examining the per class IoU results, we discover that the performance drop is proportional to the scores we assigned during the class scoring heuristics. Moreover, as can be seen from the last row of Table III, GMM selection does not add much, since visual similarity is already attained by using the same dataset.

V-D Analysis and ablation experiments

In this Section we present results towards characterizing the image domains of the datasets we used and perform ablation experiments on the number of components $K$ of the GMM model.

Dataset characterization and visual similarity

Figure 4 shows the tSNE embeddings [25] of the 256-dimensional image representations $\bar{\Phi}$ for a sample of images from the two datasets. For both datasets, we sampled 3000 representations randomly using our model on the respective training sets. We observe that the distributions of the representations have minimal overlap. This separation explains why the GMM can fit the representations from Cityscapes so well and single out representations from Open Images that are dissimilar.

Figure 5 illustrates statistics of the visual similarity measure defined in Section III, i.e. the max log probability of the GMM, for all images of the three used datasets. From Figure 5 it can be seen that the histograms of the max log probability for the Cityscapes Dense and Coarse image subsets are very similar, which confirms their common origin as subsets of Cityscapes. On the contrary, the histogram of Open Images is more spread out and has a very small overlap with Cityscapes Dense. The spread of the Open Images histogram shows that the scene variety is high, and only a small subset is visually similar to Cityscapes. The difference confirms our hypothesis that the images from Open Images follow a different distribution.

Number of GMM components

In this ablation experiment we investigate the optimal number $K$ of GMM components for modeling the Open Images domain, using the performance on Cityscapes validation as an indirect metric. As can be seen from Table IV, $K=5$ gives the highest mIoU. The selection of this parameter follows the intuition that representations have a simple and compact structure, as indicated by the tSNE plot of Figure 4, and is guided by the Bayesian Information Criterion (BIC).
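A BIC-guided choice of $K$ can be sketched with scikit-learn's `GaussianMixture.bic`, where lower values are better. The data below is synthetic (two well separated clusters), and the candidate values, the diagonal covariance type, and the function name are assumptions for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bic_per_k(data, candidate_ks, seed=0):
    """Fit one diagonal-covariance GMM per candidate K and return its BIC (lower is better)."""
    scores = {}
    for k in candidate_ks:
        gmm = GaussianMixture(n_components=k, covariance_type="diag", random_state=seed)
        gmm.fit(data)
        scores[k] = gmm.bic(data)
    return scores

rng = np.random.default_rng(0)
# Synthetic representations drawn from two well separated clusters.
data = np.vstack([
    rng.normal(-5.0, 1.0, size=(200, 4)),
    rng.normal(5.0, 1.0, size=(200, 4)),
])
scores = bic_per_k(data, [1, 2, 8])
best_k = min(scores, key=scores.get)
```

On this toy data the criterion favors two components: a single Gaussian fits poorly, and eight components are penalized for their additional parameters.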

Common images in rankings

When the rankings from both selection methods are used (experiment in Table I), a conflict of ranking positions arises, which we resolved by letting each ranking contribute half of the images to the final selection. Here we compute the agreement of the two ranking approaches. From Table V it can be seen that the two selection methods have different preferences, and also that visual similarity does not induce object diversity and vice versa. Specifically, we were surprised to find that among 1000 selected images only 0.3% were in the top 500 of both methods.
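The agreement between two rankings can be measured with a set intersection, as in the sketch below. It assumes, following the 1000-image example in the text, that the overlap is counted within the top half of each ranking and reported relative to the total number of selected images. The function name and the toy rankings are hypothetical.

```python
def common_in_top(ranking_a, ranking_b, n_selected):
    """Count images in the top n_selected/2 of both rankings; also return the share in percent."""
    half = n_selected // 2
    common = set(ranking_a[:half]) & set(ranking_b[:half])
    return len(common), 100.0 * len(common) / n_selected

ranking_a = ["a", "b", "c", "d", "e", "f"]
ranking_b = ["f", "b", "x", "y", "z", "a"]
n_common, percentage = common_in_top(ranking_a, ranking_b, 6)

# The 1k column of Table V: 3 common images among 1000 selected ones.
share_1k = 100.0 * 3 / 1000
```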

VI Discussion and future work

Although we apply our data selection methodology to selecting weak supervision data for multiple dataset semantic segmentation training, it can be used in a variety of problems, such as choosing images for selective per-pixel annotation, balancing training data between datasets, and selecting from multiple datasets for semantic segmentation, which we will explore in future work.

Another point of discussion concerns using the GMM to capture the representation manifold. We have no grounded reasons to assume that the representations follow local Gaussian distributions. We have attempted to use dimensionality reduction techniques, like PCA and tSNE, but they give no guarantees. To better capture the representation manifold of a particular domain, we might consider more flexible distribution approximators, like normalizing flows [26] or kernel density estimators [27]. Despite these considerations, our approach using GMMs has shown improvements, so we expect even larger improvements with more flexible density estimators.

A final point for consideration is to select images using local, per-object visual similarity instead of image-level similarity. It is clear from the experiments of Sections V-A, V-B that although the selected images are similar and depict street scenes in general, the appearance of objects in them can be completely different from what we wish for.

VII Conclusion

We presented two data selection methods targeting visual similarity and object diversity for the problem of semantic segmentation. We tested both methods in the case of training a convolutional network with strong and weak supervision. The selection methods proved particularly useful for selecting images from the weakly labeled datasets, and dramatically decreased the number of required training images, 20 times for Cityscapes and 100 times for Open Images. Moreover, we took steps for characterizing the visual domain of a dataset by modeling the representations of its images with a Gaussian Mixture Model.

Bibliography

[1] D. Fourure, R. Emonet, E. Fromont, D. Muselet, A. Trémeau, and C. Wolf, “Semantic segmentation via multi-task, multi-domain learning,” in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, 2016, pp. 333–343.
[2] P. Meletis and G. Dubbelman, “Training of convolutional networks on multiple heterogeneous datasets for street scene semantic segmentation,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1045–1050.
[3] A. Geiger et al., “Robust vision challenge,” http://robustvision.net/index.php, 2018, [Online; accessed 12-April-2019].
[4] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[5] L. Ye, Z. Liu, and Y. Wang, “Learning semantic segmentation with diverse supervision,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 1461–1469.
[6] J. Xu, A. G. Schwing, and R. Urtasun, “Learning to segment under various forms of weak supervision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3781–3790.
[7] M. P. Kumar, H. Turki, D. Preston, and D. Koller, “Learning specific-class segmentation from diverse data,” in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 1800–1807.
[8] V. Birodkar, H. Mobahi, and S. Bengio, “Semantic redundancies in image-classification datasets: The 10% you don’t need,” arXiv preprint arXiv:1901.11409, 2019.