Unsupervised Feature Learning in Remote Sensing

Aaron Reite; Scott Kangas; Zackery Steck; Steven Goley; Jonathan Von Stroh; and Steven Forsyth

arXiv:1908.02877·cs.CV·September 24, 2019


TL;DR

This paper demonstrates the application of a state-of-the-art unsupervised learning algorithm to remote sensing data, enabling effective feature extraction for various tasks without requiring labeled data.

Contribution

It applies an existing unsupervised feature learning method (UFL) to remote sensing data and shows that the resulting feature extractor handles noisy, imbalanced data and adapts to multiple tasks.

Findings

01 Effective visual similarity search across classes

02 Successful outlier detection in imbalanced data

03 Automatic learning of class hierarchies

Abstract

The need for labeled data is among the most common and well-known practical obstacles to deploying deep learning algorithms to solve real-world problems. The current generation of learning algorithms requires a large volume of data labeled according to a static and pre-defined schema. Conversely, humans can quickly learn generalizations based on large quantities of unlabeled data, and turn these generalizations into classifications using spontaneous labels, often including labels not seen before. We apply a state-of-the-art unsupervised learning algorithm to the noisy and extremely imbalanced xView data set to train a feature extractor that adapts to several tasks: visual similarity search that performs well on both common and rare classes; identifying outliers within a labeled data set; and learning a natural class hierarchy automatically.

Tables

Table 1: Summary statistics for our classification data set derived from xView.

Statistic	Training	Test
Minimum Chips per Class	17	2
Median Chips per Class	629	125
Maximum Chips per Class	307221	100899
Total Chips	589119	187156

Table 2: Classification using supervised and unsupervised learning methods with a ResNet18 backbone. Scores are class-averaged accuracies in percent. The winning unsupervised method (UFL Fine-tuned) is bold in the original.

Method	Top-1 Accuracy	Top-5 Accuracy
Supervised Random Init.	35.8	50.5
Supervised Fine-tuned	42.4	65.6
Autoencoder Random Init.	3.8	20.5
Autoencoder Fine-tuned	3.6	19.9
UFL Random Init.	6.8	28.9
UFL Pre-trained, not Fine-tuned	12.9	47.6
UFL Fine-tuned	18.3	54.5

Table 3: Detailed results for ResNet18 UFL fine-tuned from ImageNet.

Class	Train Population	Test Population	Top-1	Top-5
Building	307221	100899	98.32	99.89
Building Aircraft Hangar	180	97	4.12	64.95
Building Damaged	1036	306	1.31	44.77
Building Facility	823	316	2.85	71.2
Building Hut-Tent	703	126	0	33.33
Building Shed	1176	355	0.28	29.86
Construction Site	1033	459	67.54	90.85
Container Lot	2120	871	39.95	81.17
Engineering Vehicle (EV)	204	46	4.35	21.74
EV Cementmixer	287	87	2.3	49.43
EV Container Crane	159	46	2.17	54.35
EV Crane	173	46	4.35	10.87
EV Dump Truck	1344	349	4.01	45.85
EV Excavator	830	326	48.77	77.91
EV Grader	83	39	2.56	35.9
EV Haul Truck	325	40	35	95
EV Loader	626	370	9.46	55.68
EV Mobile Crane	313	76	0	26.32
EV Reach Stacker	69	30	0	20
EV Straddle Carrier	57	63	0	30.16
EV Tower Crane	144	56	0	42.86
EV Tractor-Scraper	78	19	0	5.26
Fixed Wing Aircraft (FWA)	73	39	7.69	25.64
FWA Cargo	633	323	82.35	97.21
FWA Small	354	110	48.18	84.55
Helicopter	68	94	10.64	37.23
Helipad	120	55	38.18	69.09
Maritime Vessel (MV)	633	143	12.59	57.34
MV Barge	171	62	0	35.48
MV Container	271	78	38.46	92.31
MV Ferry	183	16	12.5	56.25
MV Fishing	723	289	13.15	50.52
MV Motor	1447	154	22.08	59.09
MV Oil	64	23	8.7	65.22
MV Sail	692	24	20.83	50
MV Tug	209	45	8.89	75.56
MV Yacht	430	7	0	42.86
Passenger Vehicle (PV)	2949	1299	1.08	48.42
PV Bus	6865	2047	25.6	81.63
PV SmallCar	210827	61105	98.3	99.91
PV Pickup	1101	393	0	49.36
Pylon	349	124	41.13	80.65
Rail Vehicle (RV)	17	2	0	0
RV Cargo	1811	1843	64.35	91.05
RV Flat	123	70	28.57	67.14
RV Locomotive	116	37	0	24.32
RV Passenger	1567	383	27.68	81.2
RV Tank	120	83	36.14	80.72
Shipping Container	1570	612	2.29	50
Storage Tank	1625	807	40.89	82.78
Tower	84	19	0	10.53
Truck	12052	4369	7.87	78.51
Truck w/ Box Trailer	3562	767	10.3	58.41
Truck Cargo	5857	2035	2.01	57.05
Truck w/ Flatbed Trailer	883	262	1.15	23.66
Truck w/ Liquid Trailer	145	55	0	7.27
Truck Tractor Trailer	857	277	6.5	23.83
Truck Trailer	4045	1254	4.31	48.64
Truck Utility	3603	1705	0.35	49.85
Vehicle Lot	3936	1124	46.71	91.28

Equations

$$P(i\mid v)=\frac{e^{w_{i}^{T}v}}{\sum_{j=1}^{n}e^{w_{j}^{T}v}} \tag{1}$$

$$P(i\mid v)=\frac{e^{(v_{i}^{T}v)/\tau}}{\sum_{j=1}^{n}e^{(v_{j}^{T}v)/\tau}} \tag{2}$$

$$J(\theta)=-\sum_{i=1}^{n}\log P(i\mid f_{\theta}(x_{i})) \tag{3}$$

$$\sum_{v_{i}\in\mathcal{N}_{k}}\delta_{c_{i},c_{j}}\,e^{(v_{i}^{T}\hat{v})/\tau},\quad\text{where }\delta_{c_{i},c_{j}}=\begin{cases}1&\text{when }c_{i}=c_{j}\\0&\text{when }c_{i}\neq c_{j}\end{cases} \tag{4}$$


Full text

Aaron Reite (a, *), Scott Kangas (b, *), Zackery Steck (b), Steven Goley (b), Jonathan Von Stroh (c), and Steven Forsyth (d)

(a) NGA Research, 7500 GEOINT Dr, Springfield, VA, USA

(b) Etegent Technologies Ltd, 5050 Section Ave, Suite 110, Cincinnati, OH, USA

(c) CACI, 15955 E Centretech Pkwy, Aurora, CO, USA

(d) NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA, USA

(*) These authors contributed equally to this work.

Further author information: (Send correspondence to A.R. and S.K.)

A.R.: E-mail: [email protected]

S.K.: E-mail: [email protected]


Keywords: remote sensing, unsupervised learning, deep learning, classification, similarity search, anomaly detection, hierarchy discovery

1 Introduction

In recent years, the state of the art for nearly every benchmark task in computer vision has been achieved with deep convolutional neural networks (CNNs) trained via supervised learning on large, labeled data sets.[1, 2] However, obtaining high-quality labeled data sets at the scale required to train deep CNNs successfully is costly, or even impossible, for many real-world computer vision applications, such as those requiring proprietary data or expert human labeling, or those with limited real-world examples.

Additionally, most of these CNNs are trained in static scenarios where a fixed number of labels are encountered, and these labels are consistent between training and testing. In practice, this “closed-world” assumption is often violated: instances may contain multiple classes, making hard labels problematic; the domain may change as new classes arise; more fine-grained labels may be desired, etc. Transfer learning is a common method to address these issues: starting with a trained model, the final layers are removed and replaced with randomly initialized layers before undergoing additional training to adapt to the new data distribution. Of course, this new model requires even more labeled data and is subject to its own “closed-world” assumption, so it is similarly brittle if the test distribution changes again.

In this paper, we investigate unsupervised learning techniques applied to remotely sensed imagery, with the dual goal of training a feature extractor that is readily adaptable to new tasks and that does not depend on any labeled data for training. Specifically, we use Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination (UFL) developed by Wu, Xiong, Yu, and Lin at the University of California, Berkeley, which surpassed the state-of-the-art unsupervised learning methods for classification on ImageNet 1K in 2018[3]. The authors of UFL observed that classes that appear visually correlated generally receive higher softmax output scores than classes that are visually uncorrelated (e.g., Figure 1). Based on this observation, the authors developed an unsupervised learning approach to discriminate between individual instances, completely ignoring class labels and allowing the network to learn similarity between instances without a need for semantic categories. Treating each instance as a class results in computational challenges: for ImageNet the number of “classes” expands from the original 1000 to 1.2 million, i.e., the number of images in the training set. The authors of UFL address these computational challenges with a low-dimensional memory bank and a noise-contrastive estimate of their non-parametric softmax classifier. We discuss the UFL method in detail in Section 2.2.

The data we use in our experiments is derived from the xView object detection data set[4]. Although originally intended to advance the state-of-the-art for object detection in overhead imagery, we use the provided annotation bounding boxes to extract image chips around each object for the purpose of classification. Further detail about xView and our extracted image chips is covered in Section 2.1.

We use our UFL trained feature extractor for several diverse computer vision tasks. First, we emulate the UFL paper’s unsupervised classification experiment on our data in Section 3.1. We find that UFL greatly outperforms a common competing unsupervised method (autoencoder), and even beats a randomly initialized, fully supervised counterpart in top-5 accuracy. We then demonstrate the generalizability of our UFL trained feature extractor by applying it to three disparate tasks: we show that a UFL trained feature extractor may be used in a similarity search algorithm that performs well even with highly imbalanced classes in Section 3.2; we identify outliers and errors in xView class labels in Section 3.3; and we learn a visual hierarchy automatically in Section 3.4.

2 Approach

2.1 Data Set

xView is the Defense Innovation Unit (DIU) and National Geospatial-Intelligence Agency (NGA) large-scale detection data set, containing 1,413 km$^{2}$ of 3-band, panchromatic-sharpened WorldView-3 satellite imagery. The images were collected at 0.3 m ground sample distance, with 60 types of objects and land use categories exhaustively labeled with axis-aligned bounding boxes, for a total of more than 1M bounding boxes.[4] During the 2018 xView Detection Challenge, a significant portion of these data were publicly released: 847 1 km$^{2}$ images with 601,937 bounding box labels for training, and 281 additional 1 km$^{2}$ images without labels[5]. We procured the bounding box labels for these 281 additional images for our research and reserved them for testing.

The objects in xView present unique challenges that differ from standard detection tasks such as Microsoft’s Common Objects in Context (COCO)[6]. For example, most of the xView objects are very small (10s of pixels) and have no standard orientation—they are rotated in different directions. Perhaps most challenging is the extreme class imbalance present in xView. If the data were class balanced, each of the 60 classes would have 10K examples for training. However, just two classes (Small Car and Building) account for almost 90% of the training data, so most classes have far fewer examples: many have 100s of examples, while some have just 10s. Figure 2 demonstrates xView’s extreme class imbalance. The first place solution to the 2018 xView Detection Challenge cites class imbalance as the most difficult obstacle to building a successful detector, and develops a novel loss function based on Focal Loss to assist[7, 8].

We created a classification data set from xView by cropping square image chips centered around each bounding box, using the box’s longest dimension. We chose square chips to allow ingestion into standard CNNs without resizing each dimension independently, and to allow some neighboring pixels for each object in the chip to provide context. For example, boat chips include neighboring water pixels and car chips include neighboring road pixels. We explored alternative methods to create image chips that would allow varying amounts of context and still provide square chips, such as taking rectangular crops proportionately longer than each side of the bounding box and then zero-padding to make a square; however, we found no measurable benefit in our experiments to these more complicated methods. We discarded chips that cross image boundaries and therefore contain incomplete objects. Table 1 contains a few key statistics about our data set.
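The chip extraction rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function name `square_chip_bounds` is hypothetical, bounding boxes are assumed to be integer pixel coordinates `(xmin, ymin, xmax, ymax)`, and the rounding of the chip origin is our own choice.

```python
def square_chip_bounds(xmin, ymin, xmax, ymax, img_w, img_h):
    """Return (left, top, right, bottom) of a square chip centered on the box,
    or None if the chip would cross the image boundary."""
    side = max(xmax - xmin, ymax - ymin)   # longest box dimension sets the chip size
    cx = (xmin + xmax) / 2.0               # box center
    cy = (ymin + ymax) / 2.0
    left = int(round(cx - side / 2.0))
    top = int(round(cy - side / 2.0))
    right = left + side
    bottom = top + side
    # Discard chips that cross the image boundary (incomplete objects).
    if left < 0 or top < 0 or right > img_w or bottom > img_h:
        return None
    return left, top, right, bottom


# A 20 x 6 pixel box well inside a 100 x 100 image gives a 20 x 20 chip.
inside = square_chip_bounds(10, 20, 30, 26, 100, 100)
# A wide box touching the top edge would need rows above the image, so it is dropped.
at_edge = square_chip_bounds(0, 0, 30, 6, 100, 100)
```

Because the chip side equals the longest box dimension, elongated objects pick up neighboring pixels along their short axis, which is where the context mentioned above comes from.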

Like many computer vision data sets, xView contains labeling errors. While deep learning has proven to be robust to labeling errors in well-sampled classes, the effect of these errors increases dramatically in classes containing few instances. As an extreme example, our test set for Railway Vehicle contains just two instances, both of which are labeled incorrectly (they are identifiable as specific types of railway vehicles). Nonetheless, after extensive experimentation with customized classes derived from the original 60, we eventually elected to keep all 60 classes as designated by the original xView paper to ease repeatability. Accordingly, our results may be significantly improved by eliminating a few very small or erroneous classes (e.g., Railway Vehicle), merging a few classes that are visually indistinct (e.g., Small Car and Passenger Vehicle), and splitting a few classes that contain visually distinct sub-classes (e.g., Tower).

2.2 Unsupervised Feature Learning (UFL)

Figure 3 shows the overall pipeline for the UFL approach. A standard CNN is used to embed each image as a feature vector, which is then $L^{2}$ normalized prior to being passed to a non-parametric softmax classifier for instance level discrimination (described in Section 2.2.1). The feature embedding is trained to maximally distribute the embedded features over the unit hypersphere. For evaluation, the authors implemented a KNN classifier using cosine similarity between the feature vector for a test image and those for the images used during training.

Benefits of this unsupervised approach are two-fold: annotations are not needed at training time thereby eliminating the burdensome task of data labelling to train a feature extractor, and the method is agnostic to network architecture so it can be implemented on any current or future state-of-the-art network design.

2.2.1 Non-Parametric Instance Discrimination

Let $f_{\theta}$ be a CNN with parameters $\theta$ mapping images $x_{i}$ to feature vectors $v_{i}=f_{\theta}(x_{i})$ . A conventional parametric softmax classifier will classify image $x$ as class $i$ with probability:

$$P(i\mid v)=\frac{e^{w_{i}^{T}v}}{\sum_{j=1}^{n}e^{w_{j}^{T}v}} \tag{1}$$

where $n$ is the number of classes, $w_{j}$ is a learned weight vector for class $j$ , and the inner product $w_{j}^{T}v$ computes the similarity between $v=f_{\theta}(x)$ and class $j$ .

Instead of computing the similarity between a feature vector $v$ and the class weight vectors $w_{j}$ , which may be interpreted as class prototypes, the authors of UFL modified Equation 1 to compute the similarity of $v$ to other feature vectors: $v_{j}^{T}v$ . As all feature vectors have unit norm, this is the cosine similarity between $v$ and $v_{j}$ . Thus, in UFL Equation 1 becomes:

$$P(i\mid v)=\frac{e^{(v_{i}^{T}v)/\tau}}{\sum_{j=1}^{n}e^{(v_{j}^{T}v)/\tau}} \tag{2}$$

where $\tau$ is a temperature parameter controlling the concentration of the $v_{j}$ on the unit sphere and $n$ is the total number of instances in the training set. $P(i|v)$ approaches $1$ if $v$ is close to $v_{i}$ and far away from all other $v_{j}$, and approaches $0$ if $v$ is far away from $v_{i}$ (so long as $\tau$ is chosen to be sufficiently small; we used the UFL paper’s recommended value $\tau=0.07$).
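To make the behavior of the non-parametric softmax concrete, the following is a minimal pure-Python sketch of Equation 2 on a toy memory bank. The helper names `l2_normalize` and `instance_probabilities` and the three 2-D vectors are illustrative assumptions. The sketch evaluates the full softmax, not the noise-contrastive approximation used in practice.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def instance_probabilities(v, memory_bank, tau=0.07):
    """P(i | v) for every instance i in the memory bank (Equation 2)."""
    logits = [sum(a * b for a, b in zip(v_j, v)) / tau for v_j in memory_bank]
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Toy memory bank of three unit vectors and a query lying close to the first one.
bank = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0])]
query = l2_normalize([0.99, 0.1])
probs = instance_probabilities(query, bank)
```

With $\tau=0.07$, nearly all of the probability mass lands on the first instance. Raising `tau` flattens the distribution, which is why a small temperature is needed for the limits described above.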

Unfortunately, calculating the non-parametric softmax in Equation 2 may be computationally prohibitive when each image is considered its own class; e.g., ImageNet has 1.2M annotated images, all of which require feature vectors for the computation. Instead of repeatedly computing each feature vector $v_{j}$, the feature vectors are stored in a memory bank $V=\{v_{j}\mid 1\leq j\leq n\}$. The dimension of the feature space is chosen to be relatively low: the features lie on the unit hypersphere in $\mathbb{R}^{128}$ (in both our work and the UFL paper, differing values for the feature dimension were explored, but none offered significant benefit). With this choice, all 1.2M feature vectors for ImageNet require only about 600 MB of memory.
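The memory figure quoted above can be checked with simple arithmetic. The sketch below assumes 32-bit floating-point storage (an assumption, not stated in the text) and also applies the same calculation to the 589,119 training chips from Table 1.

```python
feature_dim = 128
bytes_per_float = 4            # assumption: features stored as 32-bit floats

imagenet_instances = 1_200_000  # approximate ImageNet training-set size quoted above
xview_chips = 589_119           # total training chips from Table 1

imagenet_bytes = imagenet_instances * feature_dim * bytes_per_float
xview_bytes = xview_chips * feature_dim * bytes_per_float

print(imagenet_bytes / 1e6)    # memory bank size in megabytes (10^6 bytes)
print(xview_bytes / 1e6)
```

Under this assumption the ImageNet memory bank comes to roughly 614 MB, and the bank for our xView chips to roughly 302 MB, so both fit comfortably in the memory of a single modern GPU.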

The learning objective is to minimize the negative log-likelihood over the training set:

$$J(\theta)=-\sum_{i=1}^{n}\log P(i\mid f_{\theta}(x_{i})) \tag{3}$$

The loss therefore depends on how close each $f_{\theta}(x_{i})$ is to $v_{i}$, $x_{i}$’s feature vector from the previous iteration, as well as how far $f_{\theta}(x_{i})$ is from all the other feature vectors $v_{j}$. As such, a minimal solution for $\theta$ will result in an equidistribution of the $v_{j}$ over the sphere. As $f_{\theta}$ has a finite number of convolutional filters and the embedding space $S^{127}$ is compact, a minimal solution forces the $v_{j}$ that activate the same convolutional filters (i.e., are visually similar) to be close together. During each learning iteration, all network parameters $\theta$ are updated via stochastic gradient descent, and $v_{i}\in V$ is replaced with the newly computed $f_{\theta}(x_{i})$.
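As a rough illustration of one iteration's bookkeeping, the sketch below evaluates the negative log-likelihood for a batch against the memory bank and then overwrites the corresponding bank entries. It is a simplified sketch under stated assumptions: the function name `nll_and_update` is hypothetical, the gradient step on $\theta$ is omitted, and the full softmax denominator is used where the paper uses the noise-contrastive approximation.

```python
import math


def nll_and_update(features, indices, memory_bank, tau=0.07):
    """Sum of -log P(i | f(x_i)) over a batch, then overwrite the bank entries.

    features: L2-normalized embeddings f_theta(x_i) for the batch
    indices:  position i of each batch item in the memory bank
    """
    loss = 0.0
    for f, i in zip(features, indices):
        logits = [sum(a * b for a, b in zip(v_j, f)) / tau for v_j in memory_bank]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_z)       # -log P(i | f) for this instance
    for f, i in zip(features, indices):
        memory_bank[i] = list(f)           # v_i <- f_theta(x_i)
    return loss


bank = [[1.0, 0.0], [0.0, 1.0]]
# An embedding that matches its own bank entry incurs almost no loss.
loss_match = nll_and_update([[1.0, 0.0]], [0], bank)
# An embedding that sits on another instance's entry is penalized heavily.
loss_clash = nll_and_update([[1.0, 0.0]], [1], bank)
```

The second call also shows the update rule: after it runs, the bank entry for instance 1 has been replaced by the new embedding.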

The burden of computing Equations 2 and 3 may be reduced even further, and dramatically, by employing noise-contrastive estimation, which replaces the denominator of Equation 2 with a constant computed from a Monte Carlo approximation during the initial few batches[9]. This reduces the learning objective to a much simpler form. We use the same normalizing constant and computational approximations as the UFL authors and refer the reader to Section 3.2 of Wu et al. for additional details[3].

Note that if random augmentation is employed during training, such as flips, rotations, color jitter, etc., then each feature vector $v_{i}\in V$ may be interpreted as a class prototype for the class created from image $x_{i}$ by applying random augmentations. As such, $f_{\theta}(x_{i})$ will generally not equal $v_{i}$ even if the network parameters $\theta$ remain unchanged. This provides a consistent learning signal to teach the network to become invariant to augmentations, rather than just spread the feature vectors evenly across the sphere.

2.2.2 Weighted KNN Classification

In order to classify an image $\hat{x}$ in our validation set, we first compute its feature $\hat{v}=f_{\theta}(\hat{x})$ and compare it to all of the feature vectors $v_{i}\in V$ using cosine similarity: $v_{i}^{T}\hat{v}$ . We then take the $k$ feature vectors in the memory bank that are closest to $\hat{v}$ with respect to cosine similarity—the $k$ nearest neighbors, $\mathcal{N}_{k}$ . Finally, $\hat{x}$ is classified by a weighted voting of the classes in $\mathcal{N}_{k}$ ; specifically, if we denote the class of image $x_{j}$ corresponding to memory bank vector $v_{j}\in\mathcal{N}_{k}$ by $c_{j}$ , then class $c_{j}$ ’s vote is:

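$$\sum_{v_{i}\in\mathcal{N}_{k}}\delta_{c_{i},c_{j}}\,e^{(v_{i}^{T}\hat{v})/\tau},\quad\text{where }\delta_{c_{i},c_{j}}=\begin{cases}1&\text{when }c_{i}=c_{j}\\0&\text{when }c_{i}\neq c_{j}\end{cases} \tag{4}$$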

where $\tau$ controls how $v_{i}$’s distance from $\hat{v}$ affects its vote. With $\tau=0.07$, as in training, a feature vector that coincides with $\hat{v}$ will count roughly 6 to 7 times more than a feature vector that is 30 degrees away ($e^{(1-\cos 30^{\circ})/\tau}\approx 6.8$). Nonetheless, $k$ must be chosen to limit the size of $\mathcal{N}_{k}$, particularly in the case of imbalanced classes; otherwise, the vote will be dominated by the most populous classes despite small values of $\tau$. We use $k=50$.
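The weighted vote can be sketched as follows. This is an illustrative pure-Python version with a brute-force neighbor search. The function name `weighted_knn_predict` and the toy vectors and labels are assumptions, and all vectors are assumed to be unit length so that the dot product is the cosine similarity.

```python
import math
from collections import defaultdict


def weighted_knn_predict(query, memory_bank, labels, k=50, tau=0.07):
    """Rank classes by summed exp(cosine / tau) over the k nearest neighbors."""
    sims = [(sum(a * b for a, b in zip(v, query)), c)
            for v, c in zip(memory_bank, labels)]
    sims.sort(key=lambda t: t[0], reverse=True)   # most similar first
    votes = defaultdict(float)
    for s, c in sims[:k]:
        votes[c] += math.exp(s / tau)             # each neighbor votes for its class
    return sorted(votes, key=votes.get, reverse=True)


# Toy example: one exact match of class "a" outvotes two farther "b" neighbors.
toy_bank = [[1.0, 0.0], [0.8, 0.6], [0.8, 0.6], [0.0, 1.0]]
toy_labels = ["a", "b", "b", "c"]
ranking = weighted_knn_predict([1.0, 0.0], toy_bank, toy_labels, k=3)

# Weight ratio between a neighbor at 0 degrees and one at 30 degrees.
ratio = math.exp((1.0 - math.cos(math.radians(30.0))) / 0.07)
```

The returned ranking gives top-1 as its first entry and top-5 as its first five entries. The `ratio` value evaluates to about 6.8, in line with the weighting discussed above.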

3 Experiments

In all of our experiments, we first learn a robust feature extractor in an unsupervised setting by applying UFL to our data set. The performance of this feature extractor is judged by the top-5 unsupervised classification score on our test set.

3.1 Unsupervised Classification

We use ResNet18 as the backbone CNN with a low embedding dimension of 128. As noted, we use $\tau=0.07$ and $k=50$, with an initial learning rate of $0.03$. We quickly discovered that pre-training on ImageNet results in dramatic improvements; in fact, pre-training on ImageNet without fine-tuning readily beats training for 200 epochs from randomly initialized weights, clearly demonstrating UFL’s robustness to domain adaptation. Fine-tuning requires a smaller learning rate and an aggressive decay schedule (we use a learning rate of $0.001$ and decay by a factor of $0.5$ every two epochs). In addition to our UFL models, we train two additional ResNet18-based models in order to establish baseline accuracy performance:

• An autoencoder with reconstruction loss and a 128-dimensional embedding space. During evaluation, the decoder stage of the network was removed and the weighted KNN classification described in Section 2.2.2 was applied to the embeddings.

• A supervised model using softmax and cross-entropy loss and class-balanced sampling. This model represents the standard technique given the benefit of fully labeled data and should represent a ceiling on unsupervised performance.

All models were trained with random initialization as well as fine-tuning after being pre-trained on ImageNet.
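The fine-tuning schedule quoted earlier in this section (learning rate $0.001$, halved every two epochs) corresponds to a simple step decay. The sketch below assumes zero-indexed epochs and that the decay is applied at epoch boundaries. The function name `fine_tune_lr` is hypothetical.

```python
def fine_tune_lr(epoch, base_lr=0.001, decay=0.5, step=2):
    """Step decay: multiply the learning rate by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)


# Learning rates for the first six epochs of fine-tuning.
schedule = [fine_tune_lr(e) for e in range(6)]
```

Under these assumptions the rate drops to one eighth of its starting value by the seventh epoch, which is what makes the schedule aggressive compared with the 200-epoch training from random initialization.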

Class-balanced sampling at train time as we used in our supervised model significantly improves our UFL results; however, we chose not to use this technique as it requires prior knowledge of the class labels and is contrary to the assumptions of unsupervised learning.

Given the large disparity in class populations in our data set, we report our top-1 and top-5 results averaged over the 60 classes, instead of averaged over all instances. This is necessary because, as noted, the Small Car and Building classes alone account for 88% of our data (36% and 52%, respectively). Given a building in the test set, a random guess using the distribution of our training data will result in a correct classification 52% of the time and a correct classification in 5 guesses (i.e., top-5) 97% of the time. As such, a random network is expected to produce top-1 and top-5 scores of 40.1 and 83.2 (respectively) when averaged over all images, but only 1.7 and 4.1 (respectively) when averaged over all 60 classes. Table 2 summarizes the results from our classification models on our data set.
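The gap between the two averaging conventions can be reproduced with a short calculation. The sketch below covers the top-1 case only and uses a hypothetical three-class toy data set with made-up counts, not the xView populations. The function name `random_guess_top1` is an assumption.

```python
def random_guess_top1(train_counts, test_counts):
    """Expected top-1 accuracy (%) when guessing from the training distribution.

    Returns (instance-averaged, class-averaged) scores.
    """
    n_train = sum(train_counts.values())
    n_test = sum(test_counts.values())
    p_guess = {c: n / n_train for c, n in train_counts.items()}
    # Instance average: each test chip is right with probability p_guess[its class].
    per_instance = sum(p_guess[c] * test_counts[c] for c in test_counts) / n_test
    # Class average: mean of the per-class accuracies p_guess[c].
    per_class = sum(p_guess[c] for c in test_counts) / len(test_counts)
    return 100 * per_instance, 100 * per_class


# Hypothetical imbalanced toy data set (counts are illustrative only).
toy = {"A": 52, "B": 36, "C": 12}
inst_avg, class_avg = random_guess_top1(toy, toy)
```

When the train and test sets share the same classes, the class-averaged score always equals 100 divided by the number of classes, regardless of imbalance. With 60 classes this gives about 1.7, matching the class-averaged top-1 figure above, while the instance-averaged score is inflated by the dominant classes.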

All of the UFL models beat the autoencoders by significant margins. Notably, pre-training UFL on ImageNet without fine-tuning on xView (UFL Pre-trained, not Fine-tuned) more than doubles the autoencoders’ scores and produces a top-5 score comparable to our randomly initialized fully-supervised model. If fine-tuned, UFL beats our randomly initialized fully-supervised model’s top-5 score by 4.0 points (54.5 vs. 50.5, a relative gain of about 8%), despite having no labeled data during training.

To examine the effects of increasing the backbone CNN’s depth and capacity, we trained UFL using ResNet50, which resulted in higher scores: top-1 = 20.2 and top-5 = 55.2 (fine-tuned from ImageNet). We hypothesize that even deeper CNN backbones will result in better accuracy. However, the purpose of our work is to demonstrate the generalizability of UFL when applied to diverse tasks in remote sensing imagery, not to set or beat benchmark scores, so we use ResNet18 in our following experiments as it is nearly as accurate, but much faster, with many fewer parameters. Detailed class-by-class results for our fine-tuned ResNet18 UFL model are included in Table 3 within Appendix A.

3.2 Similarity Search

The challenge of similarity search, or image retrieval, is this: given a query image as input, return the image in a data set most similar to the query. Traditionally, Content-Based Image Retrieval (CBIR) has required large labelled data sets for training and expensive feature-point calculations to extract important visual information[10]. However, this expectation is becoming infeasible in a data-driven world where data sets may be petabytes in size; in such cases unsupervised deep learning approaches are preferable. The appeal of using UFL for image retrieval is that it does not require knowledge of class labels for training or image retrieval, thus it is completely unsupervised. Likewise, the foundation of UFL is that visually similar objects will be close together in the embedding space, making it an ideal candidate for CBIR.

We implement an image-retrieval algorithm using the fine-tuned UFL and autoencoder networks described in Section 3.1 by modifying the weighted KNN classifier described in Section 2.2.2. Instead of collecting the nearest $k$ class labels for a test image (i.e. query), we collect the nearest $k$ instance indices, which are used to retrieve their respective images from the training data (i.e. query results). For both networks, we return the nearest five images for four query classes—Building, Small Car, Helicopter, and Aircraft. Helicopter and Aircraft are considered rare or low-shot classes, containing only 68 and 73 instances per class respectively.
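The change from classification to retrieval amounts to returning neighbor indices instead of neighbor labels. A minimal brute-force sketch follows. The function name `retrieve_similar` and the toy vectors are illustrative assumptions, and the vectors are assumed to be unit length.

```python
def retrieve_similar(query, memory_bank, k=5):
    """Indices of the k memory-bank vectors with the highest cosine similarity."""
    sims = [sum(a * b for a, b in zip(v, query)) for v in memory_bank]
    order = sorted(range(len(memory_bank)), key=lambda i: sims[i], reverse=True)
    return order[:k]


# Toy memory bank of unit vectors; the query is closest to entry 2, then entry 0.
toy_bank = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]]
hits = retrieve_similar([0.8, 0.6], toy_bank, k=2)
```

The returned indices would then be used to look up the corresponding training chips. No class labels are consulted at any point, which is what keeps the search fully unsupervised.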

Figure 4 shows the UFL architecture performs well for the Building (row 1), Small Car (row 2), and Aircraft queries (row 4). Although the results for the Helicopter query (row 3) are incorrect, they all contain multiple visual similarities: dark orthogonal lines or shadows similar to rotors and small enclosed objects similar to a helicopter’s fuselage. It’s worth noting that the Aircraft query achieved acceptable results with only five more samples in its source class than the Helicopter class.

Figure 5 shows that the autoencoder obtains similar results for both Building and Small Car queries, but is outperformed by UFL in the rare or low-shot classes. The Helicopter query (row 3) and Aircraft query (row 4) lack any visual similarity to the query image beyond background color—reconstruction loss fails to embed visually similar objects in close proximity.

In addition to the similarity search using the xView dataset, we train UFL on the UC Merced Land Use Dataset and project the resultant feature vectors into a 2-D map, using t-SNE [11, 12, 13]. As shown in Figure 6, images with similar content have feature vectors that are close in the embedding space.

3.3 Identifying Outliers

Procuring large annotated data sets often results in noisy labels, as we discussed for xView in Section 2.1. We use our UFL trained network to assist in identifying potentially erroneous labels. As UFL does not make use of labels during training, it will not attempt to tighten intra-class samples nor will it repel inter-class samples; as such, it is a great candidate to detect outliers from a training set. For this purpose, we (1) train UFL in the standard manner (fine-tuned from ImageNet), (2) create a k-d tree for the training feature vector set, (3) compute, for each instance, the distance to its nearest intra-class neighbor, (4) compute the intra-class mean of all such nearest neighbor distances, and (5) identify instances with a distance greater than two standard deviations from their intra-class mean.
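Steps (3) through (5) above can be sketched as follows. This illustration makes several assumptions that are not from the text: a brute-force search stands in for the k-d tree, Euclidean distance is used, only distances above the mean are flagged, and classes with fewer than three members are skipped. The function name `find_outliers` is hypothetical.

```python
import math
from collections import defaultdict


def find_outliers(features, labels, n_std=2.0):
    """Flag instances whose nearest same-class neighbor is unusually far away."""
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    outliers = []
    for c, members in by_class.items():
        if len(members) < 3:
            continue                       # too few samples for a meaningful spread
        # Step 3: distance from each instance to its nearest intra-class neighbor.
        nn_dist = {i: min(math.dist(features[i], features[j])
                          for j in members if j != i)
                   for i in members}
        # Step 4: intra-class mean (and standard deviation) of those distances.
        mean = sum(nn_dist.values()) / len(members)
        std = math.sqrt(sum((d - mean) ** 2 for d in nn_dist.values()) / len(members))
        # Step 5: keep instances more than n_std standard deviations above the mean.
        outliers += [i for i in members if nn_dist[i] > mean + n_std * std]
    return outliers


# Ten evenly spaced points of one class plus a single distant point.
points = [[float(x), 0.0] for x in range(10)] + [[100.0, 0.0]]
flagged = find_outliers(points, ["car"] * 11)
```

For data sets the size of ours, the brute-force inner loop would be replaced by a spatial index such as the k-d tree mentioned above. The flagged instances are candidates for manual review, not confirmed label errors.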

Figure 7 shows some examples of outliers found using this methodology. This technique is able to identify incorrectly labeled instances as well as anomalous but correctly labeled instances, such as those that are obscured, crowded or appear with unusual backgrounds.

3.4 Unsupervised Learning of Visual Hierarchies

A class hierarchy is a directed, acyclic graph where each class is a node, and edges between nodes express the relationship “is a”; for example, a Locomotive is a Rail Vehicle. Many human-created hierarchies have been used for machine learning purposes, such as WordNet (from which ImageNet labels are derived), but these hierarchies do not always make sense from a computer vision perspective as they often use non-visual relationships[14]. For example, a human may create a hierarchy in which Trailer is close to Truck Tractor w/ Box Trailer, but far from Shipping Container. However, from the computer vision perspective, Trailer and Shipping Container are nearly identical—often they may only be distinguished if a trailer’s wheels are visible, which may not be possible given the orientation of an imaging satellite.

We follow the method for learning hierarchies from a classifier’s predictions developed by Silva-Palacios, Ferri, and Ramírez-Quintana at the Universitat Politècnica de València[15]. For brevity, we only describe the method at a high level: consult Section 2.2 of Silva-Palacios’ paper for details. First, we create a confusion matrix $M$ from UFL classification predictions, as discussed in Section 3.1. We then create the similarity matrix $D$ from $M$ by applying three functions consecutively. $D$ is symmetric with entries between 0 and 1. Entries corresponding to classes that are commonly confused (or predicted correctly; i.e., the diagonal entries) are close to 0, while classes that are rarely confused have entries close to 1, so $D$ acts as a distance between classes. Finally, we apply a standard agglomerative hierarchical clustering algorithm and plot the associated dendrogram.
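The overall flow from confusion matrix to merge order can be sketched as below. The transformation in `confusion_to_distance` is a simple stand-in of our own (row normalization, symmetrization, and subtraction from 1), not the three functions defined by Silva-Palacios et al., and average linkage is one assumed choice of agglomerative criterion. The class names and counts are made up for illustration.

```python
def confusion_to_distance(M):
    """Symmetric matrix in [0, 1]: 0 on the diagonal, smaller for confused pairs."""
    n = len(M)
    rows = [[x / max(sum(r), 1) for x in r] for r in M]    # row-normalize counts
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            overlap = (rows[i][j] + rows[j][i]) / 2.0       # symmetrize confusion
            D[i][j] = D[j][i] = 1.0 - overlap
    return D


def average_linkage(D, names):
    """Naive agglomerative clustering; returns the merges in the order they occur."""
    clusters = {i: [i] for i in range(len(names))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                d = sum(D[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])    # mean pairwise distance
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append(([names[i] for i in clusters[a]],
                       [names[i] for i in clusters[b]], d))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
names = ["Truck", "Trailer", "Boat"]
M = [[8, 2, 0],
     [3, 7, 0],
     [0, 1, 9]]
merges = average_linkage(confusion_to_distance(M), names)
```

In this toy case the two frequently confused classes merge first and the remaining class joins last. Each recorded merge height is what a dendrogram would plot on its vertical axis.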

We perform this hierarchy learning for both top-1 and top-5 classification decisions (Figures 8 and 9). We observe that some portions of the hierarchy are natural and intuitive: buildings seem to be closely related to other types of buildings, and the same for many types of vehicles, rail cars, and maritime vessels. Other relationships which are less intuitive also occur; for example, Helipad is closely related to both Truck Tractor and Truck in the top-1 hierarchy. One hypothesis is that most Helipads in xView are made of concrete with large, rectangular painted lines, while Truck Tractors and Trucks both are large, rectangular vehicles that are often surrounded by concrete.

4 Summary

In this paper, we compare UFL against similar unsupervised and supervised architectures for classification of objects in xView. We observe that UFL fine-tuned from ImageNet dramatically outperforms autoencoder models and approaches supervised model performance in top-5 accuracy. We also show that UFL trained models can be used for visual similarity searches that perform well on both common and rare or low-shot classes. Additionally, we use UFL to identify outliers within the labeled xView data set and to learn a visual class hierarchy automatically.

In an upcoming paper, the authors will implement a hierarchical classifier using a Bayesian factor graph to estimate posterior probabilities over a spectrum of classes, ranging from very general at the top of the hierarchy to very specific at the bottom of the hierarchy. The goal is to improve accuracy given the benefit of a hierarchy, especially in cases of rare or low-shot classes.

Acknowledgements.

This work was supported by NGA / NVIDIA Cooperative Research and Development Agreement HM0476CRFY17007, NGA / Etegent Technologies Ltd. contract HM047618C0071, and NRO / CACI contract 12-D-0227. It is approved for public release by National Geospatial-Intelligence Agency #19-863. We gratefully acknowledge the support of NVIDIA for donating a DGX-1 from its PSG cluster for this research. We also thank Kyle Pula and Jonathan Howe for many helpful discussions and comments.

Appendix A Unsupervised Classification Results

Bibliography

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pp. 1097–1105, Curran Associates Inc., (USA), 2012.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[3] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[4] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, Y. Bulatov, and B. McCord, “xView: Objects in context in overhead imagery,” arXiv:1802.07856, 2018.
[5] “DIUx xView 2018 detection challenge.” http://xviewdataset.org. Accessed: 2019-05-20.
[6] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft COCO: Common objects in context,” in The European Conference on Computer Vision (ECCV), 2014.
[7] N. Sergievskiy and A. Ponamarev, “Reduced focal loss: 1st place solution to xView object detection in satellite imagery,” arXiv:1903.01347, 2019.
[8] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in The IEEE International Conference on Computer Vision (ICCV), October 2017.