Unsupervised Feature Learning in Remote Sensing
Aaron Reite, Scott Kangas, Zackery Steck, Steven Goley, Jonathan Von Stroh, and Steven Forsyth

TL;DR
This paper demonstrates the application of a state-of-the-art unsupervised learning algorithm to remote sensing data, enabling effective feature extraction for various tasks without requiring labeled data.
Contribution
It introduces an unsupervised feature learning approach tailored for remote sensing data, capable of handling noisy, imbalanced datasets and multiple tasks.
Findings
Effective visual similarity search across classes
Successful outlier detection in imbalanced data
Automatic learning of class hierarchies
Abstract
The need for labeled data is among the most common and well-known practical obstacles to deploying deep learning algorithms to solve real-world problems. The current generation of learning algorithms requires a large volume of data labeled according to a static and pre-defined schema. Conversely, humans can quickly learn generalizations based on large quantities of unlabeled data, and turn these generalizations into classifications using spontaneous labels, often including labels not seen before. We apply a state-of-the-art unsupervised learning algorithm to the noisy and extremely imbalanced xView data set to train a feature extractor that adapts to several tasks: visual similarity search that performs well on both common and rare classes; identifying outliers within a labeled data set; and learning a natural class hierarchy automatically.
Table 1: Key statistics of the image-chip classification data set.

| Statistic | Training | Test |
|---|---|---|
| Minimum Chips per Class | 17 | 2 |
| Median Chips per Class | 629 | 125 |
| Maximum Chips per Class | 307221 | 100899 |
| Total Chips | 589119 | 187156 |

Table 2: Class-averaged classification accuracy (%) of each model on the test set.

| Method | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|
| Supervised Random Init. | 35.8 | 50.5 |
| Supervised Fine-tuned | 42.4 | 65.6 |
| Autoencoder Random Init. | 3.8 | 20.5 |
| Autoencoder Fine-tuned | 3.6 | 19.9 |
| UFL Random Init. | 6.8 | 28.9 |
| UFL Pre-trained, not Fine-tuned | 12.9 | 47.6 |
| UFL Fine-tuned | 18.3 | 54.5 |

Table 3: Class-by-class results (%) for the fine-tuned ResNet18 UFL model.

| Class | Train Population | Test Population | Top-1 | Top-5 |
|---|---|---|---|---|
| Building | 307221 | 100899 | 98.32 | 99.89 |
| Building Aircraft Hangar | 180 | 97 | 4.12 | 64.95 |
| Building Damaged | 1036 | 306 | 1.31 | 44.77 |
| Building Facility | 823 | 316 | 2.85 | 71.2 |
| Building Hut-Tent | 703 | 126 | 0 | 33.33 |
| Building Shed | 1176 | 355 | 0.28 | 29.86 |
| Construction Site | 1033 | 459 | 67.54 | 90.85 |
| Container Lot | 2120 | 871 | 39.95 | 81.17 |
| Engineering Vehicle (EV) | 204 | 46 | 4.35 | 21.74 |
| EV Cementmixer | 287 | 87 | 2.3 | 49.43 |
| EV Container Crane | 159 | 46 | 2.17 | 54.35 |
| EV Crane | 173 | 46 | 4.35 | 10.87 |
| EV Dump Truck | 1344 | 349 | 4.01 | 45.85 |
| EV Excavator | 830 | 326 | 48.77 | 77.91 |
| EV Grader | 83 | 39 | 2.56 | 35.9 |
| EV Haul Truck | 325 | 40 | 35 | 95 |
| EV Loader | 626 | 370 | 9.46 | 55.68 |
| EV Mobile Crane | 313 | 76 | 0 | 26.32 |
| EV Reach Stacker | 69 | 30 | 0 | 20 |
| EV Straddle Carrier | 57 | 63 | 0 | 30.16 |
| EV Tower Crane | 144 | 56 | 0 | 42.86 |
| EV Tractor-Scraper | 78 | 19 | 0 | 5.26 |
| Fixed Wing Aircraft (FWA) | 73 | 39 | 7.69 | 25.64 |
| FWA Cargo | 633 | 323 | 82.35 | 97.21 |
| FWA Small | 354 | 110 | 48.18 | 84.55 |
| Helicopter | 68 | 94 | 10.64 | 37.23 |
| Helipad | 120 | 55 | 38.18 | 69.09 |
| Maritime Vessel (MV) | 633 | 143 | 12.59 | 57.34 |
| MV Barge | 171 | 62 | 0 | 35.48 |
| MV Container | 271 | 78 | 38.46 | 92.31 |
| MV Ferry | 183 | 16 | 12.5 | 56.25 |
| MV Fishing | 723 | 289 | 13.15 | 50.52 |
| MV Motor | 1447 | 154 | 22.08 | 59.09 |
| MV Oil | 64 | 23 | 8.7 | 65.22 |
| MV Sail | 692 | 24 | 20.83 | 50 |
| MV Tug | 209 | 45 | 8.89 | 75.56 |
| MV Yacht | 430 | 7 | 0 | 42.86 |
| Passenger Vehicle (PV) | 2949 | 1299 | 1.08 | 48.42 |
| PV Bus | 6865 | 2047 | 25.6 | 81.63 |
| PV SmallCar | 210827 | 61105 | 98.3 | 99.91 |
| PV Pickup | 1101 | 393 | 0 | 49.36 |
| Pylon | 349 | 124 | 41.13 | 80.65 |
| Rail Vehicle (RV) | 17 | 2 | 0 | 0 |
| RV Cargo | 1811 | 1843 | 64.35 | 91.05 |
| RV Flat | 123 | 70 | 28.57 | 67.14 |
| RV Locomotive | 116 | 37 | 0 | 24.32 |
| RV Passenger | 1567 | 383 | 27.68 | 81.2 |
| RV Tank | 120 | 83 | 36.14 | 80.72 |
| Shipping Container | 1570 | 612 | 2.29 | 50 |
| Storage Tank | 1625 | 807 | 40.89 | 82.78 |
| Tower | 84 | 19 | 0 | 10.53 |
| Truck | 12052 | 4369 | 7.87 | 78.51 |
| Truck w/ Box Trailer | 3562 | 767 | 10.3 | 58.41 |
| Truck Cargo | 5857 | 2035 | 2.01 | 57.05 |
| Truck w/ Flatbed Trailer | 883 | 262 | 1.15 | 23.66 |
| Truck w/ Liquid Trailer | 145 | 55 | 0 | 7.27 |
| Truck Tractor Trailer | 857 | 277 | 6.5 | 23.83 |
| Truck Trailer | 4045 | 1254 | 4.31 | 48.64 |
| Truck Utility | 3603 | 1705 | 0.35 | 49.85 |
| Vehicle Lot | 3936 | 1124 | 46.71 | 91.28 |
Aaron Reite (a, *), Scott Kangas (b, *), Zackery Steck (b), Steven Goley (b), Jonathan Von Stroh (c), and Steven Forsyth (d)
(a) NGA Research, 7500 GEOINT Dr, Springfield, VA, USA
(b) Etegent Technologies Ltd, 5050 Section Ave, Suite 110, Cincinnati, OH, USA
(c) CACI, 15955 E Centretech Pkwy, Aurora, CO, USA
(d) NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA, USA
(*) These authors contributed equally to this work.
Further author information: send correspondence to A.R. and S.K.
Keywords: remote sensing, unsupervised learning, deep learning, classification, similarity search, anomaly detection, hierarchy discovery
1 Introduction
In recent years, the state of the art for nearly every benchmark task in computer vision has been achieved with deep convolutional neural networks (CNNs) trained via supervised learning on large, labeled data sets.[1, 2] However, obtaining high-quality labeled data sets at the scale required for successfully training deep CNNs is costly, or even impossible, for many real-world computer vision applications, such as those requiring proprietary data, expert human data labeling, or those with limited real-world examples.
Additionally, most of these CNNs are trained in static scenarios where a fixed number of labels are encountered, and these labels are consistent between training and testing. In practice, this “closed-world” assumption is often violated: instances may contain multiple classes making hard labels problematic; the domain may change as new classes arise; more fine-grained labels may be desired, etc. Transfer learning is a common method to address these issues; starting with a trained model, the final layers are removed and replaced with randomly initialized layers before undergoing additional training to adapt to the new data distribution. Of course, this new model requires even more labeled data and is subject to its own “closed-world” so is similarly brittle if the test distribution changes again.
In this paper, we investigate unsupervised learning techniques applied to remotely sensed imagery, with the dual goal of training a feature extractor that is readily adaptable to new tasks and that does not depend on any labeled data for training. Specifically, we use Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination (UFL) developed by Wu, Xiong, Yu and Lin at the University of California Berkeley, which surpassed the state-of-the-art unsupervised learning methods for classification on ImageNet 1K in 2018[3]. The authors of UFL observed that classes that appear visually correlated generally receive higher softmax output scores than classes that are visually uncorrelated (e.g., Figure 1). Based on this observation, the authors developed an unsupervised learning approach to discriminate between individual instances, completely ignoring class labels and allowing the network to learn similarity between instances without a need for semantic categories. Treating each instance as a class results in computational challenges: for ImageNet the number of “classes” expands from the original 1000 to 1.2 million, i.e. the number of images in the training set. The authors of UFL address these computational challenges with a low-dimensional memory bank and a noise-contrastive estimate of their non-parametric softmax classifier. We discuss the UFL method in detail in Section 2.2.
The data we use in our experiments is derived from the xView object detection data set[4]. Although originally intended to advance the state-of-the-art for object detection in overhead imagery, we use the provided annotation bounding boxes to extract image chips around each object for the purpose of classification. Further detail about xView and our extracted image chips is covered in Section 2.1.
We use our UFL-trained feature extractor for several diverse computer vision tasks. First, we emulate the UFL paper’s unsupervised classification experiment on our data in Section 3.1. We find that UFL greatly outperforms a common competing unsupervised method (autoencoder), and even beats a fully supervised counterpart in top-5 accuracy. We then demonstrate the generalizability of our UFL-trained feature extractor by applying it to three disparate tasks: we show that a UFL-trained feature extractor may be used in a similarity search algorithm that performs well even with highly imbalanced classes in Section 3.2; we identify outliers and errors in xView class labels in Section 3.3; and we learn a visual hierarchy automatically in Section 3.4.
2 Approach
2.1 Data Set
xView is the Defense Innovation Unit (DIU) and the National Geospatial-Intelligence Agency’s (NGA) large-scale detection data set that contains 1,413 km² of 3-band panchromatic sharpened WorldView-3 satellite imagery. The images were collected at 0.3 m ground sample distance, with 60 types of objects and land use categories exhaustively labeled with axis-aligned bounding boxes, for a total of 1M+ bounding boxes.[4] During the 2018 xView Detection Challenge, a significant portion of these data were publicly released: 847 1 km² images with 601,937 bounding box labels for training, and 281 additional 1 km² images without labels[5]. We procured the bounding box labels for these 281 additional images for our research, which we reserved for testing.
The objects in xView present unique challenges that differ from standard detection tasks such as Microsoft’s Common Objects in Context (COCO)[6]. For example, most of the xView objects are very small (10s of pixels) and have no standard orientation—they are rotated in different directions. Perhaps most challenging is the extreme class imbalance present in xView. If the data were class balanced, each of the 60 classes would have 10K examples for training. However, just two classes (Small Car and Building) account for almost 90% of the training data, so most classes have far fewer examples: many have 100s of examples, while some have just 10s. Figure 2 demonstrates xView’s extreme class imbalance. The first place solution to the 2018 xView Detection Challenge cites class imbalance as the most difficult obstacle to building a successful detector, and develops a novel loss function based on Focal Loss to assist[7, 8].
We created a classification data set from xView by cropping square image chips centered around each bounding box, using the box’s longest dimension. We chose square chips to allow ingestion into standard CNNs without resizing each dimension independently, and to allow some neighboring pixels for each object in the chip to provide context. For example, boat chips include neighboring water pixels and car chips include neighboring road pixels. We explored alternative methods to create image chips that would allow varying amounts of context and still provide square chips, such as taking rectangular crops proportionately longer than each side of the bounding box and then zero-padding to make a square; however, we found no measurable benefit in our experiments to these more complicated methods. We discarded chips that cross image boundaries and therefore contain incomplete objects. Table 1 contains a few key statistics about our data set.
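The chipping rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `square_chip_bounds`, the `(xmin, ymin, xmax, ymax)` pixel box layout, and the rounding of the chip origin are all assumptions made for the example.

```python
def square_chip_bounds(box, image_width, image_height):
    """Return (x0, y0, x1, y1) of a square chip centered on `box`, or None.

    `box` is (xmin, ymin, xmax, ymax) in integer pixels (assumed layout).
    The chip side equals the longer side of the box, so the shorter
    dimension picks up neighboring context pixels.
    """
    xmin, ymin, xmax, ymax = box
    side = max(xmax - xmin, ymax - ymin)
    cx = (xmin + xmax) / 2.0
    cy = (ymin + ymax) / 2.0
    x0 = int(round(cx - side / 2.0))
    y0 = int(round(cy - side / 2.0))
    x1, y1 = x0 + side, y0 + side
    # Discard chips that cross the image boundary (incomplete objects).
    if x0 < 0 or y0 < 0 or x1 > image_width or y1 > image_height:
        return None
    return (x0, y0, x1, y1)


inside = square_chip_bounds((10, 20, 30, 26), 100, 100)
at_edge = square_chip_bounds((0, 0, 4, 20), 100, 100)
```

Here the 20 x 6 pixel box becomes a 20 x 20 chip, while the box touching the left image edge is rejected because its square chip would extend outside the image.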
Like many computer vision data sets, xView contains labeling errors. While deep learning has proven to be robust to labeling errors in well-sampled classes, the effect of these errors increases dramatically in classes containing few instances. As an extreme example, our test set for Railway Vehicle contains just two instances, both of which are labeled incorrectly (they are identifiable as specific types of railway vehicles). Nonetheless, after extensive experimentation with customized classes derived from the original 60, we eventually elected to keep all 60 classes as designated by the original xView paper to ease repeatability. Accordingly, our results may be significantly improved by eliminating a few very small or erroneous classes (e.g., Railway Vehicle), merging a few classes that are visually indistinct (e.g., Small Car and Passenger Vehicle), and splitting a few classes that contain visually distinct sub-classes (e.g., Tower).
2.2 Unsupervised Feature Learning (UFL)
Figure 3 shows the overall pipeline for the UFL approach. A standard CNN is used to embed each image as a feature vector, which is then normalized prior to being passed to a non-parametric softmax classifier for instance level discrimination (described in Section 2.2.1). The feature embedding is trained to maximally distribute the embedded features over the unit hypersphere. For evaluation, the authors implemented a KNN classifier using cosine similarity between the feature vector for a test image and those for the images used during training.
Benefits of this unsupervised approach are two-fold: annotations are not needed at training time thereby eliminating the burdensome task of data labelling to train a feature extractor, and the method is agnostic to network architecture so it can be implemented on any current or future state-of-the-art network design.
2.2.1 Non-Parametric Instance Discrimination
Let $f_\theta$ be a CNN with parameters $\theta$ mapping images $x$ to feature vectors $v = f_\theta(x)$. A conventional parametric softmax classifier will classify image $x$ as class $i$ with probability:
$$P(i \mid v) = \frac{\exp(w_i^T v)}{\sum_{j=1}^{C} \exp(w_j^T v)} \tag{1}$$
where $C$ is the number of classes, $w_i$ is a learned weight vector for class $i$, and the inner product $w_i^T v$ computes the similarity between $v$ and class $i$.
Instead of computing the similarity between a feature vector $v$ and the class weight vectors $w_j$, which may be interpreted as class prototypes, the authors of UFL modified Equation 1 to compute the similarity of $v$ to other feature vectors: $v_j^T v$. As all feature vectors have unit norm, this is the cosine similarity between $v$ and $v_j$. Thus, in UFL Equation 1 becomes:
$$P(i \mid v) = \frac{\exp(v_i^T v / \tau)}{\sum_{j=1}^{n} \exp(v_j^T v / \tau)} \tag{2}$$
where $\tau$ is a parameter controlling the concentration of the $v_i$ on the unit sphere and $n$ is the total number of instances in the training set. $P(i \mid v)$ approaches 1 if $v$ is close to $v_i$ and far away from all other $v_j$, or approaches 0 if $v$ is far away from $v_i$ (so long as $\tau$ is chosen to be sufficiently small: we used the UFL paper’s recommended value $\tau = 0.07$).
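To make the non-parametric softmax concrete, the sketch below evaluates the instance probabilities for one query vector against a tiny list of stored unit-norm vectors. The function name `instance_probabilities` and the toy vectors are illustrative assumptions; the temperature default of 0.07 is the UFL paper's recommended value. A real implementation would run on batched tensors rather than Python lists.

```python
import math


def instance_probabilities(v, stored_vectors, tau=0.07):
    """P(i | v) over all stored unit-norm vectors (non-parametric softmax)."""
    # Cosine similarity reduces to a dot product because vectors are unit norm.
    logits = [sum(a * b for a, b in zip(v_i, v)) / tau for v_i in stored_vectors]
    top = max(logits)  # subtract the max before exponentiating, for stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: the query coincides with the first stored vector.
stored = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
probs = instance_probabilities((1.0, 0.0), stored)
```

With such a small temperature, nearly all of the probability mass lands on the stored vector that matches the query, which is the behavior the text describes for small values of the temperature.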
Unfortunately, calculating the non-parametric softmax in Equation 2 may be computationally prohibitive when each image is considered its own class, e.g. ImageNet has 1.2M annotated images, all of which require feature vectors for the computation. Instead of repeatedly computing each feature vector $v_j$, the feature vectors are stored in a memory bank $V = \{v_j\}$. The dimension of the feature space is chosen to be relatively low: the unit hypersphere in $\mathbb{R}^{128}$ (in both our work and the UFL paper, differing values for the feature dimension were explored, but none offered significant benefit). This choice allows all 1.2M feature vectors in ImageNet to require only 600 MB of memory.
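As a rough check of the memory figure, assume each of the 128 components is stored as a 32-bit (4-byte) float; the storage precision is an assumption, as the text does not state it:

```latex
1.2 \times 10^{6}\ \text{vectors} \times 128\ \text{dimensions} \times 4\ \text{bytes}
  \approx 6.1 \times 10^{8}\ \text{bytes} \approx 600\ \text{MB}
```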
The learning objective is to minimize the negative log-likelihood over the training set:
$$J(\theta) = -\sum_{i=1}^{n} \log P(i \mid f_\theta(x_i)) \tag{3}$$
The loss therefore depends on how close each $f_\theta(x_i)$ is to $x_i$’s feature vector from the previous iteration, $v_i$, as well as how far $f_\theta(x_i)$ is from all the other feature vectors $v_j$. As such, a minimal solution for $J$ will result in an equidistribution of the $v_i$ over the sphere. As $f_\theta$ has a finite number of convolutional filters and the embedding space is compact, a minimal solution forces the $x_i$ which activate the same convolutional filters (i.e., are visually similar) to be close together. During each learning iteration, all network parameters $\theta$ and the feature vector $f_\theta(x_i)$ are updated via stochastic gradient descent and $v_i$ is replaced with $f_\theta(x_i)$.
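The sketch below evaluates the negative log-likelihood for a set of current embeddings against the previous iteration's memory bank, then refreshes the bank. It is only an illustration of the bookkeeping: the name `nll_and_refresh` is hypothetical, the gradient step on the network parameters is omitted entirely, and the exact full-sum softmax is used instead of the noise-contrastive approximation.

```python
import math


def nll_and_refresh(features, memory_bank, tau=0.07):
    """Return (loss, new_bank) for one pass over the data.

    features[i]    : current unit-norm embedding of image i
    memory_bank[i] : embedding of image i stored at the previous iteration
    """
    loss = 0.0
    for i, f in enumerate(features):
        logits = [sum(a * b for a, b in zip(v, f)) / tau for v in memory_bank]
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        loss -= logits[i] - log_z  # adds -log P(i | f_i)
    # Replace every stored vector with the freshly computed embedding.
    new_bank = [tuple(f) for f in features]
    return loss, new_bank


bank = [(1.0, 0.0), (0.0, 1.0)]
aligned_loss, _ = nll_and_refresh([(1.0, 0.0), (0.0, 1.0)], bank)
swapped_loss, refreshed = nll_and_refresh([(0.0, 1.0), (1.0, 0.0)], bank)
```

The loss is near zero when each embedding still matches its own stored vector and is far from the others, and it grows sharply when an embedding lands on another instance's stored vector.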
The burden of computing Equations 2 and 3 may be reduced even further, and dramatically, by employing noise-contrastive estimation to replace the denominator of Equation 2 with a constant computed from a Monte Carlo approximation during the initial few batches[9]. This reduces the learning objective to a much simpler form. We use the same normalizing constant and computational approximations as the UFL authors and refer the reader to Section 3.2 of Wu et al. for additional details[3].
Note that if random augmentation is employed during training, such as flips, rotations, color jitter, etc., then each feature vector $v_i$ may be interpreted as a class prototype for the class created from image $x_i$ by applying random augmentations. As such, $f_\theta(x_i)$ need not equal $v_i$ even if the network parameters $\theta$ remain unchanged. This provides a consistent learning signal to teach the network to become invariant to augmentations, rather than just spread the feature vectors evenly across the sphere.
2.2.2 Weighted KNN Classification
In order to classify an image $\hat{x}$ in our validation set, we first compute its feature $\hat{f} = f_\theta(\hat{x})$ and compare it to all of the feature vectors using cosine similarity: $s_i = \cos(v_i, \hat{f})$. We then take the $k$ feature vectors in the memory bank that are closest to $\hat{f}$ with respect to cosine similarity (the $k$ nearest neighbors, $N_k$). Finally, $\hat{x}$ is classified by a weighted voting of the classes in $N_k$; specifically, if we denote the class of image $x_i$ corresponding to memory bank vector $v_i$ by $c_i$, then class $c$’s vote is:
$$w_c = \sum_{i \in N_k} \exp(s_i / \tau) \cdot \mathbb{1}(c_i = c) \tag{4}$$
where $\tau$ controls how $v_i$’s distance from $\hat{f}$ affects its vote. With $\tau = 0.07$, as per training, a feature vector that is close to $\hat{f}$ will count 6x more than a feature vector that is 30 degrees away. Nonetheless, $k$ must be chosen to limit the size of $N_k$, particularly in the case of imbalanced classes; otherwise, the vote will be dominated by the most populous classes despite small values of $\tau$. We use .
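A minimal sketch of the weighted vote follows. The function name `weighted_knn_ranking`, the toy vectors, and the choice of three neighbors in the example are illustrative assumptions, so the neighbor count is left as a required argument rather than given a default. A brute-force scan stands in for whatever search structure a real system would use.

```python
import math
from collections import defaultdict


def weighted_knn_ranking(query, memory_bank, labels, k, tau=0.07):
    """Rank classes by the exp(similarity / tau) votes of the k nearest vectors."""
    sims = [sum(a * b for a, b in zip(v, query)) for v in memory_bank]
    nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[labels[i]] += math.exp(sims[i] / tau)
    # First entry is the top-1 prediction; the first five give top-5.
    return sorted(votes, key=votes.get, reverse=True)


c30, s30 = math.cos(math.radians(30)), math.sin(math.radians(30))
bank = [(1.0, 0.0), (c30, s30), (c30, -s30), (0.0, 1.0)]
labels = ["car", "bus", "bus", "tank"]
ranking = weighted_knn_ranking((1.0, 0.0), bank, labels, k=3)

# Weight ratio between an exact match and a neighbor 30 degrees away.
ratio = math.exp((1.0 - c30) / 0.07)
```

In this toy case a single exact match outvotes two neighbors that sit 30 degrees away, because each of those neighbors carries between one sixth and one seventh of the exact match's weight.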
3 Experiments
In all of our experiments, we first learn a robust feature extractor in an unsupervised setting by applying UFL to our data set. The performance of this feature extractor is judged by the top-5 unsupervised classification score on our test set.
3.1 Unsupervised Classification
We use ResNet18 as the backbone CNN with a low dimension of 128. As noted, we use and , with an initial learning rate of . We quickly discovered that pre-training on ImageNet results in dramatic improvements; in fact, pre-training on ImageNet without fine-tuning readily beats training for 200 epochs from randomly initialized weights, clearly demonstrating UFL’s robustness to domain adaptation. Fine-tuning requires a smaller learning rate and an aggressive decay schedule (we use a learning rate of and decay by a factor every two epochs). In addition to our UFL models, we train two additional ResNet18-based models in order to establish baseline accuracy performance:
- An autoencoder with reconstruction loss and a 128-dimensional embedding space. During evaluation the decoder stage of the network was removed and the weighted KNN classification described in Section 2.2.2 was implemented.
- A supervised model using softmax and cross-entropy loss and class-balanced sampling. This model represents the standard technique given the benefit of fully labeled data and should represent a ceiling on unsupervised performance.
All models were trained with random initialization as well as fine-tuning after being pre-trained on ImageNet.
Class-balanced sampling at train time as we used in our supervised model significantly improves our UFL results; however, we chose not to use this technique as it requires prior knowledge of the class labels and is contrary to the assumptions of unsupervised learning.
Given the large disparity in class populations in our data set, we report our top-1 and top-5 results averaged over the 60 classes, instead of averaged over all instances. This is necessary because, as noted, the Small Car and Building classes alone account for 88% of our data (36% and 52% respectively). Given a building in the test set, a random guess using the distribution of our training data will result in a correct classification 52% of the time and a correct classification in 5 guesses (i.e., top-5) 97% of the time. As such, a random network is expected to produce top-1 and top-5 scores of 40.1 and 83.2 (respectively) when averaged over all images, but only 1.7 and 4.1 (respectively) when averaged over all 60 classes. Table 2 summarizes the results from our classification models on our data set.
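The difference between the two averaging schemes can be illustrated with a short sketch. The function name and the ten-sample toy labels are assumptions chosen for the example; they are not drawn from the xView results.

```python
from collections import defaultdict


def instance_and_class_accuracy(y_true, y_pred):
    """Return (accuracy over all instances, accuracy averaged over classes)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    instance_acc = sum(correct.values()) / len(y_true)
    class_acc = sum(correct[c] / total[c] for c in total) / len(total)
    return instance_acc, class_acc


# A degenerate classifier that always answers with the majority class.
y_true = ["Building"] * 9 + ["Helicopter"]
y_pred = ["Building"] * 10
instance_acc, class_acc = instance_and_class_accuracy(y_true, y_pred)
```

On this toy data the majority-class guesser scores 0.9 when averaged over instances but only 0.5 when averaged over classes, which is why the class average is the more informative number for heavily imbalanced data.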
All of the UFL models beat the autoencoders by significant margins. Notably, pre-training UFL on ImageNet without fine-tuning on xView (UFL Pre-trained, not Fine-tuned) more than doubles the autoencoders’ scores and produces a top-5 score comparable to our randomly initialized fully-supervised model. If fine-tuned, UFL beats our randomly initialized fully-supervised model’s top-5 score by 4 points (54.5 versus 50.5, a relative gain of 8%), despite having no labeled data during training.
To examine the effects of increasing the backbone CNN’s depth and capacity, we trained UFL using ResNet50, which resulted in higher scores: top-1 = 20.2 and top-5 = 55.2 (fine-tuned from ImageNet). We hypothesize that even deeper CNN backbones will result in better accuracy. However, the purpose of our work is to demonstrate the generalizability of UFL when applied to diverse tasks in remote sensing imagery, not to set or beat benchmark scores, so we use ResNet18 in our following experiments as it is nearly as accurate but much faster, with many fewer parameters. Detailed class-by-class results for our fine-tuned ResNet18 UFL model are included in Table 3 within Appendix A.
3.2 Similarity Search
The challenge of similarity search, or image retrieval, is this: given a query image as input, return the image in a data set most similar to the query. Traditionally, Content-Based Image Retrieval (CBIR) has required large labelled data sets for training and expensive feature-point calculations to extract important visual information[10]. However, this expectation is becoming infeasible in a data-driven world where data sets may be petabytes in size; in such cases unsupervised deep learning approaches are preferable. The appeal of using UFL for image retrieval is that it does not require knowledge of class labels for training or image retrieval, thus it is completely unsupervised. Likewise, the foundation of UFL is that visually similar objects will be close together in the embedding space, making it an ideal candidate for CBIR.
We implement an image-retrieval algorithm using the fine-tuned UFL and autoencoder networks described in Section 3.1 by modifying the weighted KNN classifier described in Section 2.2.2. Instead of collecting the nearest class labels for a test image (i.e. query), we collect the nearest instance indices, which are used to retrieve their respective images from the training data (i.e. query results). For both networks, we return the nearest five images for four query classes—Building, Small Car, Helicopter, and Aircraft. Helicopter and Aircraft are considered rare or low-shot classes, containing only 68 and 73 instances per class respectively.
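The retrieval variant can be sketched by returning instance indices rather than class votes. The name `retrieve_nearest` and the toy vectors are illustrative assumptions, and a brute-force scan is used here in place of any indexing structure.

```python
def retrieve_nearest(query, memory_bank, k=5):
    """Return (index, cosine similarity) for the k stored vectors nearest the query.

    Vectors are assumed to be unit norm, so the dot product is the cosine
    similarity. The indices point back into the training set, from which
    the corresponding image chips would be fetched.
    """
    sims = [sum(a * b for a, b in zip(v, query)) for v in memory_bank]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return [(i, sims[i]) for i in order[:k]]


bank = [(1.0, 0.0), (0.0, 1.0), (0.8, 0.6), (-1.0, 0.0)]
results = retrieve_nearest((1.0, 0.0), bank, k=2)
```

No class labels are consulted at any point, which matches the claim that both training and retrieval are fully unsupervised.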
Figure 4 shows the UFL architecture performs well for the Building (row 1), Small Car (row 2), and Aircraft queries (row 4). Although the results for the Helicopter query (row 3) are incorrect, they all contain multiple visual similarities: dark orthogonal lines or shadows similar to rotors and small enclosed objects similar to a helicopter’s fuselage. It’s worth noting that the Aircraft query achieved acceptable results with only five more samples in its source class than the Helicopter class.
Figure 5 shows that the autoencoder obtains similar results for both Building and Small Car queries, but is outperformed by UFL in the rare or low-shot classes. The Helicopter query (row 3) and Aircraft query (row 4) lack any visual similarity to the query image beyond background color—reconstruction loss fails to embed visually similar objects in close proximity.
In addition to the similarity search using the xView dataset, we train UFL on the UC Merced Land Use Dataset and project the resultant feature vectors into a 2-D map, using t-SNE [11, 12, 13]. As shown in Figure 6, images with similar content have feature vectors that are close in the embedding space.
3.3 Identifying Outliers
Procuring large annotated data sets often results in noisy labels, as we discussed for xView in Section 2.1. We use our UFL-trained network to assist in identifying potentially erroneous labels. As UFL does not make use of labels during training, it will not attempt to tighten intra-class samples nor will it repel inter-class samples; as such, it is a great candidate to detect outliers from a training set. For this purpose, we (1) train UFL in the standard manner (fine-tuned from ImageNet), (2) create a tree for the training feature vector set, (3) compute the distance to nearest neighbor intra-class feature vectors for each instance, (4) compute the intra-class mean of all such nearest neighbor distances, and (5) identify instances with a distance greater than two standard deviations from their intra-class mean.
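Steps (3) to (5) can be sketched as below, starting from feature vectors that are already extracted. Several choices here are assumptions rather than details from the text: a brute-force scan replaces the tree, Euclidean distance is used, only distances above the mean are flagged, the population standard deviation is used, and classes with fewer than three members are skipped. The name `flag_outliers` is hypothetical.

```python
import math
from statistics import mean, pstdev


def flag_outliers(features, labels, n_std=2.0):
    """Return sorted indices whose same-class nearest-neighbor distance is unusually large."""
    members_by_class = {}
    for idx, label in enumerate(labels):
        members_by_class.setdefault(label, []).append(idx)

    flagged = []
    for members in members_by_class.values():
        if len(members) < 3:
            continue  # too few instances to estimate a spread (assumed rule)
        # Step 3: distance from each instance to its nearest same-class neighbor.
        nn_dist = [
            min(math.dist(features[i], features[j]) for j in members if j != i)
            for i in members
        ]
        # Step 4: intra-class mean (and spread) of those distances.
        mu, sigma = mean(nn_dist), pstdev(nn_dist)
        # Step 5: flag instances more than n_std deviations above the mean.
        flagged += [i for i, d in zip(members, nn_dist) if d > mu + n_std * sigma]
    return sorted(flagged)


feats = [(float(x), 0.0) for x in range(6)] + [(50.0, 0.0), (0.0, 9.0), (0.0, 10.0)]
labs = ["a"] * 7 + ["b"] * 2
outliers = flag_outliers(feats, labs)
```

In the toy data, six evenly spaced points and one distant point share a class; only the distant point is flagged, and the two-member class is skipped.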
Figure 7 shows some examples of outliers found using this methodology. This technique is able to identify incorrectly labeled instances as well as anomalous but correctly labeled instances, such as those that are obscured, crowded or appear with unusual backgrounds.
3.4 Unsupervised Learning of Visual Hierarchies
A class hierarchy is a directed, acyclic graph where each class is a node, and edges between nodes express the relationship “is a”; for example, a Locomotive is a Rail Vehicle. Many human-created hierarchies have been used for machine learning purposes, such as WordNet (from which ImageNet labels are derived), but these hierarchies do not always make sense from a computer vision perspective as they often use non-visual relationships[14]. For example, a human may create a hierarchy in which Trailer is close to Truck Tractor w/ Box Trailer, but far from Shipping Container. However, from the computer vision perspective, Trailer and Shipping Container are nearly identical—often they may only be distinguished if a trailer’s wheels are visible, which may not be possible given the orientation of an imaging satellite.
We follow the method for learning hierarchies from a classifier’s predictions developed by Silva-Palacios, Ferri, and Ramírez-Quintana at the Universitat Politècnica de València[15]. For brevity, we only describe the method at a high level: consult Section 2.2 of Silva-Palacios’ paper for details. First, we create a confusion matrix from UFL classification predictions, as discussed in Section 3.1. We then create a similarity matrix from the confusion matrix by applying three functions consecutively. The similarity matrix is symmetric with entries between 0 and 1. Entries corresponding to classes that are commonly confused (or predicted correctly; i.e., the diagonal entries) are close to 0, while classes that are rarely confused have entries close to 1. Finally, we apply a standard agglomerative hierarchical clustering algorithm and plot the associated dendrogram.
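The pipeline can be sketched as follows. The three transformations used here (row normalization, symmetrization by averaging, and subtraction from 1) are an assumed reading that matches the properties stated above; they are not confirmed to be the exact functions of the cited method. Average linkage is likewise an assumed choice, and the function names and the three-class confusion matrix are invented for the example.

```python
def confusion_to_distance(conf):
    """Turn a confusion matrix (rows = true class) into a symmetric matrix in [0, 1]."""
    n = len(conf)
    rows = [[c / max(sum(r), 1) for c in r] for r in conf]  # row-normalize
    # Average the two directions of confusion, then flip so that
    # frequently confused classes end up close to 0.
    return [[1.0 - (rows[i][j] + rows[j][i]) / 2.0 for j in range(n)]
            for i in range(n)]


def agglomerate(dist, names):
    """Average-linkage clustering; returns merges as (group_a, group_b, distance)."""
    clusters = {i: [i] for i in range(len(names))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append(([names[i] for i in clusters[a]],
                       [names[i] for i in clusters[b]], d))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


names = ["Truck", "Trailer", "Helicopter"]
conf = [[6, 4, 0], [3, 7, 0], [0, 1, 9]]
merges = agglomerate(confusion_to_distance(conf), names)
```

In this toy matrix Truck and Trailer are often mistaken for each other, so they merge first; Helicopter joins only at the final step. The sequence of merges and their distances is what a dendrogram plots.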
We perform this hierarchy learning for both top-1 and top-5 classification decisions (Figures 8 and 9). We observe that some portions of the hierarchy are natural and intuitive: buildings seem to be closely related to other types of buildings, and the same for many types of vehicles, rail cars, and maritime vessels. Other relationships which are less intuitive also occur; for example, Helipad is closely related to both Truck Tractor and Truck in the top-1 hierarchy. One hypothesis is that most Helipads in xView are made of concrete with large, rectangular painted lines, while Truck Tractors and Trucks both are large, rectangular vehicles that are often surrounded by concrete.
4 Summary
In this paper, we compare UFL against similar unsupervised and supervised architectures for classification of objects in xView. We observe that UFL fine-tuned from ImageNet dramatically outperforms autoencoder models and approaches supervised model performance in top-5 accuracy. We also show that UFL trained models can be used for visual similarity searches that perform well on both common and rare or low-shot classes. Additionally, we use UFL to identify outliers within the labeled xView data set and to learn a visual class hierarchy automatically.
In an upcoming paper, the authors will implement a hierarchical classifier using a Bayesian factor graph to estimate posterior probabilities over a spectrum of classes, ranging from very general at the top of the hierarchy to very specific at the bottom of the hierarchy. The goal is to improve accuracy given the benefit of a hierarchy, especially in cases of rare or low-shot classes.
Acknowledgements.
This work was supported by NGA / NVIDIA Cooperative Research and Development Agreement HM0476CRFY17007, NGA / Etegent Technologies Ltd. contract HM047618C0071, and NRO / CACI contract 12-D-0227. It is approved for public release by National Geospatial-Intelligence Agency #19-863. We gratefully acknowledge the support of NVIDIA for donating a DGX-1 from its PSG cluster for this research. We also thank Kyle Pula and Jonathan Howe for many helpful discussions and comments.
Appendix A Unsupervised Classification Results
References
- [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pp. 1097–1105, Curran Associates Inc., (USA), 2012.
- [2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- [3] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [4] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, Y. Bulatov, and B. McCord, “xView: Objects in context in overhead imagery,” arXiv:1802.07856, 2018.
- [5] “DIUx xView 2018 detection challenge.” http://xviewdataset.org. Accessed: 2019-05-20.
- [6] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft COCO: Common objects in context,” in The European Conference on Computer Vision (ECCV), 2014.
- [7] N. Sergievskiy and A. Ponamarev, “Reduced focal loss: 1st place solution to xView object detection in satellite imagery,” arXiv:1903.01347, 2019.
- [8] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in The IEEE International Conference on Computer Vision (ICCV), October 2017.
