Active Transfer Learning Network: A Unified Deep Joint Spectral-Spatial Feature Learning Model For Hyperspectral Image Classification
Cheng Deng, Yumeng Xue, Xianglong Liu, Chao Li, Dacheng Tao

TL;DR
This paper introduces a unified deep learning model that combines active transfer learning and spectral-spatial feature extraction to classify hyperspectral images effectively with minimal labeled data.
Contribution
It proposes a novel deep joint spectral-spatial feature learning framework integrated with active transfer learning, enabling effective hyperspectral image classification with limited labeled samples.
Findings
Outperforms state-of-the-art methods on three datasets
Effective training with limited labeled samples
Flexible across various transfer scenarios
| Pavia University | | Pavia Center | |
| Class | Reference data | Class | Reference data |
| Asphalt | 6631 | Water | 65971 |
| Meadows | 18649 | Trees | 7598 |
| Gravel | 2099 | Asphalt | 3090 |
| Trees | 3064 | Bricks | 2685 |
| Metal Sheets | 1345 | Bitumen | 6584 |
| Bare Soil | 5029 | Tiles | 9248 |
| Bitumen | 1330 | Shadows | 7287 |
| Bricks | 3682 | Meadows | 42826 |
| Shadows | 947 | Bare Soil | 2863 |
| Total | 42776 | Total | 148152 |
| Salinas Valley | |
| Class | Reference data |
| Greenweeds1 | 2009 |
| Greenweeds2 | 3726 |
| Fallow | 1976 |
| Rough Fallow | 1394 |
| Smooth Fallow | 2678 |
| Stubble | 3959 |
| Celery | 3579 |
| Grapes | 11271 |
| Vinyard Soil | 6203 |
| Corn | 3278 |
| Lettuce 4wk | 1068 |
| Lettuce 5wk | 1927 |
| Lettuce 6wk | 916 |
| Lettuce 7wk | 1070 |
| Untrain vinyard | 7268 |
| Vertical vinyard | 1807 |
| Total | 54129 |
| Number of hidden layers | Pavia University | Pavia Center | Salinas Valley |
| 1 | 98.59% | 99.34% | 92.52% |
| 2 | 98.78% | 99.46% | 94.60% |
| 3 | 98.75% | 99.49% | 94.24% |
| Number of hidden layers | Pavia University | Pavia Center | Salinas Valley |
| 1 | 99.33% | 99.63% | 92.69% |
| 2 | 99.35% | 99.65% | 95.13% |
| 3 | 99.13% | 99.69% | 94.45% |
| Dataset | Number of training samples per class | OA | AA | Kappa |
| Pavia University | 25 | 99.78% | 99.78% | 0.9970 |
| Pavia University | 50 | 99.85% | 99.81% | 0.9979 |
| Pavia University | 75 | 99.82% | 99.80% | 0.9975 |
| Pavia University | 100 | 99.80% | 99.77% | 0.9974 |
| Pavia Center | 25 | 99.84% | 99.49% | 0.9977 |
| Pavia Center | 50 | 99.93% | 99.81% | 0.9990 |
| Pavia Center | 75 | 99.89% | 99.64% | 0.9984 |
| Pavia Center | 100 | 99.87% | 99.60% | 0.9983 |
| Salinas Valley | 25 | 99.02% | 99.45% | 0.9891 |
| Salinas Valley | 50 | 99.26% | 99.54% | 0.9918 |
| Salinas Valley | 75 | 99.19% | 99.55% | 0.9912 |
| Salinas Valley | 100 | 99.17% | 99.48% | 0.9908 |
| Number of training samples per class | OA | AA | Kappa |
| 25 | 99.57% | 99.56% | 0.9946 |
| 50 | 99.61% | 99.61% | 0.9948 |
| 75 | 99.58% | 99.57% | 0.9947 |
| 100 | 99.60% | 99.58% | 0.9948 |
| Number of training samples per class | OA | AA | Kappa |
| 25 | 99.80% | 99.35% | 0.9971 |
| 50 | 99.86% | 99.62% | 0.9980 |
| 75 | 99.83% | 99.44% | 0.9976 |
| 100 | 99.83% | 99.43% | 0.9975 |
| Ratio of training samples | OA | AA | Kappa |
| 5% | 98.61% | 99.27% | 0.9945 |
| 10% | 98.53% | 99.26% | 0.9936 |
| 15% | 98.51% | 99.09% | 0.9834 |
| 20% | 98.51% | 99.11% | 0.9833 |
| Network | Time | Pavia University | Pavia Center | Salinas Valley |
| Pretrained Network | training time (min) | 28.62 | 41.20 | 62.67 |
| Pretrained Network | test time (min) | 0.013 | 0.039 | 0.026 |
| Transferred Network | training time (min) | 21.68 | 26.51 | 54.07 |
| Transferred Network | test time (min) | 0.012 | 0.039 | 0.023 |
| Data Set | training samples in source/target domain | OA | AA | Kappa |
| Pavia University | 200/300 | 99.47% | 99.21% | 0.9918 |
| Pavia Center | 150/300 | 99.88% | 99.68% | 0.9976 |
| Salinas Valley | 240/300 | 99.45% | 99.57% | 0.9912 |
Manuscript received April 19, 2005; revised August 26, 2015. This work was supported in part by the National Natural Science Foundation of China under Grant 61572388 and 61872021, in part by the Key R&D Program-The Key Industry Innovation Chain of Shaanxi under Grant 2017ZDCXL-GY-05-04-02, and in part by Australian Research Council Projects FL-170100117, DP-180103424, and IH180100002. Y. Xue, C. Deng, and C. Li are with the School of Electronic Engineering, Xidian University, Xi’an 710071, China (e-mail: [email protected]). X. Liu is with the State Key Lab of Software Development Environment, Beihang University, Beijing 100191, China (e-mail: [email protected]). D. Tao is with the UBTECH Sydney Artificial Intelligence Centre and the School of Information Technologies, the Faculty of Engineering and Information Technologies, the University of Sydney, 6 Cleveland St, Darlington, NSW 2008, Australia (e-mail: [email protected]). ©20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract
Deep learning has recently attracted significant attention in the field of hyperspectral image (HSI) classification. However, the construction of an efficient deep neural network (DNN) mostly relies on a large number of labeled samples being available. To address this problem, this paper proposes a unified deep network combined with active transfer learning that can be well trained for HSI classification using only minimally labeled training data. More specifically, a deep joint spectral-spatial feature is first extracted through hierarchical stacked sparse autoencoder (SSAE) networks. Active transfer learning is then exploited to transfer the pre-trained SSAE network and the limited training samples from the source domain to the target domain, where the SSAE network is subsequently fine-tuned using the limited labeled samples selected from both the source and target domains by corresponding active learning strategies. The advantages of our proposed method are threefold: 1) the network can be effectively trained using only limited labeled samples with the help of novel active learning strategies; 2) the network is flexible and scalable enough to function across various transfer situations, including cross-dataset and intra-image; 3) the learned deep joint spectral-spatial feature representation is more generic and robust than many existing joint spectral-spatial feature representations. Extensive comparative evaluations demonstrate that our proposed method significantly outperforms many state-of-the-art approaches, including both traditional and deep network-based methods, on three popular datasets.
Index Terms:
Deep learning, hyperspectral image classification, multiple feature representation, active learning, stacked sparse autoencoder (SSAE), transfer learning.
I Introduction
Owing to the rapid development of remote sensing technology, hundreds of nearly continuous spectral bands and an enormous amount of spatial information can be captured simultaneously by hyperspectral sensors. Hyperspectral images (HSIs) have been widely utilized in diverse fields, such as precision agriculture [1], geological exploration [2], and environmental sciences [3][4], where land cover classes usually need to be identified from a small set of training samples through HSI classification. However, several unfavorable factors can seriously decrease the classification accuracy when high-dimensional spectral/spatial features are involved: 1) the so-called curse of dimensionality, in which high-dimensional spectral information hinders the extraction of useful spectral properties; 2) inadequate use of the spectral and spatial information, which significantly restrains the performance of the classifier; 3) the limited availability of labeled samples, which makes it hard for the classifier to learn the distribution of the HSI data completely.
Significant efforts have been made to solve these problems using a range of different approaches, which can be divided into three main categories: supervised, unsupervised, and semi-supervised classification methods. In recent decades, some typical supervised classification algorithms, including SVM [5] [6], k-nearest-neighbors, and logistic regression [7], have been investigated for HSIs. In [5], the kernel-based SVM was first proposed to learn the feature representation of spectral bands. Besides, instead of using the full spectral bands for data processing, transformation [8] [9], principal component analysis (PCA) [10, 11, 12] and other unsupervised dimensionality reduction methods have been exploited to interpret the relevance of spectral bands. Furthermore, the semi-supervised classification methods for HSIs are provided with some labeled data in addition to unlabeled data. Bruzzone et al. [13] proposed a semi-supervised transductive SVM to maximize the margin of the hyperplane with respect to the labeled and the unlabeled samples simultaneously. In [14], a semi-supervised HSI classification algorithm is proposed that exploits both hard and soft labels to better model the phenomenon of mixed pixels present in HSIs.
The above-mentioned methods focus almost exclusively on the spectral bands in HSIs. In order to improve the interpretation of HSIs, many approaches have combined spectral features with abundant spatial contextual information [15, 16, 17], e.g., extended morphological profiles (EMPs) [18], nonnegative matrix factorization [19] [20], and conditional random field [21]. However, these traditional low-level features are sensitive to local changes in the input data, which greatly reduces the classification accuracy.
Deep neural network (DNN) has been proven to be able to automatically learn a hierarchical feature representation, which is more robust to HSIs classification [22, 23, 24]. This type of feature is invariant to local changes and thus more suitable for handling the variable spectral/spatial signatures in HSIs [23]. Deep belief network (DBN) [25], convolutional neural network (CNN) [26, 27, 28] and stacked autoencoder (SAE) [29, 30, 31] have been introduced into HSIs classification and have improved classification performance significantly. DBN is usually combined with PCA in order to learn a joint spectral-spatial feature for HSIs classification [32] [33], while CNN-based HSIs classification models tend to learn spatial information with 2-D patches [34]. In [29], SAE is first proposed to learn deep feature representation from stacked spectral-spatial feature. Tao et al. [31] exploited the stacked sparse autoencoder (SSAE) to extract deep sparse feature representation for HSIs classification, where deep multi-scale spatial features are learned from various patch sizes. In fact, DNN requires a large number of training samples to learn the parameters in different layers. Unfortunately, only a limited number of labeled samples are available for HSIs in practice, and labeling pixels manually is quite time-consuming.
Active learning (AL) and transfer learning (TL) are two popular techniques that have been employed to improve the training process by selecting some unlabeled data for labeling or by using knowledge obtained from related data. AL is an iterative procedure that involves selecting a small set of the most informative unlabeled samples with a query function to train a robust classifier. The training procedure utilizing actively sampled data performs more efficiently, because these samples are more suitable for describing the distribution of the unlabeled data. AL for HSI classification was studied using a small set of training samples in [35, 36, 37, 38, 39, 40]. Some researchers have adopted AL to perform high-quality sample labeling in order to construct a well-trained DNN. Liu et al. [41] constructed a DBN using AL and fewer training samples than are typically used in traditional semi-supervised learning methods. Unlike AL, TL aims to propagate useful knowledge from a source domain to a target domain. In recent years, TL has been successfully applied in the remote sensing field [42, 43, 44, 45, 46] and has coped well with the variability of spectral/spatial information in related HSIs that are acquired by the same sensor at different times or locations. In [44], TL is combined with CNN to train the target data using auxiliary source data and a limited number of target samples.
Inspired by the ideas of AL and TL, this paper presents a deep joint spectral-spatial feature representation model that incorporates active transfer learning for HSI classification. Compared to the shallow methods, the learned deep spectral-spatial feature representation is more discriminative for HSI classification and avoids hand-designed parameters that are sensitive to local changes of the input data, especially when the training data are limited. More specifically, in contrast to traditional feature extraction methods that directly stack the original spectral feature with spatial neighborhood information, we utilize a hierarchical SSAE network to learn a deep joint spectral-spatial feature that can effectively discover the underlying contextual and structural information in HSIs, which takes full advantage of the variable spectral and spatial features. Such a hierarchical SSAE network contains far fewer parameters than a CNN, and it is pre-trained with limited labeled samples selected through the AL strategy on the source domain. Moreover, active transfer learning is exploited to transfer the pre-trained SSAE network and a few training samples from the source domain to the target domain, where two different AL strategies are utilized: one selects a limited number of the most informative samples from the target domain, while the other removes from the source domain those samples that are incompatible with the target distribution. The pre-trained SSAE network is then fine-tuned with these few updated training samples, which greatly enhances the efficiency of the training process and makes the network more flexible for related HSI classification tasks. In this way, a generic and robust feature representation is obtained with few high-quality samples in our active transfer learning network.
Extensive experiments over three widely-used datasets demonstrate that our proposed network, which is based on active transfer learning, has powerful transfer capability under various situations and significantly outperforms several state-of-the-art methods in terms of classification performance.
The rest of this paper is organized as follows. In Section II, both the sparse AE and stacked sparse AE models are introduced in detail. Section III introduces the framework of our proposed method. Experiments and analysis are presented in Section IV. Section V concludes this paper.
II Sparse Autoencoder Model
In this section, we briefly introduce a robust model of SSAE, which is adopted to learn a sparse and discriminative feature representation for HSI classification.
II-A Sparse Autoencoder
An AE is constructed from three layers: an input layer, a hidden layer, and a reconstruction (output) layer. The AE first maps the input $x$ to the hidden layer and generates a latent representation $h$, a step termed the “encoder” step. The feature $h$, which is regarded as an abstract representation of the input data, is then decoded from the hidden layer to the reconstruction layer to give the reconstruction $z$. Fig. 1 shows the relationship among these three layers.
We formulate the two steps (i.e., “encoder” and “decoder”) as follows:
$$h = f(W_1 x + b_1)$$
$$z = f(W_2 h + b_2)$$
where $W_1$ and $W_2$ denote the “encoder” and the “decoder” weights respectively, while $b_1$ and $b_2$ represent the biases of the hidden and reconstruction units.
The activation function $f(\cdot)$, which is used to calculate the value of units in different layers, is generally set to be a sigmoid function. It can be formulated as:
$$f(a) = \frac{1}{1 + e^{-a}}$$
In fact, the aim of the AE is to learn an approximation of the identity function that makes the reconstruction as similar as possible to the input. Therefore, a loss function is designed to measure the difference between the input data and the reconstruction data over the $m$ training samples, which can be described as:
$$J(W, b) = \frac{1}{m}\sum_{i=1}^{m} J\big(W, b; x^{(i)}\big) + \frac{\lambda}{2}\big(\|W_1\|_F^2 + \|W_2\|_F^2\big)$$
where
$$J\big(W, b; x^{(i)}\big) = \frac{1}{2}\big\|z^{(i)} - x^{(i)}\big\|_2^2$$
The first term of the loss function uses the $\ell_2$ norm to measure the difference between the input data $x^{(i)}$ and the reconstruction data $z^{(i)}$, while the second term is a regularization term used to prevent over-fitting, and $\lambda$ is a weight decay parameter balancing the effect of these two terms.
Unlike a simple autoencoder model, which learns a low-level compressed representation of the input, SAE [31] can learn an “overcomplete” representation by setting the number of hidden units to be larger than the number of input units.
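In order to keep most hidden units inactive for most of the time in SAE, the average activation $\hat{\rho}_j$ of each hidden unit $j$ over the $m$ training samples is constrained to stay close to a sparsity parameter $\rho$ (which is close to zero) [47]. An extra sparse penalty term is added to the loss function to punish average activations that are far away from $\rho$. The loss function of the SAE [48] can be rewritten as:
$$J_{sparse}(W, b) = J(W, b) + \beta \sum_{j=1}^{s} \mathrm{KL}\big(\rho \,\|\, \hat{\rho}_j\big)$$
where
$$\mathrm{KL}\big(\rho \,\|\, \hat{\rho}_j\big) = \rho \log\frac{\rho}{\hat{\rho}_j} + (1 - \rho)\log\frac{1 - \rho}{1 - \hat{\rho}_j}$$
Here, $J(W, b)$ is the AE loss function defined above, $s$ is the number of hidden units, and the parameter $\beta$ is the weight of the sparse penalty, while $\mathrm{KL}(\cdot)$ is the Kullback-Leibler (KL) divergence used to measure the difference between two Bernoulli distributions with mean $\rho$ and mean $\hat{\rho}_j$. The KL value increases rapidly when the difference between $\hat{\rho}_j$ and $\rho$ grows. Thus, the sparse penalty term drives $\hat{\rho}_j$ close to $\rho$, where the KL value reaches its minimum of zero.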
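The loss described in this subsection can be illustrated with a short sketch. It is a minimal, plain-Python rendering that assumes the standard sparse autoencoder form (mean squared reconstruction error, weight decay on both weight matrices, and a KL sparsity penalty on the mean hidden activations); the function and argument names (`sparse_ae_loss`, `lam`, `beta`, `rho`) are illustrative choices, not taken from the paper.

```python
import math


def sigmoid(a):
    """Logistic activation f(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def affine_sigmoid(W, b, v):
    """Return f(W v + b) for a weight matrix W (list of rows) and bias b."""
    return [sigmoid(sum(w * x for w, x in zip(row, v)) + bias)
            for row, bias in zip(W, b)]


def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def sparse_ae_loss(X, W1, b1, W2, b2, lam, beta, rho):
    """Assumed standard sparse-AE loss: reconstruction + decay + sparsity."""
    m = len(X)
    hidden = [affine_sigmoid(W1, b1, x) for x in X]      # encoder step
    recon = [affine_sigmoid(W2, b2, h) for h in hidden]  # decoder step

    # Mean of the per-sample squared reconstruction errors.
    reconstruction = sum(
        0.5 * sum((z - x) ** 2 for z, x in zip(z_i, x_i))
        for z_i, x_i in zip(recon, X)) / m

    # Weight decay over encoder and decoder weights (biases excluded).
    decay = 0.5 * lam * sum(w * w for W in (W1, W2) for row in W for w in row)

    # Mean activation of each hidden unit, penalized by its KL distance to rho.
    rho_hat = [sum(h[j] for h in hidden) / m for j in range(len(b1))]
    sparsity = beta * sum(kl_bernoulli(rho, r) for r in rho_hat)

    return reconstruction + decay + sparsity
```

In practice the loss would be minimized with a gradient-based optimizer; the sketch only evaluates it, which is enough to see how the three terms interact.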
II-B Stacked Sparse Autoencoder (SSAE)
SSAE [31] is a deep architecture of SAE that stacks several hidden layers of basic SAE together, meaning that the output of each layer is regarded as the input of the subsequent layer in SSAE.
Fig. 2 illustrates the construction process of SSAE. First, an SAE is trained on the original input data $x$ to learn a nonlinear feature $h^{(1)}$. This feature is regarded as the input of the second SAE and used to extract the more abstract feature $h^{(2)}$. Finally, an SSAE with two hidden layers is formed by stacking these two SAEs together, allowing a deep feature to be learned from the input through a transform function $F(\cdot)$, i.e., $h^{(2)} = F(x)$. More technical details of SSAE can be found in [47] and [48].
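The stacking idea can be sketched as a simple composition of encoders: the output of one trained encoder becomes the input of the next. The snippet below is only an illustration under that assumption; `encode` and `ssae_features` are hypothetical helper names, and each layer is represented as a `(weights, biases)` pair that is assumed to have been trained already.

```python
import math


def encode(layer, v):
    """Apply one sigmoid encoder layer, given as a (weights, biases) pair."""
    W, b = layer
    return [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, v)) + bias)))
            for row, bias in zip(W, b)]


def ssae_features(layers, x):
    """Feed x through the stacked encoders; each output feeds the next layer."""
    h = x
    for layer in layers:
        h = encode(layer, h)
    return h
```

With two layers, `ssae_features([layer1, layer2], x)` returns the deeper feature obtained by encoding the first-layer feature a second time.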
III Proposed Method
In previous work, traditional deep learning models for HSI classification have tended to learn the high-level feature representation of the stacked spectral-spatial features [31][49]. These models are thus unable to take full advantage of the information contained in HSIs. Moreover, it is very difficult to obtain a large amount of labeled data to construct a well-trained DNN, and the learned feature representation is only suitable for a specific dataset. These problems have motivated us to develop a novel hierarchical SSAE network that incorporates active transfer learning, enabling the learning of a generic and robust spectral-spatial feature representation.
III-A Architecture of the Proposed Method
Fig. 3 shows the architecture of our proposed method. Spectral and spatial features are first extracted by two different, well-designed sub-networks. The two obtained features are then stacked together and input into the sub-network that follows. In this way, a joint deep spectral-spatial feature that is more suitable for our classification task can be achieved.
From the spectral perspective, an SSAE with $k$ hidden layers is designed to extract a more abstract spectral feature $h_{spe}$ for a given HSI. Unlike the spectral characteristic, spatial information is represented as a 2-D structure, which contains a lot of useful contextual information for our classification task. If we directly exploit an end-to-end deep learning model to extract the spatial feature, the spatial neighborhood information must be extracted within small patches and then vectorized to a 1-D structure, which may lead to the loss of contextual information [34]. Moreover, such an end-to-end learning model generally depends on a large quantity of training samples with strong supervision. Inspired by [15], an extended morphological attribute profile (EMAP) is utilized in our preprocessing stage to preserve the spatial structure information in a 1-D form that serves as the input of the SSAE without any supervision; its advantage is shown in Section IV. The EMAP extracts the raw spatial structure information from the first principal components of the HSI by using multiple attribute profiles (APs). The shallow model proposed in [50] chose the output of EMAP as a spatial feature and combined it with the original spectral feature. In our framework, spatial information is preprocessed by EMAP with standard deviation and area attributes, then sent into an SSAE branch as a 1-D signal to learn its corresponding deep spatial feature $h_{spa}$.
We have now obtained two deep descriptors, the spectral feature $h_{spe}$ and the spatial feature $h_{spa}$, from our feature extraction sub-networks. Next, we design a task-driven fusion method, which stacks the learned deep spectral and spatial features (i.e., $h_{spe}$ and $h_{spa}$) together as a new feature before feeding it into another SSAE. This SSAE uses its hidden layers to fuse the stacked spectral-spatial feature, after which the last hidden layer outputs a feature that can be regarded as the deep joint spectral-spatial feature $h_{joint}$. This feature contains most of the spectral and spatial information in the HSI. Finally, the deep joint spectral-spatial feature is classified by the softmax regression layer, which is used to predict the conditional probability distribution over the $C$ classes as follows:
$$P(y = c \mid h_{joint}; W_s, b_s) = \frac{\exp\big(W_s^{(c)} h_{joint} + b_s^{(c)}\big)}{\sum_{j=1}^{C} \exp\big(W_s^{(j)} h_{joint} + b_s^{(j)}\big)}$$
where $W_s$ and $b_s$ are the weight and bias of the softmax regression layer respectively, and $W_s^{(c)}$ and $b_s^{(c)}$ denote the row of $W_s$ and the entry of $b_s$ associated with class $c$.
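The fusion and classification step can be sketched as follows. This is a simplified illustration, not the authors' implementation: the fusion SSAE is passed in as an arbitrary callable named `fuse` (a hypothetical parameter), the two branch features are concatenated before fusion, and the softmax layer is a plain affine map followed by normalization.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_joint(h_spectral, h_spatial, fuse, W_s, b_s):
    """Stack the two deep features, fuse them, and return class probabilities.

    `fuse` stands in for the feature-fusion SSAE (any callable on a list).
    W_s holds one weight row per class; b_s holds one bias per class.
    """
    stacked = list(h_spectral) + list(h_spatial)   # stack spectral + spatial
    h_joint = fuse(stacked)                        # deep joint feature
    scores = [sum(w * h for w, h in zip(row, h_joint)) + bias
              for row, bias in zip(W_s, b_s)]
    return softmax(scores)
```

The predicted label is the index of the largest returned probability; the same probability vector is what an uncertainty-based query function would inspect.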
III-B Active Sampling Strategy for Pre-trained Network
In order to pre-train hierarchical SSAE networks on the source domain and achieve an impressive performance, a large amount of labeled training samples are required for supervised learning. However, only a very limited amount of labeled samples are available in practice for training SSAE in the task of HSI classification, which is prone to over-fitting. To address this problem, the batch-mode AL method is considered here for use in selecting a small set of high-quality samples so that SSAE can be trained in a more effective way.
In the AL strategy, various query functions (or criteria) can be chosen for sample selection, including margin sampling (MS) [35] and multiclass-level uncertainty (MCLU) [36]. In most previous work, AL has generally been applied to shallow models such as multi-class SVM [51]. An SAE with an AL procedure was proposed for HSI classification in [38], where the uncertainty criterion of the query function depends on the output of the softmax layer and the most uncertain samples are added iteratively into the training set in order to retrain the classifier.
The flowchart of our proposed batch-mode AL sampling method is shown in Fig. 4. As illustrated, we first exploit the greedy layer-wise training strategy to train SSAE layer by layer with a few training samples; the learned features of these samples and their corresponding labels are then used to train a softmax classifier and fine-tune SSAE with the softmax layer by using backward propagation. These training samples are sufficient to train SSAE to learn a favorable feature representation, which can be verified in Section IV. Next, a subset of unlabeled data regarded as the candidate set is classified with softmax regression. Finally, AL iteratively selects the most uncertain unlabeled samples, adds them into the training set with their true labels, and simultaneously removes them from the candidate set. Unlike the AL sampling procedure reported in [38], our proposed method first exploits a few training samples and their corresponding labels to train and fine-tune SSAE with a softmax layer; then, in order to boost the performance of SSAE, AL selects the most informative samples to fine-tune SSAE together with the softmax layer again at each iteration, instead of only retraining the parameters of the softmax layer. In this way, a well-trained SSAE can be obtained more effectively with limited training samples.
Here, the MCLU technique is chosen as the query criterion, which applies a difference function to record the uncertainty of unlabeled samples. MCLU on logistic regression [39] considers the difference between the largest and second largest class-conditional probability densities as the following objective function:
$$c_{diff}(x) = p(y = \omega_{max1} \mid x) - p(y = \omega_{max2} \mid x)$$
where
$$\omega_{max1} = \arg\max_{\omega \in \Omega} p(y = \omega \mid x)$$
$$\omega_{max2} = \arg\max_{\omega \in \Omega \setminus \{\omega_{max1}\}} p(y = \omega \mid x)$$
and $\Omega$ is the set of classes. If the value of $c_{diff}(x)$ is large, the possibility that $x$ belongs to $\omega_{max1}$ is high. On the contrary, a small $c_{diff}(x)$ indicates that $x$ will be assigned to the predicted class with low confidence. Under these circumstances, $x$ is treated as an uncertain sample and should be manually labeled in order to better describe the distribution of the feature space. Namely, the MCLU technique selects the subset of unlabeled samples with the minimum values of $c_{diff}(x)$; this subset carries the most information about the unlabeled data and is passed to the supervisor for labeling.
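The MCLU query described above reduces to ranking candidates by the gap between their two largest class probabilities. The sketch below assumes the classifier's softmax outputs are already available as one probability list per candidate; `mclu_confidence` and `select_most_uncertain` are illustrative names rather than the paper's.

```python
def mclu_confidence(probs):
    """Difference between the largest and second-largest class probability."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


def select_most_uncertain(candidate_probs, batch_size):
    """Return indices of the batch_size candidates with the smallest gap.

    candidate_probs: one list of class probabilities per candidate sample.
    A small gap means the classifier hesitates between two classes.
    """
    ranked = sorted(range(len(candidate_probs)),
                    key=lambda i: mclu_confidence(candidate_probs[i]))
    return ranked[:batch_size]


# Usage: the second candidate is the least certain, the first the most certain.
probs = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.60, 0.30, 0.10]]
print(select_most_uncertain(probs, batch_size=2))  # -> [1, 2]
```

The selected indices would then be labeled, moved from the candidate set into the training set, and used for the next round of fine-tuning.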
III-C Active Knowledge and Samples Transfer Learning
To facilitate the learning of a generic and flexible feature representation for multiple related HSIs and improve the performance of the target task with limited training samples, we here propose an active transfer learning method which transfers the knowledge and training samples learned from the source domain to the target domain.
Traditional TL methods have been successful in the high-resolution remote sensing (HRRS) image classification task. They usually prefer to transfer the bottom layers of a network that has been pre-trained on ImageNet [42][44] or on related images [45] [46] to the target network, and then employ target data to train the top layers of the target network for HRRS classification. In our proposed active transfer learning framework, we not only adopt the network pre-trained on source HSIs, but also pay more attention to the relationship between the distributions of source and target HSI data. The technical details are introduced in Algorithm 1. We first initialize the hierarchical SSAE networks using the training samples on the source domain selected by an AL sampling strategy. The pre-trained network and training set on the source domain are then transferred to the target domain simultaneously. Subsequently, we iteratively update the training set with two different criteria on the target and source domains (i.e., the sample selecting criterion in Eq. (12) and the sample removing criterion in Eq. (14)), and the updated training set is used to fine-tune the pre-trained SSAE network.
[Algorithm 1: active transfer learning procedure]
The AL with MCLU sampling strategy shown in Eq. (9) iteratively selects the minimum number of the most informative samples from the target domain and adds them to the source training set $T^{S}$. Meanwhile, the training samples in $T^{S}$ that do not match the distribution of the updated training set $T^{(t)}$ at the $t$-th iteration are discarded. The criterion for removing source domain data is as follows:
$$\delta^{(t)}(x) = p^{S}(y = \omega \mid x) - p^{(t)}(y = \omega \mid x)$$
where $\delta^{(t)}(x)$ measures the difference between the class-conditional probability densities estimated from the source training samples and from the updated training data obtained at the $t$-th iteration for each sample $x \in T^{S}$ with class $\omega$. If this value is small, the distribution of the class $\omega$ changes only a little for the source sample $x$. On the other hand, a large $\delta^{(t)}(x)$ indicates that the distribution of class $\omega$ on the source domain has shifted gradually towards the target data after $t$ iterations, so that $x$ can no longer describe the distribution of the target domain. Therefore, $x$ should be removed from the training set. The sample removing criterion selects the sample that has the maximum value of $\delta^{(t)}(x)$, as follows:
$$x^{rm} = \arg\max_{x \in T^{S}} \delta^{(t)}(x)$$
Finally, the training set updated by the two criteria iteratively fine-tunes the source pre-trained network, gradually tailoring it to the distribution of target domain until the stopping criterion is satisfied. Meanwhile, it learns a generic and robust feature representation for the target data.
The stopping criterion is determined by the value of the loss function of SSAE, which guarantees the convergence of the algorithm. During the training process of SSAE, the value of the loss function first decreases quickly and then oscillates near the minimum, while the classification accuracy reaches its best value at the same time. Thus, the iteration of active transfer learning stops once the value of the loss function is less than a threshold $\varepsilon$, where $\varepsilon$ is set empirically from the experiments.
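One iteration of the training-set update in this subsection can be sketched as below. This is an assumption-laden illustration: it takes the classifier outputs as given, treats the removal score as the signed drop in the probability of a source sample's own class between the source-only model and the current model, and uses the hypothetical names `update_training_sets`, `n_add`, and `n_remove`. The fine-tuning step and the loss-threshold stopping test are left outside the sketch.

```python
def update_training_sets(target_probs, source_p_before, source_p_now,
                         n_add, n_remove):
    """Pick target samples to add and source samples to drop for one iteration.

    target_probs:    class-probability list for each target candidate.
    source_p_before: probability of each source sample's own class under the
                     model trained on the source set only.
    source_p_now:    the same probability under the current fine-tuned model.
    Returns (indices of target candidates to label and add,
             indices of source samples to remove).
    """
    def margin(p):
        top1, top2 = sorted(p, reverse=True)[:2]
        return top1 - top2

    # Selection: the most uncertain target candidates (smallest margin).
    add = sorted(range(len(target_probs)),
                 key=lambda i: margin(target_probs[i]))[:n_add]

    # Removal: source samples whose class probability dropped the most
    # (assumed form of the shift measure).
    shift = [before - now for before, now in zip(source_p_before, source_p_now)]
    remove = sorted(range(len(shift)),
                    key=lambda i: shift[i], reverse=True)[:n_remove]
    return add, remove
```

A full loop would alternate this update with fine-tuning on the updated training set and stop once the training loss falls below the chosen threshold.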
IV Experiments and Discussions
In this section, in order to comprehensively evaluate the performance of our proposed method, we carry out different experiments on three popular hyperspectral datasets and compare our method with several state-of-the-art alternatives. All the following experimental results were generated on a Windows 7 personal computer equipped with a 64-bit Intel Core i5-3470 CPU running at 3.2 GHz and 8 GB of RAM. All the proposed methods are implemented in MATLAB R2015b.
IV-A Datasets and Settings
We employ four widely-used hyperspectral datasets in the experiments: Pavia University, Pavia Center, Salinas Valley and Indian Pines. Fig. 5 shows their false color maps and ground-truth maps. Tables I and II list the classes and reference data of the Pavia University, Pavia Center, and Salinas Valley datasets.
IV-A1 Pavia University
This dataset was collected by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over Pavia, Italy. 103 useful spectral bands remain after removing noise-affected bands from the 115 total bands. The size of the image in each band is $610 \times 340$ pixels. There are 9 classes of land cover used in the experiment. Fig. 5 (a) shows the false color map and ground-truth map of Pavia University.
IV-A2 Pavia Center
This dataset was obtained in the same way as Pavia University. There are 9 different types of land objects in total. The image size of Pavia Center is $1096 \times 715$ pixels after deleting a 381-pixel-wide black band. The false color map and the ground-truth map are shown in Fig. 5 (b).
IV-A3 Salinas Valley
This dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over Salinas, California, and contains 204 bands after removing water absorption bands and noisy bands. The image size is 512 x 217 pixels, and there are 16 classes of land covers on the scene. The false color image and groundtruth map are shown in Fig. 5 (c).
All of these four datasets are divided into three parts: the training, candidate and test sets. We first disorder the data in each class randomly, and then select (=25, 50, 75, 100) labeled samples from each class for training, 20% of the unlabeled data per class as the candidate data and the rest data of the image data for testing. Moreover, in order to avoid the randomness in sampling and the marginal performance, we repeat the experiments 10 times and use the mean value of the classification results to evaluate the performance of the proposed methods.
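The per-class split described above can be sketched as follows. The snippet assumes that the 20% candidate share is taken from the samples left in each class after the training samples are removed, which is one reading of the text; `split_by_class`, `candidate_ratio`, and `seed` are illustrative names.

```python
import random


def split_by_class(labels, n_train, candidate_ratio=0.2, seed=0):
    """Split sample indices into training, candidate, and test sets per class.

    labels: class label of every labeled pixel (one entry per sample).
    Per class: shuffle, take n_train samples for training, then a
    candidate_ratio share of the remainder as candidates (assumed reading),
    and keep the rest for testing.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)

    train, candidate, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        train += idxs[:n_train]
        rest = idxs[n_train:]
        n_cand = int(round(candidate_ratio * len(rest)))
        candidate += rest[:n_cand]
        test += rest[n_cand:]
    return train, candidate, test
```

Repeating the split with ten different `seed` values and averaging the resulting scores would mirror the ten-run protocol described above.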
In the active transfer learning experiments, we consider various transfer situations, including cross-dataset and intra-image transfer. In the cross-dataset case, Pavia University and Pavia Center can be used as source datasets for each other because they have some classes of land covers in common. As for Salinas Valley, we use Indian Pines (collected by the same sensor as Salinas Valley) to pre-train the hierarchical SSAE networks. Indian Pines contains 16 classes of land covers, but their types differ from those in Salinas Valley. The details can be found in Fig. 5 (d). We assume that all datasets have sufficient samples for training. For the intra-image case, we select two different regions within each of the three datasets as the source and target domains respectively; these regions have the same classes of land covers.
As for the evaluation metric, commonly used statistics such as overall accuracy (OA), average accuracy (AA), and Kappa coefficient (Kappa) [52] are applied to record and assess the performance of different classification methods.
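For reference, the three statistics can be computed from a confusion matrix as in the sketch below. It uses the standard definitions of OA, AA, and Cohen's Kappa; the function name `classification_metrics` is illustrative and is not taken from the paper.

```python
def classification_metrics(confusion):
    """Return (OA, AA, Kappa) for a square confusion matrix.

    confusion[i][j] is the number of samples of true class i
    that were predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    # Overall accuracy: fraction of all samples classified correctly.
    oa = correct / total

    # Average accuracy: mean of the per-class accuracies (empty classes skipped).
    per_class = [confusion[i][i] / sum(confusion[i])
                 for i in range(n_classes) if sum(confusion[i]) > 0]
    aa = sum(per_class) / len(per_class)

    # Kappa: agreement corrected for the agreement expected by chance.
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa
```

For example, the two-class matrix `[[45, 5], [10, 40]]` gives OA of about 0.85, AA of about 0.85, and Kappa of about 0.70.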
IV-B SSAE Structure Analysis
Unlike the shallow feature extraction models, SSAE can learn the feature distribution automatically. Thus, the setting of the structure parameters for SSAE will play an important role in the quality of the extracted features. Here, we investigate how the number of hidden layers in SSAE and different query functions of AL influence the classification performance.
IV-B1 Depth Effect
The number of hidden layers in SSAE is a key factor affecting classification performance, since it determines the abstraction level of the learned features. Here, we vary only the number of hidden layers and fix the other parameters to examine the corresponding performance. For each dataset, we use 3% labeled samples per class to train the network. The weights and biases in each layer are initialized randomly and optimized by minimizing the loss function in a greedy layer-wise manner. We try several SSAEs with depths varying from 1 to 3 layers. In this experiment, the units in the first hidden layer are set to learn an “overcomplete” feature representation of the input data during the unsupervised feature learning stage. For Pavia University and Pavia Center, the classification performance is best when the numbers of units in the hidden layers of the feature-extraction SSAE are set to 200, 150, and 100, and those of the feature-fusion SSAE are set to 400, 200, and 150. For Salinas Valley, the hidden layers of the feature-extraction SSAE are composed of 400, 300, and 200 units, and those of the feature-fusion SSAE are composed of 800, 400, and 300 units. The number of output units in the softmax layer equals the number of land cover classes in each dataset.

The training samples are used for pre-training and fine-tuning the whole SSAE with the softmax layer. The number of iterations for both the unsupervised training and supervised fine-tuning stages is set to 500, and the maximum permitted number of iterations of the softmax classifier is 200. For the standard stochastic gradient descent method, the sparsity parameter is set to 0.1, the weight decay parameter [53] is fixed as 7, and the learning rate is 0.05. The sparsity penalty weight is set to 0.05.
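To make the roles of the sparsity parameter (0.1) and the sparsity penalty weight (0.05) concrete, the sketch below evaluates the Kullback-Leibler sparsity penalty that is commonly used in sparse autoencoders. Whether the authors use exactly this form is an assumption, since this section does not restate the loss; the function name and argument names are illustrative.

```python
import math

def kl_sparsity_penalty(mean_activations, rho=0.1, beta=0.05):
    """Weighted KL sparsity penalty over the hidden units of one layer.

    mean_activations: average activation of each hidden unit over the
    training batch, each strictly between 0 and 1 (sigmoid units assumed).
    rho:  target average activation (the sparsity parameter).
    beta: weight of the penalty term in the overall loss.
    """
    penalty = 0.0
    for rho_hat in mean_activations:
        penalty += (rho * math.log(rho / rho_hat)
                    + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))
    return beta * penalty
```

The penalty is zero when every unit's average activation equals `rho` and grows as the units become more active than the target, which is what pushes the hidden representation toward sparsity.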
The depth effect on the classification results for different datasets is shown in Tables III and IV. Table III illustrates the sensitivity of feature-extraction SSAE over the number of hidden layers, while Table IV evaluates the sensitivity of feature-fusion SSAE. From the classification results shown in these tables, we can see that when the number of hidden layers changes from 1 to 2, OA increases significantly. However, these values flatten out when the depth changes from 2 to 3. For Salinas Valley, the classification performance can be seen to decline by 0.68% in Table IV when the hidden layer number increases. Therefore, in our proposed network, we construct SSAE with two hidden layers for both feature extraction and feature fusion.
IV-B2 Query Function Effect
To verify the effectiveness of AL sampling with the MCLU method, we compare the MCLU technique with two other typical query functions, namely random sampling and the MS method. In the experiments, we first divide the dataset into three parts: 50 labeled samples of each class are randomly selected for training SSAE, 20% of the unlabeled data per class is regarded as the candidate set, and the rest of the reference data is used to test the classification performance. The training samples are used to pre-train the parameters of SSAE with two hidden layers. Each of the three query functions is then combined with the softmax layer to select the 50 most uncertain samples (query step) from the candidate set at each iteration. These samples are added to the training set with their true labels to fine-tune the network and are simultaneously removed from the candidate set. For the upper limit of label queries (the budget N), we use 26 active learning iterations with 1,300 samples to observe the trend of the classification results. We then record the classification results of AL based SSAE with the three query functions, as shown in Fig. 6. Comparing the classification results across the three datasets, we find that the MCLU technique always outperforms the other two techniques during the AL procedure, especially on Salinas Valley. Therefore, the MCLU technique is applied as the query function in our proposed framework.
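The query step can be illustrated with the short sketch below. It assumes that the MCLU score of a candidate is the difference between its two largest softmax outputs, so that a small difference marks an uncertain sample; this reading and the function name `mclu_query` are assumptions, as the exact definition is not restated in this section.

```python
def mclu_query(probabilities, batch_size=50):
    """Return the indices of the candidates with the smallest top-two margin.

    probabilities: one list of class probabilities per candidate sample
    (for example, softmax outputs), each with at least two classes.
    """
    margins = []
    for idx, probs in enumerate(probabilities):
        best, second = sorted(probs, reverse=True)[:2]
        margins.append((best - second, idx))   # small margin = high uncertainty
    margins.sort()
    return [idx for _, idx in margins[:batch_size]]


# Label budget implied by the setting above: 26 iterations of 50 queries.
budget = 26 * 50  # 1300 queried samples
```

Random sampling would replace the sorted margins with a random draw of the same size, which is why it tends to spend part of the budget on samples the network already classifies confidently.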
IV-C Comparison with State-of-the-Art Methods
In this section, we aim to compare our proposed method with five other state-of-the-art classification methods. These methods are: joint spectral-EMAP feature representation with SVM (Spe-EMAP SVM) [54], joint spectral-spatial feature representation with SAE (JSSAE) [29], deep spectral feature representation with SSAE (Spe-SSAE), deep EMAP feature representation with SSAE (EMAP-SSAE), and joint spectral-EMAP feature representation with SSAE (Spe-EMAP SSAE). SVM is a typical shallow classification method regarded as a benchmark in the field of HSI classification. Spe-SSAE and EMAP-SSAE are two branches of our proposed method and are used to determine whether or not our deep joint spectral-spatial feature representation model takes effect. The JSSAE method is used to compare the influence of different spatial features on the classification performance.
To set the parameters of these baseline methods, we first tune them for their best performance. The Lib-SVM toolbox [55] is used to implement the Spe-EMAP SVM method. We use linear SVM as the classifier and find the optimal SVM parameters by means of five-fold cross validation, with one parameter searched in the range $[2^{5}, 2^{6}, 2^{7}, 2^{8}, 2^{9}, 2^{10}]$ and the other in the range $[2^{-5}, 2^{-4}, \dots, 2^{4}, 2^{5}]$. To ensure a fair comparison, the structure of JSSAE is kept the same as ours, including the number of hidden layers, training iterations and learning rate. The spatial information in JSSAE is extracted from the spatial neighborhood, whose size is set to 5 x 5 pixels. Spe-EMAP SSAE stacks the original spectral feature with the EMAP feature directly to learn the joint spectral-spatial features. The parameters of the rest of the baseline methods are set as described in Section IV-B.
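The size of the SVM search can be made explicit with a small sketch. It reads the two ranges in the text as powers of two with exponents 5 to 10 and -5 to 5; the names `first_grid` and `second_grid` are placeholders, because the parameter symbols are not legible in this version of the text.

```python
from itertools import product

# Powers of two with exponents 5..10 and -5..5 (assumed reading of the ranges).
first_grid = [2.0 ** e for e in range(5, 11)]
second_grid = [2.0 ** e for e in range(-5, 6)]

# Every pair is one candidate setting; each is scored by five-fold cross validation.
param_grid = list(product(first_grid, second_grid))
n_settings = len(param_grid)   # 6 * 11 = 66 settings
n_fits = n_settings * 5        # 330 SVM trainings for one full grid search
```

Under this reading, a single grid search trains 330 SVMs, which is still inexpensive compared with the iterative training of the deep models.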
In the experiments, we reproduce all the above methods on four different training sets: 5%, 10%, 15%, and 20% of the original data. We repeat each HSI classification method 10 times. The mean overall accuracy for the three datasets is recorded in Tables V, VI, and VII. Although SVM takes the same features as input as the SSAE method, SSAE achieves a higher overall accuracy than SVM on all training sets. This demonstrates that the deep feature is more stable and robust, which improves the effectiveness of HSI classification. The SSAE that takes EMAP features as input achieves a higher overall accuracy than JSSAE, indicating that the spatial structure information extracted by EMAP is more appropriate for SSAE to learn a spatial feature representation than the spatial neighborhood information. Comparing Spe-EMAP SSAE with the Spe-SSAE and EMAP-SSAE methods, we find that learning from multiple features with a deep model does indeed improve the performance of HSI classification. In particular, our proposed method outperforms the other classification methods on all training sets across the three datasets, demonstrating that the deep joint spectral-spatial feature makes a genuine difference and takes good advantage of the spectral/spatial signatures. Parts of the different classification maps are shown in Figs. 7, 8 and 9. Here, we can see that there are far fewer wrongly labeled pixels in the map of our method than in those of the other methods, especially in Salinas Valley.
IV-D Transferability of Active Transfer Learning Network
IV-D1 Analysis of AL Procedure of the Pre-trained Network
In this section, we train our model with the AL sampling strategy over all three datasets. We first set four training sets by randomly selecting 25, 50, 75, and 100 labeled samples in each class. 20% of unlabeled data per class is regarded as the candidate set. The MCLU technique is exploited as the query function to choose 50 most informative samples from the candidate set, according to the prediction of softmax layer for different classes at each iteration. Finally, the whole network is fine-tuned iteratively by the updated training data. In the experiments, we set the upper limit of AL iterations at 26. The classification results of the AL procedure for the pre-trained network are shown in Fig. 10, and the values of OA, AA, and Kappa in different training sets for three datasets are recorded in Table VIII.
As shown in Fig. 10, the curves of AL sampling with 26 iterations demonstrate that the overall accuracy increases rapidly during the first 4 or 5 iterations, after which the curve flattens. Our method uses less than half of the labeled samples used in the non-AL method to train SSAE and still obtains a promising classification result. The values of OA, AA, and Kappa are all higher than those for the non-AL sampling method shown in Tables V, VI, and VII, which indicates that the selected most uncertain samples can better describe the distribution of the unlabeled data and effectively avoid labeling redundant samples.
IV-D2 Transferability of the Pre-trained Network
In this paper, we adopt the active transfer learning method to learn the various spectral and spatial signatures of different land covers between the source domain and target domain. Here, we transfer four different source training sets and the pre-trained feature representation networks on these source training sets to the target domain. As noted above, Pavia University and Pavia Center can be used as source data for each other, and the initial training sets on these datasets for pre-training are set at 25, 50, 75, and 100 samples per class. Because there are not enough samples on Indian Pines, we use 5%, 10%, 15%, and 20% labeled data per class to initialize the pre-trained network and then transfer it to Salinas Valley.
To learn the deep feature representation of spectral and spatial information on the target domain, we first randomly select 20% of the unlabeled target data as the candidate set; the remaining 80% of samples are set as the test data. The original source training data are regarded as the initial training set. All the candidate data are then passed through the pre-trained network. AL queries the 80 most informative samples in the candidate set and adds them to the training set with their true labels, while 50 samples from the original source training set are removed from the training set at each iteration. We extract the deep spectral and spatial features from the source domain and target domain with the transferred network, after which the deep feature fusion network is initialized with the stacked deep spectral-spatial feature of the original source training data. Finally, the network is fine-tuned with the target deep spectral-spatial feature.
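The training-set update in each active transfer learning iteration can be sketched as below. The text fixes only the counts (80 target samples added, 50 source samples removed per iteration); the `uncertainty` and `discrepancy` scores used for ranking, as well as the function name, are hypothetical placeholders for whatever criteria the network provides.

```python
def update_training_set(source_train, target_train, candidates,
                        uncertainty, discrepancy,
                        n_query=80, n_remove=50):
    """Perform one active-transfer update on lists of sample ids.

    uncertainty: dict mapping candidate id -> score (higher = more informative).
    discrepancy: dict mapping source id -> score (higher = less compatible
                 with the target domain). Both scores are assumed inputs.
    """
    # Query the most informative target candidates and label them.
    queried = sorted(candidates, key=lambda i: uncertainty[i], reverse=True)[:n_query]
    queried_set = set(queried)
    new_candidates = [i for i in candidates if i not in queried_set]
    new_target = target_train + queried

    # Drop the source samples that fit the target distribution worst.
    dropped = set(sorted(source_train, key=lambda i: discrepancy[i],
                         reverse=True)[:n_remove])
    new_source = [i for i in source_train if i not in dropped]
    return new_source, new_target, new_candidates
```

With, say, 450 initial source samples (9 classes with 50 samples each), the source part of the training set is exhausted after nine iterations, so later iterations train on target samples only.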
During the experiments, we record the value of the SSAE loss function after every iteration of active transfer learning in Fig. 11. For Pavia University and Pavia Center, the loss no longer decreases and the classification accuracy reaches its best once the loss falls below $5 \times 10^{-6}$, so we adopt this value as the loss threshold for these two datasets. As for Salinas Valley, the threshold is set separately. Finally, the number of active transfer learning iterations is set to 10 according to the evolution of the loss function on the three datasets.
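One way to combine the loss threshold with the iteration limit is sketched below. It reads the Pavia threshold as 5e-6 and assumes that training stops at whichever condition is met first; the text does not spell out how the two settings interact, so this stopping rule and the function name are assumptions.

```python
def transfer_iterations(loss_per_iteration, threshold=5e-6, max_iterations=10):
    """Count the iterations run before the loss threshold or the limit is hit.

    loss_per_iteration: iterable yielding the SSAE loss after each
    active transfer learning iteration.
    """
    performed = 0
    for loss in loss_per_iteration:
        performed += 1
        if loss < threshold or performed >= max_iterations:
            break
    return performed
```

For a loss sequence such as `[1e-3, 1e-5, 4e-6, 1e-6]`, the loop stops after the third iteration, the first one whose loss is below the threshold.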
The performance results of the active transfer learning methods are presented in Fig. 12 and Tables IX, X, and XI. These results show that the active transfer learning method successfully learns the various spectral/spatial signatures in related HSIs and achieves a promising classification result with few target samples. Moreover, the improvement in classification performance is influenced by the size of the source training data. When the number of initial training samples per class is more than 50 (on Pavia University and Pavia Center) or the ratio reaches 10% (on Salinas Valley), the overall accuracy slightly declines. This is because more source domain information prevents the network from learning the distribution of the target domain. The curves in Fig. 12 indicate that the classification performance for the target domain is very poor when target samples are not used, which is due to the phenomenon of “domain shift”, whereas the overall accuracy increases significantly when only 80 target samples are added to the training set. In particular, the disparity between the source domain and the target domain makes a substantial difference to the performance of the active transfer learning method. Fig. 12 indicates that our proposed method performs better on Pavia University than on Salinas Valley. This is because Pavia University and Pavia Center share common land cover classes, while the land cover classes of Indian Pines and Salinas Valley are different.
IV-D3 Computational Costs of Active Transfer Learning Network
We report the computational time of the proposed method in Table XII. In general, compared with shallow classification methods, a deep neural network such as the proposed method takes more time to train because it requires iterative calculation. On the other hand, the test time of the deep neural network can be much shorter, which is more important in real classification tasks. As shown in Table XII, for the pre-trained network on the source domain, the training time is 28.62, 41.20, and 62.67 minutes for Pavia University, Pavia Center and Salinas Valley, respectively, and testing takes only 0.013, 0.039 and 0.026 minutes on these datasets. This is much shorter than for several deep learning based methods, because our proposed method is robust and can extract the deep joint spectral-spatial feature of test samples quickly. As for the transferred network, the training time is 21.68, 26.51 and 54.07 minutes for these three datasets, which is shorter than the training time of the pre-trained network. This is because only a few target training samples are used to transfer the pre-trained model, and the number of active transfer learning iterations is much smaller than the number of AL iterations of the pre-trained model.
IV-D4 Domain Adaptation of the Pre-trained Network
Here, we verify the domain adaptation of the pre-trained network for three datasets. As shown in Fig. 14 (a) and (b), the ground-truth maps of the source and target domains contain five classes (class 1, 2, 3, 4, and 8) on Pavia University and five classes (class 1, 2, 3, 5, and 8) on Pavia Center. On Salinas Valley, there are 6 classes (class 1, 10, 11, 12, 13, and 14) shown in Fig. 14 (c). For Pavia University and Salinas Valley, we randomly select 40 samples per class in the source domain to pre-train the network. We use 30 source samples per class in Pavia Center. We query 20 samples in the target domain over 15 iterations for all three datasets. The performance of the domain adaptation method is presented in Fig. 13 and Table XIII. This method relies on very limited training data and still obtains an effective result. We can conclude that active transfer learning does take effect on the domain adaptation, especially when there is a large distribution gap between the source and target domains.
V Conclusion
In this paper, we propose a novel active transfer learning network for HSI classification, where two SSAE sub-networks are applied to extract deep spectral and spatial features and one sequential SSAE sub-network is used to seamlessly fuse these deep features. The AL sampling method is exploited to select a subset of the most informative unlabeled samples for labeling and add them to the training set at each iteration, which can boost the performance of a pre-trained network with limited labeled samples. Meanwhile, considering the variable spectral/spatial signatures of different land covers in related HSIs, the pre-trained network and the training data from the source domain are transferred to the target domain. Subsequently, the pre-trained network is fine-tuned with the updated training set, which comes from two sources, i.e., the most informative samples in the target domain and the source samples remaining after removing those discrepant with the distribution of the target domain. Experimental results demonstrate that the proposed method exhibits promising performance compared with many state-of-the-art approaches. In the future, the optimal architecture parameters of hierarchical SSAE and the AL sampling criterion used in our method still need further research. Moreover, it is also worth investigating the possibility of exploiting useful transfer knowledge among the data from different sensors.
- [1] F. M. Lacar, M. M. Lewis, and I. T. Grierson, “Use of hyperspectral imagery for mapping grape varieties in the Barossa Valley, South Australia,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Sydney, Australia, vol. 6, pp. 2875–2877, 2001.
- [2] F. V. D. Meer, “Analysis of spectral absorption features in hyperspectral imagery,” Int. J. Appl. Earth Observ. Geoinf., vol. 5, pp. 55–68, Jan. 2004.
- [3] T. J. Malthus and P. J. Mumby, “Remote sensing of the coastal zone: An overview and priorities for future research,” Int. J. Remote Sens., vol. 24, pp. 2805–2815, Nov. 2003.
- [4] J. M. Bioucas-Dias, A. Plaza, G. Camps-Valls, P. Scheunders, N. Nasrabadi, and J. Chanussot, “Hyperspectral remote sensing data analysis and future challenges,” Geosci. Remote Sens. Mag., vol. 1, pp. 6–36, Feb. 2013.
- [5] G. Camps-Valls and L. Bruzzone, “Kernel-based methods for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 43, pp. 1351–1362, Jun. 2005.
- [6] G. M. Foody and A. Mathur, “A relative evaluation of multiclass image classification by support vector machines,” IEEE Trans. Geosci. Remote Sens., vol. 42, no. 6, pp. 1335–1343, 2004.
- [7] J. Li, J. M. Bioucas-Dias, and A. Plaza, “Semisupervised hyperspectral image classification using soft sparse multinomial logistic regression,” IEEE Geosci. Remote Sens. Lett., vol. 10, pp. 318–322, Mar. 2012.
- [8] L. M. Bruce, C. H. Koger, and J. Li, “Dimensionality reduction of hyperspectral data using discrete wavelet transform feature extraction,” IEEE Trans. Geosci. Remote Sens., vol. 40, pp. 2331–2338, Oct. 2002.
