PUNCH: Positive UNlabelled Classification based information retrieval in Hyperspectral images

Anirban Santara; Jayeeta Datta; Sourav Sarkar; Ankur Garg; Kirti Padia; Pabitra Mitra

arXiv:1904.04547·cs.IR·April 10, 2019

TL;DR

This paper introduces a novel material-agnostic positive-unlabeled classification framework for hyperspectral image retrieval, addressing label scarcity and spectral variability, with two approaches and validation on benchmark datasets.

Contribution

It develops a new PU learning framework for hyperspectral image retrieval that works across materials and scenes without revealing material identity.

Findings

01 NNRE-PU and PN-PU achieve comparable performance on the three benchmark datasets, with neither method consistently ahead across datasets and annotation models.

02 Material-agnostic approach improves generalization in hyperspectral classification.

03 Two annotation models effectively simulate human labeling patterns.

Abstract

Hyperspectral images of land-cover captured by airborne or satellite-mounted sensors provide a rich source of information about the chemical composition of the materials present in a given place. This makes hyperspectral imaging an important tool for earth sciences, land-cover studies, and military and strategic applications. However, the scarcity of labeled training examples and spatial variability of spectral signature are two of the biggest challenges faced by hyperspectral image classification. In order to address these issues, we aim to develop a framework for material-agnostic information retrieval in hyperspectral images based on Positive-Unlabelled (PU) classification. Given a hyperspectral scene, the user labels some positive samples of a material he/she is looking for and our goal is to retrieve all the remaining instances of the query material in the scene. Additionally, we…

Tables

Table 1. Data sets used

	Indian Pines	Salinas	U. Pavia
Sensor	AVIRIS	AVIRIS	ROSIS
Place	Northwestern Indiana	Salinas Valley, California	Pavia, Northern Italy
Frequency Band	$0.4$ - $0.45\,\mu m$	$0.4$ - $0.45\,\mu m$	$0.43$ - $0.86\,\mu m$
Spatial Resolution	$20\,m$	$20\,m$	$1.3\,m$
No. of Channels	220	224	103
No. of Classes	16	16	9

Table 2. Temperature and baseline settings

	Indian Pines	Salinas	U. Pavia
Temperature ( $T$ )	$24$	$22$	$14$
Baseline ( $b$ )	$32$	$26$	$26$

Table 3. PU classification results

Dataset	Metric	Retrieval Model
		Uniform Sampling		Blob Sampling
		NNRE-PU	PN	NNRE-PU	PN
Indian Pines	AUC	$0.90$	$0.85$	$0.83$	$0.67$
	Precision	$83.0\%$	$48\%$	$46\%$	$48\%$
	Recall	$77.1\%$	$83.1\%$	$81.1\%$	$42\%$
	F-score	$0.80$	$0.61$	$0.59$	$0.45$
Salinas	AUC	$0.98$	$0.99$	$0.99$	$0.91$
	Precision	$99.8\%$	$99.7\%$	$99.9\%$	$99.9\%$
	Recall	$98.3\%$	$99.9\%$	$97.7\%$	$82\%$
	F-score	$0.99$	$0.99$	$0.98$	$0.89$
U. Pavia	AUC	$0.90$	$0.95$	$0.91$	$0.93$
	Precision	$91\%$	$96.4\%$	$80.2\%$	$60\%$
	Recall	$88.6\%$	$76\%$	$86\%$	$91\%$
	F-score	$0.89$	$0.85$	$0.83$	$0.72$

Equations

(1) $R_{pn}(g)=\pi_{p}R_{p}^{+}(g)+\pi_{n}R_{n}^{-}(g)$

(2) $R_{pu}(g)=\pi_{p}R_{p}^{+}(g)-\pi_{p}R_{p}^{-}(g)+R_{u}^{-}(g)$

(3) $\hat{R}_{nn\text{-}PU}(g)=\pi_{p}\hat{R}_{p}^{+}(g)+\max\left(0,\;\hat{R}_{u}^{-}(g)-\pi_{p}\hat{R}_{p}^{-}(g)\right)$

(4) $\frac{\partial u}{\partial y}(x):=\frac{u(y)-u(x)}{d(x,y)},\quad\forall x,y\in\Omega$

(5) $\nabla_{w}u(x)(y)=\frac{\partial u}{\partial y}(x):=\sqrt{w(x,y)}\,(u(y)-u(x))$

(6) $\min_{u}E(u)=\|\nabla_{w}u\|_{L^{1}}+\lambda S(u)$

(7) $L_{cross\text{-}entropy}=-\sum_{c=1}^{M}y_{o,c}\log p_{o,c}$

(8) $Pr_{+}(\mathbf{X}|\mathbf{X}\in\mathcal{I}_{U})=\frac{1}{1+\exp\left(\frac{d_{+}(\mathbf{X})-b}{T}\right)}$

(9) $Pr_{+}(\mathbf{X}|\mathbf{X}\in\mathcal{I}_{U})=\frac{m+\epsilon}{n}$

(10) $Pr_{+}(\mathbf{X}|\mathbf{X}\in\mathcal{I}_{U})=\left(\frac{m+\epsilon}{n}\right)\times\left(\frac{1}{1+\exp\left(\frac{d_{+}(\mathbf{X})-b}{T}\right)}\right)$

(11) $f(x)=\frac{x-\min(x)}{\max(x)-\min(x)}$

(12) $C=(1-\pi_{p})\alpha x+\pi_{p}\beta(1-y)$

Code & Models

Repositories

HSISeg/HSISeg
none

Taxonomy

Topics: Remote-Sensing Image Classification · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Full text

PUNCH: Positive UNlabelled Classification based information retrieval in Hyperspectral images

Anirban Santara (ORCID: 0002-1571-3885), Indian Institute of Technology, Kharagpur, Kharagpur, WB, India 721302

Jayeeta Datta, University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA 19104-6309

Sourav Sarkar, Columbia University, 500 W 120th St, New York, NY 10027

Ankur Garg, Space Applications Centre, Indian Space Research Organization (ISRO), Ahmedabad, GJ 380015, India

Kirti Padia, Space Applications Centre, Indian Space Research Organization (ISRO), Ahmedabad, GJ 380015, India

Pabitra Mitra, Indian Institute of Technology, Kharagpur, Kharagpur, WB, India 721302

(2019)

Abstract.

Hyperspectral images of land-cover captured by airborne or satellite-mounted sensors provide a rich source of information about the chemical composition of the materials present in a given place. This makes hyperspectral imaging an important tool for earth sciences, land-cover studies, and military and strategic applications. However, the scarcity of labeled training examples and the spatial variability of spectral signatures are two of the biggest challenges faced by hyperspectral image classification. To address these issues, we aim to develop a framework for material-agnostic information retrieval in hyperspectral images based on Positive-Unlabelled (PU) classification. Given a hyperspectral scene, the user labels some positive samples of the material he/she is looking for, and our goal is to retrieve all the remaining instances of the query material in the scene. Additionally, we require the system to work equally well for any material in any scene without the user having to disclose the identity of the query material. This material-agnostic nature of the framework provides it with superior generalization abilities. We explore two alternative approaches to solve the hyperspectral image classification problem within this framework. The first approach is an adaptation of non-negative risk estimation based PU learning for hyperspectral data. The second approach is based on one-versus-all positive-negative classification where the negative class is approximately sampled using a novel spectral-spatial retrieval model. We propose two annotator models, uniform and blob, that represent the labelling patterns of a human annotator. We compare the performances of the proposed algorithms for each annotator model on three benchmark hyperspectral image datasets: Indian Pines, Pavia University and Salinas.

Information Retrieval, Positive Unlabelled Classification, Convolutional Neural Network (CNN), Deep Learning, Hyperspectral Imagery, Landcover Classification

CCS Concepts: Computing methodologies → Hyperspectral imaging; Information systems → Probabilistic retrieval models; Computing methodologies → Image representations; Computing methodologies → Cost-sensitive learning; Computing methodologies → Neural networks

1. Introduction

Hyperspectral imaging (HSI) (Landgrebe, 2002; Richards, 2013) measures reflected radiation from a surface at a series of narrow, contiguous frequency bands. It differs from multi-spectral imaging which senses a few wide, separated frequency bands. Hyperspectral imaging produces three-dimensional $(x,y,\lambda)$ data volumes, where $x$ and $y$ represent the spatial dimensions and $\lambda$ represents the spectral dimension. Such detailed spectra contain fine-grained information about the chemical composition of the materials in a scene that is richer than is available from a multi-spectral image (Geography, 2018).

The majority of the existing literature on hyperspectral image classification frames it as a supervised (Camps-Valls et al., 2013; Camps-Valls and Bruzzone, 2005; F.Melgani and B.Lorenzo, 2004; Li et al., 2016; Santara et al., 2017) or semi-supervised (Camps-Valls et al., 2007; Buchel and Ersoy, 2018; Cui et al., 2018) multi-class classification problem. Hyperspectral images pose a unique set of challenges for multi-class classification. Each material with a distinct spectral identity is a class, so the number of possible classes is countably infinite. For land-cover, the same species of crop (for example, wheat) can have drastically different spectral signatures depending on the location where it is grown (due to differences in the chemical composition of soil and water) and the time of the year (temperature, humidity, rainfall, etc.) (Herold et al., 2004; Rao et al., 2007). Hence, creating a standardized library of spectral signatures of materials that accounts for all these factors of variability is hard. Moreover, multi-class classification requires ground-truth labels for each class. Collecting ground truth amounts to sourcing samples of a material from the exact location and exact time of the year and recording their spectral signatures (Shepherd and Walsh, [n. d.]), a process that is impractical at scale. Also, different hyperspectral imaging systems produce images with different physical properties depending on the spectral response of the sensor, resolution, altitude, illumination, mode of capture (airborne vs. spaceborne), distortions and so on. As a result, multi-class classification models trained on open-source (but relatively old) benchmark datasets like Indian Pines and Pavia have extremely limited efficacy in large-scale, real-life deployment.

In this work, we formulate the hyperspectral classification problem as one that is material-agnostic, independent of the imaging system and not contingent on an extensive ground-truth labelling effort. The motivation is deployment at scale. Given a hyperspectral scene, the user marks up some known occurrences of the query material. No information is provided about pixels that do not contain the query material. The goal of the system is to locate all other occurrences of the same material in the scene with high precision and recall. The system should work for any target material with a distinct spectral signature, and it should not require the user to disclose the identity of the material being searched.

Our formulation builds upon the classical problem of Content-Based Information Retrieval (CBIR) in multi-media databases (Yoshitaka and Ichikawa, 1999; Smeulders et al., 2000; Liu et al., 2007). Given a query item, the task is to retrieve items from a database that are similar in content. At the heart of CBIR lies the task of designing a retrieval model. A retrieval model is a function that returns a score estimating the similarity of an element of the database to the query element. These scores can be used to find the most relevant elements for output. With the advent of deep learning, CBIR has witnessed unprecedented success in domains ranging from images (Krizhevsky et al., 2012; Wan et al., 2014; Lin et al., 2015) to text (Mitra et al., 2017), audio (Van den Oord et al., 2013) and multi-modal information retrieval (Kiros et al., 2014; Wang et al., 2016).

We approach the problem from a Positive-Unlabelled (PU) classification (Denis et al., 2000; Zhang and Zuo, 2008; Elkan and Noto, 2008; Hou et al., 2018) perspective. PU classification algorithms are specialized to deal with the setting where the training data comprises positive samples labelled by the user and unlabelled samples that may belong to either the positive or the negative class. There are two main approaches to PU classification. The first approach is based on heuristic-driven intelligent sampling of the negative class followed by supervised training of a binary Positive-Negative (PN) classifier with the labelled positive and sampled negative examples (Nigam et al., 1998). The second approach is based on non-negative risk estimation, in which the unlabelled data is treated as negative data with lesser weights (Lee and Liu, 2003). We explore both categories of algorithms in this work and present a comparison of performance results. Deep Neural Networks (Schmidhuber, 2015; Goodfellow et al., 2016) have demonstrated extraordinary capability to efficiently model complex non-linear functions in a large variety of applications including HSI classification (Santara et al., 2017; Zhu et al., 2017b). This has motivated us to use Deep Neural Networks with the state-of-the-art BASS-Net architecture of Santara et al. (Santara et al., 2017) as function approximators in all our experiments.

Our contributions in this paper can be summarized as follows:

• We present a PU learning based formulation of the HSI classification problem for material and imaging-platform agnostic large-scale information retrieval. To the best of our knowledge, this is the first investigation of PU learning for HSI data.

• We design one solution each from the two families of PU learning algorithms (non-negative risk estimation and PN classification) and compare their performances on three benchmark HSI datasets.

• We propose a novel spectral-spatial retrieval model for HSI data and use it for intelligent sampling of the negative class for PN classification.

• We propose two annotator models that represent the range of labelling patterns of a human annotator and use them to demonstrate the efficacy of our proposed solutions under different spatial distributions of the labelled positive class.

Section 2 introduces the essential theoretical concepts that we build upon in this paper. Section 3 gives a detailed description of the proposed framework and approaches to solution. Experimental results are presented in Section 4. Finally, Section 5 concludes the paper with a summary of our contributions and scope of future work.

2. Background

In this section, we present a brief introduction to the essential theoretical concepts used in the rest of the paper.

2.1. Non-negative Risk Estimation based PU Learning (NNRE-PU)

Risk estimation based PU learning represents unlabelled data as a weighted combination of P and N data. Following the notation of Kiryo et al. (Kiryo et al., 2017), let $g:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be an arbitrary decision function and $l:\mathbb{R}\times\{-1,1\}\rightarrow\mathbb{R}$ be the loss function such that $l(t,y)$ is the loss incurred on predicting $t$ when the ground truth is $y$ . Let $p(\mathbf{X},y)$ denote the joint probability distribution of image-pixels and their labels. The marginal distribution $p(\mathbf{X})$ is where the unlabelled data is sampled from. Let $p_{p}(\mathbf{X})$ and $p_{n}(\mathbf{X})$ denote the class-conditionals and $\pi_{p}$ and $\pi_{n}$ , the prior probabilities of the positive and negative classes respectively. The risk of the decision function $g$ can be written as:

$$R_{pn}(g)=\pi_{p}R_{p}^{+}(g)+\pi_{n}R_{n}^{-}(g) \tag{1}$$

where $R_{p}^{+}(g)=\mathbb{E}_{\mathbf{X}\sim p_{p}}[l(g(\mathbf{X}),+1)]$ and $R_{n}^{-}(g)=\mathbb{E}_{\mathbf{X}\sim p_{n}}[l(g(\mathbf{X}),-1)]$ . Rewriting the law of total probability as $\pi_{n}p_{n}(\mathbf{X})=p(\mathbf{X})-\pi_{p}p_{p}(\mathbf{X})$ and substituting in equation 1, we have the expression for unbiased PU loss:

$$R_{pu}(g)=\pi_{p}R_{p}^{+}(g)-\pi_{p}R_{p}^{-}(g)+R_{u}^{-}(g) \tag{2}$$

where $R_{p}^{-}(g)=\mathbb{E}_{\mathbf{X}\sim p_{p}}[l(g(\mathbf{X}),-1)]$ and $R_{u}^{-}(g)=\mathbb{E}_{\mathbf{X}\sim p}[l(g(\mathbf{X}),-1)]$ . In unbiased PU learning (Elkan and Noto, 2008; du Plessis et al., 2014; Du Plessis et al., 2015), the goal is to minimize an empirical estimate of this risk $R_{pu}(g)$ (with the expectations replaced by sample averages) to find the optimal decision function $g$ . Unfortunately, the empirical estimators of risk used in unbiased PU learning have no lower bound although the original risk objective in equation 1 is non-negative (Kiryo et al., 2017). Minimization of the empirical risk tends to drive the objective negative without modeling anything meaningful especially when high capacity function approximators like deep neural networks are used to model $g$ . Kiryo et al. (Kiryo et al., 2017) propose the following biased, yet optimal, non-negative risk estimator to address this problem:

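$$\hat{R}_{nn\text{-}PU}(g)=\pi_{p}\hat{R}_{p}^{+}(g)+\max\left(0,\;\hat{R}_{u}^{-}(g)-\pi_{p}\hat{R}_{p}^{-}(g)\right) \tag{3}$$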

where $\hat{R}$ denotes an empirical estimate of the actual risk, $R$ . For ease of training a neural network classifier, with no loss in theoretical correctness, we represent the negative class by $0$ instead of $-1$ in our experiments.
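The non-negative risk estimator can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `sigmoid_loss` and `nnpu_risk` are hypothetical, and the sigmoid surrogate loss with labels in $\{-1,+1\}$ is an assumption made only to keep the example self-contained.

```python
import math


def sigmoid_loss(t, y):
    """Surrogate loss l(t, y) = 1 / (1 + exp(y * t)) for y in {-1, +1} (assumed choice)."""
    return 1.0 / (1.0 + math.exp(y * t))


def mean(values):
    return sum(values) / len(values)


def nnpu_risk(g_pos, g_unl, prior):
    """Non-negative PU risk estimate.

    g_pos: decision values g(X) on labelled positive samples
    g_unl: decision values g(X) on unlabelled samples
    prior: class prior pi_p of the positive class
    """
    r_p_plus = mean([sigmoid_loss(t, +1) for t in g_pos])   # empirical R_p^+
    r_p_minus = mean([sigmoid_loss(t, -1) for t in g_pos])  # empirical R_p^-
    r_u_minus = mean([sigmoid_loss(t, -1) for t in g_unl])  # empirical R_u^-
    # The unbiased estimate of the negative-class risk can go below zero;
    # clipping it at zero is what makes the estimator non-negative.
    negative_part = r_u_minus - prior * r_p_minus
    return prior * r_p_plus + max(0.0, negative_part)
```

When the classifier separates the two sample sets very confidently, the unclipped negative part becomes negative and the `max` keeps the total risk from dropping below the positive-class term.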

2.2. Non-Local Total Variation

Non-Local Total Variation (NLTV) is an unsupervised clustering objective demonstrated on hyperspectral data by Zhu et al. (Zhu et al., 2017a). Following the notation used by the authors, let $\Omega\subset\mathcal{I}$ be a region in a hyperspectral scene. Let $L^{2}(\Omega)$ be a Hilbert space. Let $u:\Omega\rightarrow\mathbb{R},\;u\in L^{2}(\Omega)$ , be the labelling function of a cluster such that the larger the value of $u(\mathbf{X})$ , the higher the likelihood of a pixel $\mathbf{X}$ belonging to that cluster. Let $d:\mathcal{I}\times\mathcal{I}\rightarrow\mathbb{R}$ be a measure of divergence between two given pixels such that a lower value of $d$ implies more resemblance. The non-local derivative is defined as:

$$\frac{\partial u}{\partial y}(x):=\frac{u(y)-u(x)}{d(x,y)},\quad\forall x,y\in\Omega \tag{4}$$

Non-local weight is defined as $w(x,y)=d^{-2}(x,y)$ . The expression for non-local derivative in equation 4 can be rewritten in terms of non-local weight as:

$$\nabla_{w}u(x)(y)=\frac{\partial u}{\partial y}(x):=\sqrt{w(x,y)}\,(u(y)-u(x)) \tag{5}$$

The Non-Local Total Variation (NLTV) objective is given by:

$$\min_{u}E(u)=\|\nabla_{w}u\|_{L^{1}}+\lambda S(u) \tag{6}$$

$S(u)$ is a data fidelity term representing the clustering objective and $||\nabla_{w}u||_{L^{1}}$ is the Total Variation regularizer. The parameter $\lambda$ controls the amount of regularization. The authors of (Zhu et al., 2017a) present a linear and a quadratic model of this objective, depending upon the design of $S(u)$ . They also apply the Primal Dual Hybrid Gradient (PDHG) algorithm (Chambolle and Pock, 2011) for minimization of these objectives and show encouraging results on hyperspectral image data. We use the quadratic model of the NLTV objective in our experiments.

2.3. BASS-Net architecture

Band-Adaptive Spectral-Spatial feature learning Network (BASS-Net) (Santara et al., 2017) is a deep neural network architecture for end-to-end supervised classification of Hyperspectral Images. Hyperspectral image classification poses two unique challenges: a) curse of dimensionality resulting from large number of spectral dimensions and scarcity of labelled training samples, and, b) large spatial variability of spectral signature of materials. The BASS-Net architecture is extremely data efficient thanks to extensive parameter sharing along the spectral dimension and is capable of learning highly non-linear functions from a relatively small number of labelled training examples. Also, it uses spatial context to account for spatial variability of spectral signatures. BASS-Net shows state-of-the-art supervised classification accuracy on benchmark HSI datasets like Indian Pines, Salinas and University of Pavia. Figure 2 shows a schematic diagram of the BASS-Net architecture.

3. PU Classification of Hyperspectral Images

In this section we describe the proposed PU learning algorithms for HSI classification. As mentioned in the previous section, we use the BASS-Net architecture of Santara et al. (Santara et al., 2017) as function approximators, whenever required, in our pipeline. The input to the network is a pixel $\mathbf{X}$ from the image $\mathcal{I}$ with its $p\times p$ neighborhood (for spatial context) in the form of a $p\times p\times N_{c}$ volume, where $N_{c}$ is the number of channels in the input image. The output is the predicted class label $\hat{y}\in\{-1,1\}$ for $\mathbf{X}$ . The specific configuration of BASS-Net that we use in our experiments is Configuration 4 of Table 1 of Santara et al. (Santara et al., 2017). As BASS-Net was originally made for multi-class classification, it used a softmax layer at the output and optimized a multi-class categorical cross entropy loss defined as follows.

$$L_{cross\text{-}entropy}=-\sum_{c=1}^{M}y_{o,c}\log p_{o,c} \tag{7}$$

where $M$ is the total number of classes, $y_{o,c}$ is a binary indicator function which returns $1$ when class $c$ is the correct classification of observation $o$ and $0$ otherwise, and $p_{o,c}$ denotes the predicted probability of observation $o$ belonging to class $c$ . In PU learning, we work with binary classifiers. Hence we replace the softmax layer with a sigmoid layer and use the binary cross entropy loss function (equation 7 with $M=2$ ) for training.
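As a worked special case, the two-class form of the loss can be written out explicitly. The shorthand $y$ and $p$ is introduced here only for illustration: $y\in\{0,1\}$ is the indicator of the positive class for an observation and $p$ is the sigmoid output, so the two class probabilities are $p$ and $1-p$ . Taking the cross entropy with its usual leading minus sign, setting $M=2$ gives

$$L_{BCE}=-\left[\,y\log p+(1-y)\log(1-p)\,\right]$$

Only one of the two terms is active for any given observation: a labelled positive contributes $-\log p$ and a negative contributes $-\log(1-p)$ .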

Let $\mathcal{I}_{L+}$ denote the set of labeled positive data points and $\mathcal{I}_{U}$ be the unlabeled data points such that $\mathcal{I}=\mathcal{I}_{L+}\cup\mathcal{I}_{U};\;\mathcal{I}_{L+}\cap\mathcal{I}_{U}=\emptyset$ . Let $\pi_{p}$ denote the prior probability of the positive class in the entire image.

In the first set of experiments, we implement the NNRE-PU learning algorithm of Kiryo et al. (Kiryo et al., 2017) (Algorithm 1). As the true value of $\pi_{p}$ is unknown for an arbitrary HSI scene, the user (who is expected to have some domain knowledge) has to make an estimate of $\pi_{p}$ from visual inspection of the image. We study how the performance of the classifier varies with perturbations to the true value of $\pi_{p}$ .

In our second set of experiments, we evaluate PN classification based PU learning (PN-PU). Algorithm 2 describes the workflow. A novel spectral-spatial retrieval model, described in Section 3.2, is used to model the conditional positive class probability $Pr_{+}(\mathbf{X}|\mathbf{X}\in\mathcal{I}_{U})$ of an unlabelled pixel $\mathbf{X}$ . Negative samples are drawn from $Pr_{-}(\mathbf{X}|\mathbf{X}\in\mathcal{I}_{U})=1-Pr_{+}(\mathbf{X}|\mathbf{X}\in\mathcal{I}_{U})$ . A PN classifier having the BASS-Net architecture is then trained on the labelled positive and sampled negative examples.

3.1. Heuristic-based Probability Estimates

We explore two heuristics for modeling $Pr_{+}(\mathbf{X}|\mathbf{X}\in\mathcal{I}_{U})$ , the conditional positive class probability of an unlabelled pixel $\mathbf{X}$ .

3.1.1. Spatial distance based

According to this heuristic, the conditional probability of an unlabelled pixel belonging to the positive class decreases with its Euclidean distance from the nearest labelled positive sample. Let $d_{+}(\mathbf{X})=\min_{\mathbf{X}'\in\mathcal{I}_{L+}}||\mathbf{X}-\mathbf{X}'||_{2}$ be the Euclidean distance of $\mathbf{X}$ from the nearest labelled positive pixel. Then we have,

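$$Pr_{+}(\mathbf{X}|\mathbf{X}\in\mathcal{I}_{U})=\frac{1}{1+\exp\left(\frac{d_{+}(\mathbf{X})-b}{T}\right)} \tag{8}$$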

where baseline $b$ and temperature $T$ are hyperparameters. This heuristic draws from the intuition of spatial continuity of a material.
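The spatial heuristic is a logistic function of the distance to the nearest labelled positive pixel, which can be sketched as follows. The function name `spatial_positive_prob` and the representation of pixels as `(row, col)` tuples are assumptions made for this illustration and are not taken from the released code.

```python
import math


def spatial_positive_prob(pixel, labelled_positives, baseline, temperature):
    """Spatial estimate of Pr_+ for one unlabelled pixel.

    pixel: (row, col) position of the unlabelled pixel
    labelled_positives: iterable of (row, col) positions in the labelled positive set
    baseline, temperature: the hyperparameters b and T
    """
    # d_+ is the Euclidean distance to the nearest labelled positive pixel.
    d_plus = min(math.dist(pixel, q) for q in labelled_positives)
    return 1.0 / (1.0 + math.exp((d_plus - baseline) / temperature))
```

A pixel exactly $b$ pixels away from the nearest labelled positive gets probability $0.5$ ; the temperature $T$ controls how quickly the probability falls off beyond that distance.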

The primary drawbacks of this heuristic are a) it assumes that the user labels positive pixels uniformly over all occurrences of the positive class in the scene, and b) it does not use any notion of spectral similarity of pixels. Imagine a scene in which the positive class occurs in several disconnected locations of the scene, far away from one another. If the user only labels pixels in one of these locations, the positive pixels from the other locations would have a high chance of being wrongly sampled as negative class by this heuristic – thus affecting the sensitivity of the classifier.

3.1.2. Spectral similarity based

We use an unsupervised segmentation algorithm (PDHG (Zhu et al., 2017a), in our experiments), to cluster the hyperspectral scene into a set of $C$ clusters based on spectral similarity. Suppose an unlabelled sample belongs to a cluster of size $n$ . If $m$ of the samples from its cluster were labelled positive by the user, then, the probability of the unlabelled pixel belonging to the positive class is given by:

$$Pr_{+}(\mathbf{X}|\mathbf{X}\in\mathcal{I}_{U})=\frac{m+\epsilon}{n} \tag{9}$$

where $\epsilon$ is a small positive number ( $10^{-4}$ in our experiments). This way, we sample more pixels of the negative class from regions of the scene that differ significantly in spectral characteristics from the labelled positive class. The main drawback of this heuristic stems from its assumption that each cluster containing positive pixels is likely to contain some user-labelled pixels. This is possible only under two conditions: a) the spectral uniformity of the positive class is high enough for the unsupervised segmentation algorithm to include all the positive pixels in the same cluster, or b) the user labels the positive pixels uniformly over all the different regions of the HSI scene in which the positive class occurs. Additionally, under this model, every pixel in a cluster gets assigned the same probability of being sampled for the negative class regardless of its spatial proximity to the labelled positive samples. This can directly affect the specificity of the classifier.
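The per-cluster estimate can be sketched as below, assuming the cluster assignment has already been produced by some unsupervised segmentation step. The function name `spectral_positive_prob` and the dictionary-based inputs are hypothetical choices for this illustration.

```python
from collections import Counter


def spectral_positive_prob(cluster_of, labelled_positives, eps=1e-4):
    """Per-cluster estimate (m + eps) / n of the positive class probability.

    cluster_of: dict mapping every pixel id in the scene to its cluster id
    labelled_positives: pixel ids labelled positive by the user
    Returns a dict mapping cluster id -> probability shared by its unlabelled pixels.
    """
    cluster_size = Counter(cluster_of.values())                        # n per cluster
    labelled_count = Counter(cluster_of[p] for p in labelled_positives)  # m per cluster
    # Counter returns 0 for clusters without any labelled pixel, so those
    # clusters receive the small floor value eps / n.
    return {c: (labelled_count[c] + eps) / n for c, n in cluster_size.items()}
```

Clusters with no labelled positives end up with a probability close to zero, which is why their pixels dominate the negative samples.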

3.2. Spectral-spatial Retrieval Model

The spectral and spatial heuristics make certain assumptions about material distribution and the behavior of the annotator. Although these assumptions seldom hold completely, they do hold true to a certain degree in natural HSI scenes. Our goal is to design a retrieval model that outputs a lower-bound on the estimate of $Pr_{+}(\mathbf{X}|\mathbf{X}\in\mathcal{I}_{U})$ whenever one or more of these assumptions are violated in an image. A multiplicative combination of the spatial (equation 8) and spectral (equation 9) factors described in Section 3.1 achieves this goal and compensates for the drawbacks of the individual heuristics. The conditional probability of an unlabelled pixel $\mathbf{X}$ belonging to the positive class under this retrieval model is given by:

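$$Pr_{+}(\mathbf{X}|\mathbf{X}\in\mathcal{I}_{U})=\left(\frac{m+\epsilon}{n}\right)\times\left(\frac{1}{1+\exp\left(\frac{d_{+}(\mathbf{X})-b}{T}\right)}\right) \tag{10}$$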

We evaluate our retrieval model on three benchmark HSI datasets and two annotation models that simulate the behavior of a human annotator.
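A minimal sketch of how the combined score could drive negative sampling is given below. The names `retrieval_score` and `sample_negatives` are hypothetical, and drawing without replacement with weights proportional to $1-Pr_{+}$ is an assumption; the text only states that negatives are drawn from $Pr_{-}=1-Pr_{+}$ .

```python
import random


def retrieval_score(spatial_prob, spectral_prob):
    """Multiplicative combination of the spatial and spectral factors."""
    return spatial_prob * spectral_prob


def sample_negatives(pos_prob, k, seed=0):
    """Draw up to k distinct unlabelled pixels, weighted by 1 - Pr_+.

    pos_prob: dict mapping unlabelled pixel id -> Pr_+ from the retrieval model
    """
    rng = random.Random(seed)
    candidates = list(pos_prob)
    weights = [1.0 - pos_prob[p] for p in candidates]
    chosen = []
    # Sequential weighted sampling without replacement (assumed scheme).
    while len(chosen) < k and candidates and sum(weights) > 0.0:
        idx = rng.choices(range(len(candidates)), weights=weights)[0]
        chosen.append(candidates.pop(idx))
        weights.pop(idx)
    return chosen
```

Because both factors lie in $[0,1]$ , the product never exceeds the smaller of the two, so a pixel is treated as a likely positive only when the spatial and spectral evidence agree.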

4. Experimental Results

In this section we compare the performances of the methods proposed in Section 3.

4.1. Data Sets

We perform our experiments on three popular hyperspectral image classification data sets: Indian Pines (Baumgardner et al., 2015), Salinas, and Pavia University scene (http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes). Some classes in the Indian Pines data set have very few samples. We reject those classes and select the top $9$ classes by population for experimentation. The problem of insufficient samples is less severe for Salinas and U. Pavia, so all of their classes are taken into account. We choose Corn-notill (class $2$ ), Stubble (class $6$ ) and Asphalt (class $1$ ) as positive classes for the Indian Pines, Salinas and U. Pavia datasets respectively. These materials appear in multiple disconnected patches in their corresponding scenes, which tests the ability of our methods to model the dramatic spatial variability of spectral signature in HSI scenes. Additionally, these materials form mid-sized classes in their corresponding datasets, so the metrics we report for them are computed over enough pixels to be statistically meaningful. We sample $10\%$ of the pixels of the positive class using one of the annotation models described in Section 4.2 to construct the labelled positive set, $\mathcal{I}_{L+}$ . The rest of the pixels constitute the set of unlabelled samples, $\mathcal{I}_{U}$ . For the NNRE-PU experiments we sample $5000$ points uniformly at random from $\mathcal{I}_{U}$ for use in training. In the PN classification experiments, as many negative samples as there are labelled positives are drawn from $Pr_{-}(\mathbf{X}|\mathbf{X}\in\mathcal{I}_{U})=1-Pr_{+}(\mathbf{X}|\mathbf{X}\in\mathcal{I}_{U})$ where $Pr_{+}(\mathbf{X}|\mathbf{X}\in\mathcal{I}_{U})$ is given by equation 10.
As different frequency channels have different dynamic ranges, their values are normalized to the range $[0,1]$ using the transformation $f(\cdot)$ defined in equation 11, where $x$ denotes the random variable corresponding to the pixel values of a given channel.

$$f(x)=\frac{x-\min(x)}{\max(x)-\min(x)} \tag{11}$$
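The per-channel normalization amounts to min-max scaling, sketched here for a single channel given as a flat list of pixel values. The function name `min_max_normalize` is hypothetical, and mapping a constant channel to all zeros is an assumption added to avoid division by zero.

```python
def min_max_normalize(channel):
    """Scale the pixel values of one spectral channel to the range [0, 1]."""
    lo, hi = min(channel), max(channel)
    if hi == lo:
        # Constant channel: the formula is undefined, so return zeros (assumed handling).
        return [0.0 for _ in channel]
    return [(v - lo) / (hi - lo) for v in channel]
```

Applying the function to each channel independently removes the differences in dynamic range between channels.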

4.2. Annotation Models

We explore two annotation models for constructing the labelled positive set, $\mathcal{I}_{L+}$ : a) uniform, and b) blob. Imagine that the positive class forms $N$ connected components in an HSI scene, where two pixels are considered connected if and only if they are adjacent to each other. The uniform annotation model samples $\mathcal{I}_{L+}$ uniformly from all the connected components. It models the case in which the user labels positive samples uniformly across all instances of spatial occurrence of the positive class. The blob annotation model, on the other hand, models the more practical case in which the user labels a small blob of positive pixels in one location of the image. We implement the blob annotation model by starting at a random positive sample, adding it to $\mathcal{I}_{L+}$ and expanding $\mathcal{I}_{L+}$ by searching for and adding the adjoining positive pixels in a breadth-first fashion. The sampler never leaves a connected component of positive pixels until all of its pixels have been included in $\mathcal{I}_{L+}$ . After that, it shifts to a random positive pixel in a different connected component and repeats the process until the requisite number of positive samples has been drawn.
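The blob annotation procedure described above can be sketched as a breadth-first traversal. This is an illustrative version only: the function name `blob_annotate` is hypothetical, and 4-connectivity is an assumption, since the text does not say whether diagonal neighbours count as adjacent.

```python
import random
from collections import deque


def blob_annotate(positive_pixels, n_labels, seed=0):
    """Pick up to n_labels positive pixels by growing blobs breadth-first.

    positive_pixels: set of (row, col) positions of the positive class
    """
    rng = random.Random(seed)
    remaining = set(positive_pixels)
    labelled = []
    while remaining and len(labelled) < n_labels:
        # Start (or restart) at a random positive pixel not visited so far.
        start = rng.choice(sorted(remaining))
        remaining.discard(start)
        queue = deque([start])
        # Exhaust the current connected component before moving elsewhere.
        while queue and len(labelled) < n_labels:
            r, c = queue.popleft()
            labelled.append((r, c))
            for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if nb in remaining:  # 4-connectivity (assumed)
                    remaining.discard(nb)
                    queue.append(nb)
    return labelled
```

With a small label budget, all labelled pixels come from a single component, which is the localized labelling pattern the blob model is meant to simulate.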

4.3. Evaluation Metrics

We evaluate our PU learning algorithms in terms of precision, recall, F-score, and area under the receiver operating characteristics curve (AUC) (Powers, 2011).
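For reference, the threshold-dependent metrics can be computed from confusion-matrix counts as in the sketch below. The function names are hypothetical, and returning zero when a denominator vanishes is an assumption.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

As a consistency check, the Indian Pines entries for NNRE-PU with uniform sampling in Table 3 (precision $83.0\%$ , recall $77.1\%$ ) give `f_score(0.83, 0.771)` of approximately $0.80$ , matching the tabulated F-score.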

4.4. Operating Point and Hyperparameter Selection

We choose the operating point of our classifiers to minimize the expected cost of mis-classification of a point in the Receiver Operating Characteristic (ROC) space (Langdon, 2011) given by:

$$C=(1-\pi_{p})\alpha x+\pi_{p}\beta(1-y) \tag{12}$$

Here $x$ and $y$ are the coordinates of the ROC space (false positive rate and true positive rate), and $\alpha$ and $\beta$ are the costs of a false positive and a false negative respectively. We assume $\alpha=\beta$ . The expectation is computed on a validation set consisting of $7\%$ of the unlabelled samples in the dataset. The resulting operating point is the point on the ROC curve that lies on a line of slope $\frac{1-\pi_{p}}{\pi_{p}}$ closest to the north-west corner, $(0,1)$ , of the ROC plot. The temperature $T$ and baseline $b$ hyperparameters are also tuned on the validation set through a grid search and presented in Table 2. The number of clusters $C$ for the PDHG algorithm is set to the number of classes in the respective datasets given in Table 1. All neural networks are trained for $100$ epochs or until over-fitting sets in (validation loss starts increasing), whichever happens first.
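Operating point selection then reduces to evaluating the expected cost at every point of the empirical ROC curve and keeping the minimizer. The sketch below assumes the curve is available as a list of `(fpr, tpr, threshold)` triples; the function name `pick_operating_point` and that input format are hypothetical.

```python
def pick_operating_point(roc_points, prior, alpha=1.0, beta=1.0):
    """Return the (fpr, tpr, threshold) triple with the lowest expected cost.

    roc_points: iterable of (fpr, tpr, threshold) triples from a validation set
    prior: class prior pi_p of the positive class
    alpha, beta: costs of a false positive and a false negative
    """
    def expected_cost(point):
        fpr, tpr, _ = point
        return (1.0 - prior) * alpha * fpr + prior * beta * (1.0 - tpr)

    return min(roc_points, key=expected_cost)
```

With equal costs, points of equal expected cost lie on lines of slope $\frac{1-\pi_{p}}{\pi_{p}}$ in ROC space, so the minimizer is the curve point touched by the line of that slope nearest to $(0,1)$ .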

4.5. Implementation Platform

The algorithms have been implemented in Python using the Chainer deep learning library (Tokui et al., 2015) for highly optimized training of neural networks. As every execution of the proposed algorithms involves training a neural network, the computational demand is high. Chainer has native support for multi-core CPU and multi-GPU parallelism, which makes it a natural choice for our application. As Chainer only supports input images with an even number of channels, we append a new channel of all zeros to the Pavia University HSI scene. Our code is available open-source on GitHub (https://github.com/HSISeg).

4.6. Results and Discussion

Table 3 presents the numerical outcomes of our experiments. The NNRE-PU numbers correspond to the true values of $\pi_{p}$ for each dataset. Table 4 shows the visual outputs of some sample runs of the proposed algorithms. We make the following observations:

• The performance of NNRE-PU is highly sensitive to the value of $\pi_{p}$ supplied by the user. Figure 3 shows the variation of precision and recall with the supplied value of $\pi_{p}$ for the Indian Pines dataset with uniform sampling of the positive class. This is a major drawback of the NNRE-PU approach because it is hard, even for an expert, to provide an accurate estimate of $\pi_{p}$ for the query material in an arbitrary HSI scene.

• PN-PU, with the right hyperparameter settings in the spectral-spatial retrieval model, gives performance comparable to that of NNRE-PU, although no single method is clearly superior across all the HSI datasets and annotation models. Unlike NNRE-PU, PN-PU does not depend on a user-supplied value for $\pi_{p}$.

• For PN-PU classification with blob sampling, we observe that the neural network tends to over-fit very quickly, causing the recall (and in some cases also the precision) on the validation set to drop soon after training begins. A plausible explanation is as follows. While high spatial variation of the spectral signature is a distinctive feature of hyperspectral images, the positive samples annotated by the user are localized in a small part of the image. This localized set of positive examples fails to capture the entire range of variability of the positive class in the HSI scene. This prevents the neural network from learning the right spatial invariances for the positive class, causing the false negative rate to shoot up. In addition, imperfect sampling of the negative class from $\mathcal{I}_{U}$ introduces some positive samples that are labelled negative in the training set, which further contributes to the false negative rate. To address this problem, we stop training as soon as the recall on the validation set starts to drop.
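The recall-based stopping rule from the last observation can be sketched as a framework-independent training loop. The callbacks `train_one_epoch` and `validation_recall` are hypothetical placeholders for whatever the training code provides; this is an illustration of the stopping criterion only, not the authors' Chainer code.

```python
def train_until_recall_drops(train_one_epoch, validation_recall, max_epochs=100):
    """Train epoch by epoch and stop at the first drop in validation recall.

    train_one_epoch: hypothetical callable that runs one pass over the training set.
    validation_recall: hypothetical callable returning recall on the validation set.
    Returns the list of validation recalls observed, one per completed epoch.
    """
    history = []
    for _ in range(max_epochs):
        train_one_epoch()
        recall = validation_recall()
        history.append(recall)
        # Stop as soon as recall falls below the value from the previous epoch.
        if len(history) > 1 and recall < history[-2]:
            break
    return history
```

In practice one would probably also keep a copy of the weights from the epoch before the drop, since the final epoch is by construction worse than its predecessor.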

5. Conclusion

This paper takes a novel approach to HSI classification by formulating it in the PU learning paradigm. The result is a framework that is material-, device- and platform-agnostic and can perform large-scale information retrieval in arbitrary HSI scenes. We propose two approaches to solve the HSI classification problem in this framework, and preliminary results on benchmark HSI datasets show promising performance. A notable drawback of the proposed approaches is that every execution of the algorithms requires retraining a neural network, which poses a substantial computational burden. One possible way to ameliorate this is to pre-train a neural network on a related task and retrain only the last layer for PU learning. In traditional information retrieval, iterative refinement of the retrieval model based on relevance feedback plays an important role in improving the quality of retrieval from a given dataset. We plan to explore these topics in future work.

Acknowledgements.

We are thankful to Zhu et al. (Zhu et al., 2017a) for sharing their implementation of PDHG clustering. We would also like to thank Kiryo et al. (Kiryo et al., 2017) for sharing their code for the generic PU Learning framework. This study was performed as a part of the project titled "Deep Learning for Automated Feature Discovery in Hyperspectral Images (LDH)" sponsored by Space Applications Centre (SAC), Indian Space Research Organization (ISRO). Anirban Santara's work in this project was supported by Google India under the Google India Ph.D. Fellowship Award.

Bibliography

Baumgardner et al. (2015) Marion F. Baumgardner, Larry L. Biehl, and David A. Landgrebe. 2015. 220 Band AVIRIS Hyperspectral Image Data Set: June 12, 1992 Indian Pine Test Site 3. (Sep 2015). https://doi.org/10.4231/R7RX991C
Buchel and Ersoy (2018) Julian Buchel and Okan K. Ersoy. 2018. Ladder Networks for Semi-Supervised Hyperspectral Image Classification. CoRR abs/1812.01222 (2018).
Camps-Valls and Bruzzone (2005) G. Camps-Valls and L. Bruzzone. 2005. Kernel-based methods for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 43, 6 (2005), 1351–1362.
Camps-Valls et al. (2007) Gustavo Camps-Valls, Tatyana V. Bandos Marsheva, and Dengyong Zhou. 2007. Semi-Supervised Graph-Based Hyperspectral Image Classification. IEEE Transactions on Geoscience and Remote Sensing 45 (2007), 3044–3054.
Camps-Valls et al. (2013) Gustavo Camps-Valls, Devis Tuia, Lorenzo Bruzzone, and Jón Atli Benediktsson. 2013. Advances in Hyperspectral Image Classification: Earth monitoring with statistical learning methods. arXiv:1310.5107 [cs.CV] (2013).
Chambolle and Pock (2011) Antonin Chambolle and Thomas Pock. 2011. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40, 1 (2011), 120–145.
Cui et al. (2018) Binge Cui, Xiaoyun Xie, Siyuan Hao, Jiandi Cui, and Yan Lu. 2018. Semi-Supervised Classification of Hyperspectral Images Based on Extended Label Propagation and Rolling Guidance Filtering. Remote Sensing 10 (2018), 515.

[Table of sample visual outputs; images not reproduced. Columns: Experiment Name, Training Data, Prediction on the Test Set, Test Set Confusion Map. Rows:]
• NNRE-PU on Indian Pines with uniform sampling retrieval model
• PN-PU on Indian Pines with uniform sampling retrieval model
• NNRE-PU on Salinas with blob sampling retrieval model
• PN-PU on Pavia University with blob sampling retrieval model