Variational Nonparametric Discriminant Analysis

Weichang Yu; Lamiae Azizi; John T. Ormerod

arXiv:1812.03648·stat.ME·August 28, 2019·Comput. Stat. Data Anal.

TL;DR

This paper introduces a Bayesian nonparametric discriminant analysis model that effectively performs variable selection and classification in high-dimensional data without relying on restrictive distributional assumptions.

Contribution

It proposes a novel framework using Pólya tree priors and collapsed variational Bayes inference, enabling flexible, low-cost classification with interpretable decision rules.

Findings

1. Performs well on simulated datasets
2. Compares well with current state-of-the-art methods
3. Provides interpretable decision rules

Abstract

Variable selection and classification are common objectives in the analysis of high-dimensional data. Most such methods make distributional assumptions that may not be compatible with the diverse families of distributions data can take. A novel Bayesian nonparametric discriminant analysis model that performs both variable selection and classification within a seamless framework is proposed. Pólya tree priors are assigned to the unknown group-conditional distributions to account for their uncertainty, and allow prior beliefs about the distributions to be incorporated simply as hyperparameters. The adoption of collapsed variational Bayes inference in combination with a chain of functional approximations led to an algorithm with low computational cost. The resultant decision rules carry heuristic interpretations and are related to an existing two-sample Bayesian nonparametric…

Tables

Table 1 Pólya tree construction scheme
1. Construct the dyadic tree in Figure 1 by specifying recursive partitions of the domain $B$. Note that each tree layer is a partition of $B$ and the recursive relationship between successive layers is: $B = B_{0} \cup B_{1}$, $B_{0} = B_{00} \cup B_{01}$, $B_{1} = B_{10} \cup B_{11}$ and so on.
2. Set $\Pi = \{B, B_{0}, B_{1}, B_{00}, B_{01}, \dots\}$ as the collection of partition-subsets obtained by taking the union of the partitions in step 1.
Remark: Notice that the partition-subsets are enumerated with binary representations. These representations carry information about the path down the tree which is taken to reach the subset from the start point at $B$. In particular, a ‘0’ indicates branching in the leftward direction, whereas ‘1’ denotes branching rightwards. For example, the subset $B_{100}$ denotes branching right from $B \to B_{1}$ followed by branching left from $B_{1} \to B_{10}$ and finally branching left again from $B_{10} \to B_{100}$.
3. Specify an infinite set of non-negative numbers: ${\mathcal{A}} = \{\alpha_{0}, \alpha_{1}, \alpha_{00}, \alpha_{01}, \alpha_{10}, \alpha_{11}, \dots\}$.
4. Attach random probabilities at all edges on the tree. For every binary representation $\epsilon$, draw independently $\theta_{\epsilon} \sim \text{Beta}(\alpha_{\epsilon 0}, \alpha_{\epsilon 1})$.
5. Set the probability of each partition-subset as the product of the probabilities along the path taken. For example, $p(B_{01}) = \theta (1 - \theta_{0})$.

Table 2 Iterative scheme for obtaining the parameters of the optimal densities $q({\boldsymbol{\gamma}}, {\bf y}^{(new)})$.
Require: For each $j$, initialise $\omega_{j}^{(0)}$ with a number in $(0, 1)$.
while $\|{\boldsymbol{\omega}}^{(t)} - {\boldsymbol{\omega}}^{(t-1)}\|^{2}$ is greater than tolerance $\tau$, do
At iteration $t$, compute for $j = 1, \dots, p$,
1: $\eta_{j}^{(t)} \leftarrow \ln \text{BF}_{j} + \ln\{1 + {\bf 1}^{T}{\boldsymbol{\omega}}_{1:j-1}^{(t)} + {\bf 1}^{T}{\boldsymbol{\omega}}_{j+1:p}^{(t-1)}\} - \ln\{p^{u} + p - {\bf 1}^{T}{\boldsymbol{\omega}}_{1:j-1}^{(t)} - {\bf 1}^{T}{\boldsymbol{\omega}}_{j+1:p}^{(t-1)} - 1\}$
2: $\omega_{j}^{(t)} \leftarrow \text{expit}(\eta_{j}^{(t)})$
Upon convergence of ${\boldsymbol{\omega}}$, compute for $r = 1, \dots, m$,
3: $\psi_{r} \leftarrow \text{expit}[\ln(a_{y} + n_{1}) - \ln(b_{y} + n_{0}) + {\boldsymbol{\omega}}^{T}\ln({\boldsymbol{\pi}}_{r}^{(1)}) - {\boldsymbol{\omega}}^{T}\ln({\boldsymbol{\pi}}_{r}^{(0)})]$

Table 3 Method for choosing $c_{1}, \dots, c_{p}$
1. Under $H_{0j}$, we assess the goodness of fit of $F_{j}$ to $G_{j}$ for each $j$ by the p-value of an appropriate hypothesis test, e.g. the Shapiro-Wilk test. Here, a large p-value favours a large $c_{j}$.
2. Under $H_{1j}$, we assess the distance between $F_{j1}$ and $F_{j0}$ by computing the p-value of a Kolmogorov-Smirnov test. Here, a large p-value favours a large $c_{j}$.
3. Denote the p-values computed in steps 1 and 2 by $\{v_{j0}\}_{j=1}^{p}$ and $\{v_{j1}\}_{j=1}^{p}$ respectively. Let ${\tilde{v}}_{j}$ be the p-value of whichever of $H_{0j}$ or $H_{1j}$ is true, i.e. ${\tilde{v}}_{j} = v_{j0}$ if $H_{0j}$ is true; ${\tilde{v}}_{j} = v_{j1}$ otherwise. Calculate the prior expected value of ${\tilde{v}}_{j}$: $E_{j} = {\mathbb{E}}({\tilde{v}}_{j}) = (v_{j1} + p^{u} \times v_{j0}) / (1 + p^{u})$.
4. Rank the expected values in ascending order: $E_{(1)}, \dots, E_{(p)}$.
5. Assign $c_{j} = \begin{cases} a_{1}, & \text{if } E_{j} < E_{(\lfloor p/4 \rfloor)}; \\ a_{2}, & \text{if } E_{(\lfloor p/4 \rfloor)} \leq E_{j} < E_{(\lfloor p/2 \rfloor)}; \\ a_{3}, & \text{if } E_{(\lfloor p/2 \rfloor)} \leq E_{j} < E_{(\lfloor 3p/4 \rfloor)}; \\ a_{4}, & \text{if } E_{j} \geq E_{(\lfloor 3p/4 \rfloor)}, \end{cases}$ where $a_{1} \leq a_{2} \leq a_{3} \leq a_{4}$ are constants that may be chosen from the range $(0, 100]$ to minimise resubstitution classification error in the training data.
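
Steps 3 to 5 of Table 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assign_smoothing` and its arguments are hypothetical, the p-values from steps 1 and 2 are taken as given inputs, and the constants $a_{1},\dots,a_{4}$ are supplied by the caller rather than tuned against resubstitution error.

```python
def assign_smoothing(v0, v1, u, a):
    """Map p-values to smoothing parameters c_j (steps 3-5 of Table 3).

    v0[j], v1[j]: p-values from steps 1 and 2, computed elsewhere.
    u: exponent of the Beta(1, p^u) prior.
    a: constants (a1, a2, a3, a4) with a1 <= a2 <= a3 <= a4 (assumed given).
    """
    p = len(v0)
    if p < 4:
        raise ValueError("need at least 4 variables for the quartile cut-offs")
    w = float(p) ** u
    # Step 3: prior expected p-value; H0 carries weight p^u and H1 weight 1.
    e = [(v1[j] + w * v0[j]) / (1.0 + w) for j in range(p)]
    # Step 4: order statistics E_(1) <= ... <= E_(p) (1-based in the text).
    s = sorted(e)
    q1, q2, q3 = s[p // 4 - 1], s[p // 2 - 1], s[(3 * p) // 4 - 1]
    # Step 5: piecewise assignment by quartile.
    c = []
    for ej in e:
        if ej < q1:
            c.append(a[0])
        elif ej < q2:
            c.append(a[1])
        elif ej < q3:
            c.append(a[2])
        else:
            c.append(a[3])
    return e, c
```

The weights in step 3 follow from the prior $\rho_{\gamma} \sim \text{Beta}(1, p^{u})$, under which the prior probability that $H_{0j}$ holds is $p^{u}/(1+p^{u})$.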

Table 4: Accuracy ($(TP+TN)/(P+N)\times 100$%) of variable selection for each simulation setting ($n=100$), where $TP$ equals the no. of true positives, $TN$ equals the no. of true negatives, $P$ equals the no. of positives and $N$ equals the no. of negatives.

Sim. Set.	VLDA	VQDA	penLDA	NSC	naïveBayesKernel	VNPDA
Sim. 1	89.91	75.72	83.07	87.13	93.10	97.62
Sim. 2	97.02	76.22	96.58	95.87	94.38	93.36
Sim. 3	90.40	74.37	82.72	89.48	92.25	99.09
Sim. 4	89.93	81.84	79.90	86.36	95.55	96.60
Sim. 5	89.97	72.52	82.69	87.32	85.68	90.00
Sim. 6	97.85	81.98	97.88	96.12	92.66	92.48

Equations

x_{ij} \mid y_{i} \stackrel{\text{iid}}{\sim} \begin{cases} F_{j1}, & \text{if } y_{i}=1; \text{ and} \\ F_{j0}, & \text{if } y_{i}=0. \end{cases}

\text{If } \gamma_{j}=1, \text{ then } x_{ij} \mid y_{i} \stackrel{\text{iid}}{\sim} \begin{cases} F_{j1}, & \text{if } y_{i}=1; \\ F_{j0}, & \text{if } y_{i}=0; \end{cases}

x_{ij} \stackrel{\text{iid}}{\sim} F_{j}.

y_{i} \mid \rho_{y} \stackrel{\text{iid}}{\sim} \text{Bernoulli}(\rho_{y}),

\rho_{y} \sim \text{Beta}(a_{y}, b_{y}).

\gamma_{j} \mid \rho_{\gamma} \stackrel{\text{iid}}{\sim} \text{Bernoulli}(\rho_{\gamma}), \quad 1 \leq j \leq p.

\rho_{\gamma} \sim \text{Beta}(1, p^{u}), \quad \text{for some } u > 1.

F_{j1}, F_{j0}, F_{j} \sim PT(\Pi_{j}, {\mathcal{A}}_{j}),

\alpha_{j,\epsilon 1}=\alpha_{j,\epsilon 0}=\begin{cases} 1, & \text{if } l=0; \text{ and} \\ c_{j}\times l^{2}, & \text{if } l\geq 1, \end{cases}

B_{\epsilon} = \left( G^{-1}\left\{ \frac{1}{2^{\ell}} \sum_{h=1}^{\ell} \epsilon_{h} 2^{\ell-h} \right\},\ G^{-1}\left\{ \frac{1}{2^{\ell}} \left( 1 + \sum_{h=1}^{\ell} \epsilon_{h} 2^{\ell-h} \right) \right\} \right],

p({\boldsymbol{\gamma}}, {\bf y}^{(new)} \mid {\bf x}, {\bf y}, {\bf x}^{(new)})

q({\boldsymbol{\gamma}}, {\bf y}^{(new)}) = \prod_{r=1}^{m} q(y_{n+r}) \prod_{j=1}^{p} q(\gamma_{j})

{\mathbb{E}}_{q}\left[ \ln\left\{ \frac{q({\boldsymbol{\gamma}}, {\bf y}^{(new)})}{p({\bf x}, {\bf y}, {\bf x}^{(new)}, {\bf y}^{(new)}, {\boldsymbol{\gamma}})} \right\} \right],

q(\gamma_{j}) \propto \exp\left[ {\mathbb{E}}_{-q(\gamma_{j})}\big\{ \ln p({\bf x},{\bf y},{\bf x}^{(new)},{\bf y}^{(new)},{\boldsymbol{\gamma}}) \big\} \right], \quad \text{and}

q(y_{n+r}) \propto \exp\left[ {\mathbb{E}}_{-q(y_{n+r})}\big\{ \ln p({\bf x},{\bf y},{\bf x}^{(new)},{\bf y}^{(new)},{\boldsymbol{\gamma}}) \big\} \right],

\omega_{j} \approx \text{expit}\left\{ \ln\text{BF}_{j} + \ln(1 + {\bf 1}^{T}{\boldsymbol{\omega}}_{-j}) - \ln(p^{u} + p - {\bf 1}^{T}{\boldsymbol{\omega}}_{-j} - 1) \right\},

\psi_{r}

\pi_{rj}^{(k)}

\displaystyle\hskip 71.13188pt-\ln\Big{\{}2\alpha_{j,\epsilon_{rj}(\ell+1)}+n_{j,\epsilon_{rj}(\ell)}^{(k)}\Big{\}}\bigg{\}}\Bigg{]},

{\boldsymbol{\omega}}^{T}\ln\big\{{\boldsymbol{\pi}}_{r}^{(1)}\big\} - {\boldsymbol{\omega}}^{T}\ln\big\{{\boldsymbol{\pi}}_{r}^{(0)}\big\} = \sum_{j=1}^{p} \omega_{j} \left\{ \ln(\pi_{rj}^{(1)}/\pi_{rj}^{(0)}) \right\},

\omega_{j} = \text{expit}\left[ {\mathbb{E}}_{-q(\gamma_{j})}\left\{ \ln(1 + {\bf 1}^{T}{\boldsymbol{\gamma}}_{-j}) - \ln(p^{u} + p - {\bf 1}^{T}{\boldsymbol{\gamma}}_{-j} - 1) + \ln\text{BF}_{j} \right\} \right].

\begin{array}{l}
\ln\text{BF}_{j} = \sum_{\ell=0}^{\infty} \sum_{\epsilon\in\text{\tt bin}(\ell)} \bigg[ \\
\quad\phantom{+} \ln{\mathcal{B}}\Big\{ \alpha_{j,\epsilon 0} + n_{j,\epsilon 0}^{(1)} + I(y_{n+1}=1, x_{n+1,j}\in B_{j,\epsilon 0}),\ \alpha_{j,\epsilon 1} + n_{j,\epsilon 1}^{(1)} + I(y_{n+1}=1, x_{n+1,j}\in B_{j,\epsilon 1}) \Big\} \\
\quad + \ln{\mathcal{B}}\Big\{ \alpha_{j,\epsilon 0} + n_{j,\epsilon 0}^{(0)} + I(y_{n+1}=0, x_{n+1,j}\in B_{j,\epsilon 0}),\ \alpha_{j,\epsilon 1} + n_{j,\epsilon 1}^{(0)} + I(y_{n+1}=0, x_{n+1,j}\in B_{j,\epsilon 1}) \Big\} \\
\quad - \ln{\mathcal{B}}\Big\{ \alpha_{j,\epsilon 0} + n_{j,\epsilon 0} + I(x_{n+1,j}\in B_{j,\epsilon 0}),\ \alpha_{j,\epsilon 1} + n_{j,\epsilon 1} + I(x_{n+1,j}\in B_{j,\epsilon 1}) \Big\} - \ln{\mathcal{B}}(\alpha_{j,\epsilon 0},\, \alpha_{j,\epsilon 1}) \bigg],
\end{array}

\begin{array}{rcl}
{\mathbb{E}}_{-q_{j}} \ln(1 + {\bf 1}^{T}{\boldsymbol{\gamma}}_{-j}) & \approx & \ln(1 + {\bf 1}^{T}{\boldsymbol{\omega}}_{-j}), \\
{\mathbb{E}}_{-q_{j}} \ln(p^{u} + p - {\bf 1}^{T}{\boldsymbol{\gamma}}_{-j} - 1) & \approx & \ln(p^{u} + p - {\bf 1}^{T}{\boldsymbol{\omega}}_{-j} - 1),
\end{array}

\ln{\mathcal{B}}\Big\{ \alpha_{j,\epsilon 0} + n_{j,\epsilon 0}^{(1)} + I(y_{n+1}=1, x_{n+1,j}\in B_{j,\epsilon 0}),\ \alpha_{j,\epsilon 1} + n_{j,\epsilon 1}^{(1)} + I(y_{n+1}=1, x_{n+1,j}\in B_{j,\epsilon 1}) \Big\} \approx \ln{\mathcal{B}}(\alpha_{j,\epsilon 0} + n_{j,\epsilon 0}^{(1)},\, \alpha_{j,\epsilon 1} + n_{j,\epsilon 1}^{(1)}).

\omega_{j}

\ln\text{BF}_{j} \approx \sum_{\ell=0}^{\infty} \sum_{\epsilon\in\text{\tt bin}(\ell)} \bigg\{ \ln{\mathcal{B}}(\alpha_{j,\epsilon 0} + n_{j,\epsilon 0}^{(1)},\, \alpha_{j,\epsilon 1} + n_{j,\epsilon 1}^{(1)}) - \ln{\mathcal{B}}(\alpha_{j,\epsilon 0},\, \alpha_{j,\epsilon 1}) + \ln{\mathcal{B}}(\alpha_{j,\epsilon 0} + n_{j,\epsilon 0}^{(0)},\, \alpha_{j,\epsilon 1} + n_{j,\epsilon 1}^{(0)}) - \ln{\mathcal{B}}(\alpha_{j,\epsilon 0} + n_{j,\epsilon 0},\, \alpha_{j,\epsilon 1} + n_{j,\epsilon 1}) \bigg\}.

\ln\text{BF}_{j} = \sum_{\ell=0}^{M_{j}} \sum_{\epsilon\in\text{\tt bin}(\ell)} \bigg\{ \ln{\mathcal{B}}(\alpha_{j,\epsilon 0} + n_{j,\epsilon 0}^{(1)},\, \alpha_{j,\epsilon 1} + n_{j,\epsilon 1}^{(1)}) - \ln{\mathcal{B}}(\alpha_{j,\epsilon 0},\, \alpha_{j,\epsilon 1}) + \ln{\mathcal{B}}(\alpha_{j,\epsilon 0} + n_{j,\epsilon 0}^{(0)},\, \alpha_{j,\epsilon 1} + n_{j,\epsilon 1}^{(0)}) - \ln{\mathcal{B}}(\alpha_{j,\epsilon 0} + n_{j,\epsilon 0},\, \alpha_{j,\epsilon 1} + n_{j,\epsilon 1}) \bigg\}.

\psi

Code & Models

Repositories

weichangyu10/VaDA (official)

Taxonomy

Topics: Bayesian Methods and Mixture Models · Statistical Methods and Inference · Gene expression and cancer classification

Full text

Variational Nonparametric Discriminant Analysis

Weichang Yu

Lamiae Azizi

John T. Ormerod

School of Mathematics and Statistics, University of Sydney

ARC Centre of Excellence for Mathematical and Statistical Frontiers,

The University of Melbourne, Parkville VIC 3010, Australia

Abstract

Variable selection and classification are common objectives in the analysis of high-dimensional data. Most such methods make distributional assumptions that may not be compatible with the diverse families of distributions data can take. A novel Bayesian nonparametric discriminant analysis model that performs both variable selection and classification within a seamless framework is proposed. Pólya tree priors are assigned to the unknown group-conditional distributions to account for their uncertainty, and allow prior beliefs about the distributions to be incorporated simply as hyperparameters. The adoption of collapsed variational Bayes inference in combination with a chain of functional approximations led to an algorithm with low computational cost. The resultant decision rules carry heuristic interpretations and are related to an existing two-sample Bayesian nonparametric hypothesis test. By an application to some simulated and publicly available real datasets, the proposed method exhibits good performance when compared to current state-of-the-art approaches.

Keywords: Variational inference, Bayesian nonparametrics, Pólya trees, Classification, Variable selection, High dimensional statistics.

Journal: Computational Statistics and Data Analysis

1 Introduction

Discriminant analysis (DA) is a classifier that has seen several recent applications in high dimensional data analysis. In the context of two response groups (group 1 and group 0), DA classifies a new observation to group 1 if the likelihood of the variables and response evaluated at group 1 is greater than the likelihood evaluated at group 0. To facilitate the computation of the likelihood, the group-conditional distribution (the conditional distribution given the response) of the predictor-variables is commonly assumed to be Gaussian as with the popular variants such as linear discriminant analysis and quadratic discriminant analysis (Fisher, 1936; McLachlan, 1992).

Drawbacks of traditional formulations of DA for high-dimensional data analysis include the lack of variable selection options and restrictive distributional assumptions. Variable selection techniques are important because they overcome the gradual accumulation of estimation errors as the number of variables increases (Fan and Lv, 2010). In addition, erroneous assumptions imposed on the diverse range of distributional families from which the variables could be drawn may lead to an inflation of classification errors. While extensions that perform variable selection are available (see for example: Friedman, 1989; Ahdesmäki and Strimmer, 2010; Witten and Tibshirani, 2011; Yu et al., 2018), most of these rely on normality assumptions that do not hold in general. Other extensions perform dimension reduction at the pre-classification stage, which sacrifices the interpretability of results that is usually required to shed light on underlying scientific questions (see for example Sugiyama, 2007).

While solutions such as monotonic transformations (see for example Box and Cox, 1964; Benjamini and Speed, 2012) and finite mixture modelling for the group-conditional distributions (Hastie and Tibshirani, 1993; Celeux, 2006) can improve the fit of the data, problems can still arise. In particular, finite mixture modelling is still subject to model misspecification if the number of mixture components and/or the mixture densities are incorrectly specified (Fraley and Raftery, 2002).

Model specification problems can be mitigated by adopting nonparametric approaches as they do not confine the model space to a particular parametric family of distributions. An example of the nonparametric approach is to estimate the unknown group-conditional distributions with kernel density estimators (see Hall and Wand, 1988; Ghosh and Chaudhuri, 2004; Ghosh et al., 2006). However, the density may be undersmoothed in regions of the domain space where there are few observations.

In Bayesian nonparametrics, the unknown distributions are regarded as random measures, and assigned a prior on the space of possible distributions. A popular option for this prior is the Dirichlet process mixture that has been incorporated into several DA models (see for example Fuentes-García et al., 2010; Ouyang and Liang, 2017). One alternative has been proposed by Gutiérrez et al. (2014). In their two-stage variable selection and classification procedure for high dimensional data, they assigned a Gaussian process prior in a penalised spline model to identify informative variables, and then a geometric-weighted mixture model for the classification step.

An alternative nonparametric prior is the Pólya tree (Mauldin et al., 1992). Unlike the other Bayesian nonparametric priors, this prior allows information about the unknown distribution to be incorporated as the prior expectation. The Pólya tree has been applied extensively to density estimation and to Bayesian nonparametric hypothesis testing problems (Hanson and Johnson, 2002; Berger and Guglielmi, 2001; Ma and Wong, 2011; Holmes et al., 2015; Cipolli et al., 2016; Filippi and Holmes, 2017). In particular, Chen and Hanson (2014) proposed a Bayesian nonparametric analogue of the multiple samples test that demonstrated superior results to a similar model (Holmes et al., 2015) by assigning a Pólya tree to each of the unknown population distributions. Cipolli et al. (2016) proposed a multiple hypothesis testing framework, and assigned a Pólya tree hyper-prior for the unknown mean parameters of a Gaussian model. However, to our knowledge, only Cipolli and Hanson (2018) attempted to use a multivariate Pólya tree prior on the joint distribution of the variables in the context of classification. Although their proposed method seems to have good performance in most suitable datasets, additional pre-processing is required in high dimensional settings.

In this paper, we propose a novel dual-objective Bayesian nonparametric DA model that makes inference for variable selection and classification in a unified framework. This is achieved by introducing a set of variable selection parameters into the hierarchical framework of the Bayesian DA model, and an independence assumption between variables to handle high dimensionality. We assign a Pólya tree prior to the unknown group-conditionals as a representation of our uncertainty and use collapsed variational Bayes inference to obtain posterior estimates. This combination of prior choice and posterior inference method leads to a classification rule that carries a heuristic interpretation. The heavy computational burden that comes with fitting a nonparametric model is reduced to an acceptable range through several computational shortcuts that follow from functional approximations. This makes our proposed model appealing for analysing high dimensional data.

Our approach builds on the previous works described above, but differs in several ways related to the use of the Pólya tree prior. In Cipolli et al. (2016), the Pólya tree is assigned as a hyperprior to unknown Gaussian means of the data, whereas in our model the Pólya tree is assigned as the prior of the unknown group-conditional distributions that the variables are drawn from. Cipolli and Hanson (2018) assigned a Pólya tree prior on the joint group-conditional distribution of the variables instead of having $p$ univariate Pólya tree priors, as is the case with our model. Both of these papers have been proposed for low dimensional data analysis in contrast to our approach that was specifically designed to handle situations in which the number of variables ( $p$ ) is greater than the sample size ( $n$ ). Furthermore, we have simplified the model for high dimensional datasets by assuming independence between variables and assigned a separate Pólya tree prior to the group-conditional distribution of each variable.

In Section 2, we will provide a description of our proposed unified model for high dimensional classification and justify the choice of priors. This will also include a brief description of the Pólya tree construction scheme. In Section 3, we will elaborate on the posterior inference of our model, and the heuristic interpretation of our resultant classification rule. This will be followed by a short Section 4 that discusses the setting of a hyperparameter in our model. In Section 5, we compare our proposed model with existing options in simulated, and gene expression datasets. Circumstances that lead to good performance of the model will also be discussed. Finally, we will suggest possible extensions, and conclude in Section 6.

2 Discriminant analysis with variable selection

Consider a data set consisting of $n$ observations $\{(y_{i},{\bf x}_{i}),i=1,\ldots,n\}$, where ${\bf x}_{i}=(x_{i1},\ldots,x_{ip})^{T}\in{\mathbb{R}}^{p}$ is the vector of $p$ predictor variables and $y_{i}$ is the binary response of the $i$-th observation. In this paper, we focus on the case of continuous variables, and defer the discussion of an extension to other variable types to Section 6. Conditional on the responses $(y_{1},\ldots,y_{n})^{T}$, we assume independence between observations, i.e., ${\bf x}_{u} \perp\!\!\!\perp {\bf x}_{v}$ where $u\neq v$, and between variables, i.e., $x_{ih} \perp\!\!\!\perp x_{ig}$ where $h\neq g$. The latter is a common assumption for high-dimensional DA models known as the naïve Bayes assumption (see Fan and Fan, 2008; Ahdesmäki and Strimmer, 2010; Witten, 2011; Witten and Tibshirani, 2011). This assumption allows us to circumvent infeasible computations imposed in high dimensional settings ($p\gg n$), and has been shown to perform reasonably well under some conditions (Bickel and Levina, 2004; Yu et al., 2018).

Here, we describe a modification of the usual DA model described in McLachlan (1992) that allows for variable selection in the context of pairwise independent variables. Given two distributions $F_{j1}$ and $F_{j0}$ , the group-conditional distributions of the usual discriminant analysis are

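$$x_{ij} \mid y_{i} \stackrel{\text{iid}}{\sim} \begin{cases} F_{j1}, & \text{if } y_{i}=1; \text{ and} \\ F_{j0}, & \text{if } y_{i}=0. \end{cases}$$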

To extend this model to select discriminative variables, we shall introduce a set of binary variable selection parameters ${\boldsymbol{\gamma}}=(\gamma_{1},\ldots,\gamma_{p})^{T}.$ Given three distributions $F_{j}$ , $F_{j1}$ and $F_{j0}$ , each $\gamma_{j}$ controls the sampling scheme of $X_{ij}$ for $1\leq i\leq n$ as follows:

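$$\text{If } \gamma_{j}=1, \text{ then } x_{ij} \mid y_{i} \stackrel{\text{iid}}{\sim} \begin{cases} F_{j1}, & \text{if } y_{i}=1; \\ F_{j0}, & \text{if } y_{i}=0; \end{cases}$$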

and if $\gamma_{j}=0$ , then

$$x_{ij} \stackrel{\text{iid}}{\sim} F_{j}.$$

Note that $\gamma_{j}=1$ corresponds to a case in which variable $j$ is discriminative while $\gamma_{j}=0$ corresponds to a non-discriminative case.

The binary responses are distributed as

$$y_{i} \mid \rho_{y} \stackrel{\text{iid}}{\sim} \text{Bernoulli}(\rho_{y}),$$

where the parameter $\rho_{y}$ may be interpreted as the prior probability of sampling an observation from response group 1.

2.1 Priors for $\rho_{y}$ and ${\boldsymbol{\gamma}}$

In most applications, the population proportion $\rho_{y}$ is unknown and this has also been the case with the data which we have analysed in Section 5. A Bayesian solution is to assign a hyper prior distribution for $\rho_{y}$ , and a natural choice of prior would be the beta distribution, i.e.,

$$\rho_{y} \sim \text{Beta}(a_{y}, b_{y}).$$

Due to the beta-binomial conjugacy, this choice leads to a closed form expression for the joint density of the model marginalised over $\rho_{y}$ which is useful for posterior inference.

A natural choice of prior for the binary variable selection parameters, $\gamma_{1},\ldots,\gamma_{p}$ , is the Bernoulli distribution, i.e.,

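$$\gamma_{j} \mid \rho_{\gamma} \stackrel{\text{iid}}{\sim} \text{Bernoulli}(\rho_{\gamma}), \quad 1 \leq j \leq p.$$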

The parameter $\rho_{\gamma}$ may be interpreted as the proportion of discriminative variables in the dataset. Following the class of complexity priors (Castillo et al., 2015) that has demonstrated the ability to down-weight high-dimension models while allocating sufficient prior probability to the true model in several problems, we have chosen the hyper prior

$$\rho_{\gamma} \sim \text{Beta}(1, p^{u}), \quad \text{for some } u > 1.$$

The choice of the constant $u$ affects the penalty of the resultant variable selection rule as described later in Section 3.
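
To see how $u$ enters as a penalty, it can help to marginalise $\rho_{\gamma}$ out of the two priors above. The short derivation below is a standard beta-Bernoulli calculation rather than a reproduction of the paper's appendix; the shorthand $s = {\bf 1}^{T}{\boldsymbol{\gamma}}_{-j}$, the number of selected variables other than $j$, is introduced here for illustration.

```latex
\begin{aligned}
p({\boldsymbol{\gamma}})
  &= \int_{0}^{1} \rho_{\gamma}^{{\bf 1}^{T}{\boldsymbol{\gamma}}}
     (1-\rho_{\gamma})^{p-{\bf 1}^{T}{\boldsymbol{\gamma}}}\,
     \frac{(1-\rho_{\gamma})^{p^{u}-1}}{{\mathcal{B}}(1,p^{u})}\, d\rho_{\gamma}
   = \frac{{\mathcal{B}}(1+{\bf 1}^{T}{\boldsymbol{\gamma}},\; p^{u}+p-{\bf 1}^{T}{\boldsymbol{\gamma}})}
          {{\mathcal{B}}(1,p^{u})}, \\[4pt]
\frac{p(\gamma_{j}=1 \mid {\boldsymbol{\gamma}}_{-j})}{p(\gamma_{j}=0 \mid {\boldsymbol{\gamma}}_{-j})}
  &= \frac{{\mathcal{B}}(2+s,\; p^{u}+p-s-1)}{{\mathcal{B}}(1+s,\; p^{u}+p-s)}
   = \frac{1+s}{p^{u}+p-s-1},
   \qquad s={\bf 1}^{T}{\boldsymbol{\gamma}}_{-j}.
\end{aligned}
```

Taking logarithms gives $\ln(1+s)-\ln(p^{u}+p-s-1)$, so the prior log-odds of selecting variable $j$ fall as $u$ grows and rise with the number of other variables already selected.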

2.2 Priors for unknown distributions

Yu et al. (2018) modelled the distributions $F_{j1}$, $F_{j0}$ and $F_{j}$, $1\leq j\leq p$, described in (1) and (2) as Gaussian. As discussed in the introduction, this is restrictive. Instead, we treat the $F$'s in this paper as unknown probability measures and assign them a Pólya tree prior (Lavine, 1992). This prior is a distribution defined on a family of distributions on a domain $B$. Hence, one draw from a Pólya tree is a particular probability distribution. The process to draw a distribution from a Pólya tree is described in Table 1.
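
As a rough illustration of the construction scheme, the sketch below draws the partition-subset probabilities of a Pólya tree down to a finite depth. The truncation depth, the function names `draw_polya_tree` and `canonical_alpha`, and the use of binary strings as keys are all assumptions made for this example; the actual prior is defined on an infinite tree.

```python
import random

def canonical_alpha(c):
    """Beta parameter for the subset B_eps: 1 at the first layer, c * l^2 below,
    where l is the length of the parent's binary representation."""
    def alpha(eps):
        level = len(eps) - 1
        return 1.0 if level == 0 else c * level ** 2
    return alpha

def draw_polya_tree(depth, alpha, rng):
    """Return a dict mapping each binary string eps (length <= depth) to p(B_eps)."""
    prob = {"": 1.0}              # the whole domain B has probability one
    frontier = [""]
    for _ in range(depth):
        nxt = []
        for eps in frontier:
            # theta_eps ~ Beta(alpha_{eps0}, alpha_{eps1}): chance of branching left
            theta = rng.betavariate(alpha(eps + "0"), alpha(eps + "1"))
            prob[eps + "0"] = prob[eps] * theta
            prob[eps + "1"] = prob[eps] * (1.0 - theta)
            nxt += [eps + "0", eps + "1"]
        frontier = nxt
    return prob

# Example: one random draw, three layers deep, with smoothing parameter c = 2.
probs = draw_polya_tree(3, canonical_alpha(2.0), random.Random(0))
```

Each layer of the returned dictionary sums to one, and `probs["01"]` equals the root draw times one minus the draw at `B_0`, matching the product rule in step 5 of the scheme.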

Following the Pólya tree prior construction scheme, we specify our priors for the unknown distributional components of variable $j$ as

$$F_{j1}, F_{j0}, F_{j} \sim PT(\Pi_{j}, {\mathcal{A}}_{j}),$$

for some collections of partition-subsets $\Pi_{j}$ and non-negative numbers ${\mathcal{A}}_{j}$ .

We consider the setting of hyperparameters $\Pi_{j}$ and ${\mathcal{A}}_{j}$ . Their choices should depend on the domain and prior information about the unknown distributional components of variable $j$ . Since we are considering only continuous variables here, we ensure that each of the unknown distributional components is continuous with probability one (Blackwell, 1973) by adopting the canonical choice for ${\mathcal{A}}_{j}$ (Lavine, 1992) as follows. Let bin denote the set of all binary representations. For any $\epsilon\in\mbox{{\tt bin}}$ , set

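$$\alpha_{j,\epsilon 1}=\alpha_{j,\epsilon 0}=\begin{cases} 1, & \text{if } l=0; \text{ and} \\ c_{j}\times l^{2}, & \text{if } l\geq 1, \end{cases}$$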

where $l$ is the length of $\epsilon$ and $c_{j}$ , $1\leq j\leq p$ are smoothing parameters.

We have allowed the smoothing parameters to vary between variables. This differs from other papers (Hanson, 2006; Cipolli et al., 2016) that deal with multivariate data, in which a common smoothing parameter $c$ was used instead of a separate smoothing parameter $c_{j}$ for each variable $j$. Although such a specification incurs extra computational cost, we found it to be a necessary price to pay as it takes into account the varying closeness of the distribution of each variable to its respective Pólya tree centring distribution, which we shall describe here.

The partitions $\Pi$ may be specified such that the Pólya tree prior is centred about a (fixed) “centring” distribution $G$ , i.e., ${\mathbb{E}}\{F(A)\}=G(A)$ for all measurable subsets $A\subseteq B$ . This is implemented through the following scheme - for any binary representation $\epsilon$ ,

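$$B_{\epsilon} = \left( G^{-1}\left\{ \frac{1}{2^{\ell}} \sum_{h=1}^{\ell} \epsilon_{h} 2^{\ell-h} \right\},\ G^{-1}\left\{ \frac{1}{2^{\ell}} \left( 1 + \sum_{h=1}^{\ell} \epsilon_{h} 2^{\ell-h} \right) \right\} \right],$$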

where $\ell$ is the length of $\epsilon$ , the binary digit $\epsilon_{h}=0$ if the path involves branching left at layer $h$ of the tree; and $\epsilon_{h}=1$ otherwise. We denote a Pólya tree prior centred about $G$ by $PT(\Pi(G),{\mathcal{A}})$ . In our model, we set $\Pi_{j}=\Pi(G_{j})$ , where $G_{j}$ is the Gaussian distribution parameterised by sample moments as were similarly specified in Chen and Hanson (2014), Holmes et al. (2015), and Cipolli and Hanson (2018). This is in line with other DA models that outrightly assume a Gaussian distribution for the variables. However, unlike these Gaussian DA models, the choice of $G_{j}$ has little influence on the posterior inference as it will be overruled by a sufficient amount of data.
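
The centring scheme amounts to cutting the domain at quantiles of $G$, so that every subset at layer $\ell$ has mass $2^{-\ell}$ under $G$. The sketch below computes the endpoints of $B_{\epsilon}$ for a Gaussian centring distribution using only the Python standard library; the function name `subset_bounds` and the binary-string encoding of $\epsilon$ are illustrative choices, not taken from the paper.

```python
import math
from statistics import NormalDist

def subset_bounds(eps, centre):
    """Endpoints (lower, upper] of B_eps for a centring distribution `centre`."""
    ell = len(eps)
    k = int(eps, 2) if eps else 0          # k = sum_h eps_h * 2^(ell - h)
    lo_p, hi_p = k / 2 ** ell, (k + 1) / 2 ** ell
    # inv_cdf is only defined on (0, 1), so the extreme quantiles are set by hand
    lower = -math.inf if lo_p == 0.0 else centre.inv_cdf(lo_p)
    upper = math.inf if hi_p == 1.0 else centre.inv_cdf(hi_p)
    return lower, upper

# Example with a standard normal centring distribution:
# B_10 runs from the median to the upper quartile.
lo, hi = subset_bounds("10", NormalDist(0.0, 1.0))
```

In the model the centring distribution for variable $j$ would be a Gaussian with that variable's sample mean and standard deviation in place of the standard normal used here.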

3 Model inference

The objective of our model is to identify the discriminative variables and classify new observations. Suppose we have observed ${\mathcal{Y}}=\{(y_{i},{\bf x}_{i}),1\leq i\leq n\}$ as realisations of the model in Section 2. Let $\{(y_{n+r},{\bf x}_{n+r}),1\leq r\leq m\}$ denote $m$ new observations to be classified. The required posterior is

$$p({\boldsymbol{\gamma}}, {\bf y}^{(new)} \mid {\bf x}, {\bf y}, {\bf x}^{(new)}),$$

where ${\bf y}^{(new)}=(y_{n+1},\ldots,y_{n+m})$ and ${\bf x}^{(new)}=[{\bf x}_{n+1},\ldots,{\bf x}_{n+m}]^{T}$ are the responses and variables of the new observations respectively. Clearly, this posterior is intractable as it requires a sum of $p({\bf x},{\bf y},{\bf x}^{(new)},{\bf y}^{(new)},{\boldsymbol{\gamma}})$ over $2^{p+m}$ combinations of $({\boldsymbol{\gamma}},{\bf y}^{(new)})$ .

We will utilise the collapsed variational Bayes (CVB) as described in Teh et al. (2007) to approximate the required posterior. This choice of posterior inference stems from the scalability of CVB to high dimensional problems. Since we have performed posterior inference with variational inference, we shall name our model the variational nonparametric discriminant analysis (VNPDA).

The CVB approach approximates the actual posterior in (7) with a product of densities of the form

$$q({\boldsymbol{\gamma}}, {\bf y}^{(new)}) = \prod_{r=1}^{m} q(y_{n+r}) \prod_{j=1}^{p} q(\gamma_{j})$$

that minimises the KL-divergence

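$${\mathbb{E}}_{q}\left[ \ln\left\{ \frac{q({\boldsymbol{\gamma}}, {\bf y}^{(new)})}{p({\bf x}, {\bf y}, {\bf x}^{(new)}, {\bf y}^{(new)}, {\boldsymbol{\gamma}})} \right\} \right],$$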

where ${\mathbb{E}}_{q}$ is an expectation with respect to $q({\boldsymbol{\gamma}},{\bf y}^{(new)})$ . The product components of this optimal $q$ -density may be obtained via the iterative equation:

[TABLE]

where the expectations are taken with respect to $\prod_{s\neq j}q(\gamma_{s})\prod_{r=1}^{m}q(y_{n+r})$ and $\prod_{j=1}^{p}q(\gamma_{j})\prod_{h\neq r}q(y_{n+h})$ respectively.

Since $\gamma_{j}$ and $y_{n+r}$ are binary random variables, their respective $q$-densities are Bernoulli probability mass functions, and hence we need only compute the probabilities $\omega_{j}=q(\gamma_{j}=1)$ for $1\leq j\leq p$, and $\psi_{r}=q(y_{n+r}=1)$ for $1\leq r\leq m$. These values are computed with a coordinate ascent algorithm (Blei et al., 2017) as described in Table 2. Details of the derivation of the algorithm may be found in Appendix A.

The update equations are as follows.

Update for $\omega_{j}$ . Following equation (8), the resultant update equation for $\omega_{j}$ is approximately

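$$\omega_{j}\leftarrow\text{expit}\left[\ln\mbox{BF}_{j}+\ln(1+{\bf 1}^{T}{\boldsymbol{\omega}}_{-j})-\ln(p^{u}+p-{\bf 1}^{T}{\boldsymbol{\omega}}_{-j}-1)\right]$$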

where ${\boldsymbol{\omega}}_{-j}={\mathbb{E}}_{-q_{j}}({\boldsymbol{\gamma}}_{-j})$, the column vector ${\boldsymbol{\gamma}}_{-j}$ is the result of removing the $j$th entry from ${\boldsymbol{\gamma}}$, the function $\text{expit}(z)=(1+e^{-z})^{-1}$, and the term $\ln\mbox{BF}_{j}$ (refer to Appendix A for the explicit expression) is the log Bayes factor of a two-sample Bayesian nonparametric test of the hypothesis $\{H_{1j}:\gamma_{j}=1\}$ against $\{H_{0j}:\gamma_{j}=0\}$ (Holmes et al., 2015). Consequently, $\omega_{j}$ is an increasing function of a penalised log Bayes factor.

To keep the computational cost of the CVB algorithm within an acceptable range, we used the following approximations: (i) Taylor expansions of nonlinear functions of ${\bf 1}^{T}{\boldsymbol{\gamma}}_{-j}$, e.g. $\ln(1+{\bf 1}^{T}{\boldsymbol{\gamma}}_{-j})$; (ii) Stirling’s approximation (Abramowitz and Stegun, 2002) to beta functions involving $y_{n+r}$ (under the assumption that $1/n$ is small). The reader may refer to Appendix A for details on how we have applied the approximations.
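The effect of approximation (i) can be checked numerically on a small example. The sketch below assumes a first-order expansion, i.e. ${\mathbb{E}}\{\ln(1+{\bf 1}^{T}{\boldsymbol{\gamma}})\}\approx\ln(1+{\bf 1}^{T}{\boldsymbol{\omega}})$ for independent Bernoulli entries; the order of the expansion and the probabilities in `omega` are assumptions made for illustration.

```python
import itertools
import math


def exact_expected_log(omega):
    """E[ln(1 + sum(gamma))] for independent Bernoulli(omega_j), by enumeration."""
    total = 0.0
    for gamma in itertools.product((0, 1), repeat=len(omega)):
        prob = 1.0
        for g, w in zip(gamma, omega):
            prob *= w if g == 1 else 1.0 - w
        total += prob * math.log(1 + sum(gamma))
    return total


def taylor_expected_log(omega):
    """First-order Taylor approximation: evaluate ln(1 + s) at s = sum(omega)."""
    return math.log(1 + sum(omega))


# Hypothetical selection probabilities for six variables.
omega = [0.9, 0.8, 0.95, 0.1, 0.05, 0.9]
exact = exact_expected_log(omega)
approx = taylor_expected_log(omega)
```

The enumeration costs $2^{p}$ terms while the approximation costs $p$ additions, which is the reason such an expansion is attractive when $p$ is large. Since $\ln$ is concave, the first-order value is an upper bound on the exact expectation.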

Observe that the penalty term $\ln(1+{\bf 1}^{T}{\boldsymbol{\omega}}_{-j})-\ln(p^{u}+p-{\bf 1}^{T}{\boldsymbol{\omega}}_{-j}-1)$ is decreasing in $u$ . Therefore, the setting of $u$ should be considered carefully as it controls the trade-off between errors of type I (selecting truly non-discriminative variables as discriminative) and type II (missing out on truly discriminative variables).
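The role of $u$ can be made concrete with a minimal sketch. It assumes the update has the form $\text{expit}(\ln\mbox{BF}_{j}+\text{penalty})$ suggested by the description above; the function names and the numerical inputs are hypothetical.

```python
import math


def expit(z):
    """Numerically stable logistic function (1 + exp(-z))^(-1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def penalty(omega_rest_sum, p, u):
    """ln(1 + s) - ln(p^u + p - s - 1), where s = 1^T omega_{-j}."""
    return math.log(1.0 + omega_rest_sum) - math.log(p**u + p - omega_rest_sum - 1.0)


def omega_update(log_bf, omega_rest_sum, p, u):
    """Assumed form of the update: expit of the penalised log Bayes factor."""
    return expit(log_bf + penalty(omega_rest_sum, p, u))


# Hypothetical inputs: p = 500 variables, about 10 currently selected, ln BF_j = 5.
loose = omega_update(5.0, 10.0, 500, u=1.0)
strict = omega_update(5.0, 10.0, 500, u=2.0)
```

With these inputs the same log Bayes factor yields a selection probability above one half when $u=1$ and close to zero when $u=2$, which illustrates how a larger $u$ trades type I errors for type II errors.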

Update for $\psi_{r}$ . The update equation for $\psi_{r}$ is approximately

[TABLE]

where the number of observations from response group $k$ is $n_{k}$ , the column vector ${\boldsymbol{\pi}}_{r}^{(k)}$ of size $p$ is such that the j-th element is

[TABLE]

the binary representation $\epsilon_{rj}(\ell)$ denotes the first $\ell$ branching directions of the path of $x_{n+r,j}$ through the Pólya tree $PT(\Pi(G_{j}),{\mathcal{A}}_{j})$ , the number of observations from response group $k$ that fall in the partition-subset $B_{j,\epsilon}$ is $n_{j,\epsilon}^{(k)}$ , and $N_{j}$ is a constant.

Both Taylor’s and Stirling’s approximations are also used to obtain (11). This allows us to update ${\boldsymbol{\omega}}$ in isolation before using the converged value to calculate each $\psi_{r}$ individually, thus reducing the computational cost.
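The first stage of this two-stage schedule can be sketched as a coordinate ascent loop over ${\boldsymbol{\omega}}$ alone. The sketch assumes that, after the approximations, each $\ln\mbox{BF}_{j}$ is a fixed number and that the update is the expit of the penalised log Bayes factor; the function `fit_omega`, its starting value, and its stopping rule are illustrative choices, not the packaged implementation.

```python
import math


def expit(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_omega(log_bf, u, tol=1e-8, max_iter=100):
    """Sweep omega_1, ..., omega_p until the largest change falls below tol."""
    p = len(log_bf)
    omega = [0.5] * p                 # hypothetical starting value
    total = sum(omega)                # running value of 1^T omega
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            rest = total - omega[j]   # 1^T omega_{-j}
            pen = math.log(1.0 + rest) - math.log(p**u + p - rest - 1.0)
            new = expit(log_bf[j] + pen)
            max_change = max(max_change, abs(new - omega[j]))
            total += new - omega[j]
            omega[j] = new
        if max_change < tol:
            break
    return omega


# Two strongly discriminative variables followed by three weak ones (made-up values).
omega_hat = fit_omega([20.0, 20.0, -5.0, -5.0, -5.0], u=1.0)
```

Once `omega_hat` has converged, each $\psi_{r}$ can be evaluated once per new observation without revisiting the loop.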

The final classification rule may be heuristically interpreted as a function of pseudo-proportion ratios. Observe that the term

[TABLE]

is a weighted sum of log proportion ratios. Briefly, the proportion $\pi_{rj}^{(1)}$ is large if a large proportion of observations from group 1 have similar branching directions as $x_{n+r,j}$ . In other words, the classification rule classifies a new observation as group 1 if its path is more similar to paths taken by group 1 observations than those from group 0.
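A minimal sketch of this interpretation follows. It assumes that the weights are the selection probabilities $\omega_{j}$ and that the group sizes enter through an offset $\ln(n_{1}/n_{0})$; neither detail is confirmed by the surrounding text, so both are labelled as assumptions, and all numbers are made up.

```python
import math


def expit(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def weighted_log_ratio(omega, pi1, pi0):
    """Sum over j of omega_j * ln(pi1_j / pi0_j); the weights are an assumption."""
    return sum(w * (math.log(a) - math.log(b)) for w, a, b in zip(omega, pi1, pi0))


def classify_probability(omega, pi1, pi0, n1, n0):
    """Assumed form of psi_r: expit of a group-size offset plus the weighted sum."""
    return expit(math.log(n1 / n0) + weighted_log_ratio(omega, pi1, pi0))


# Variable 1 is selected (omega = 1) and favours group 1; variable 2 is ignored.
omega = [1.0, 0.0]
pi1 = [0.8, 0.1]
pi0 = [0.2, 0.9]
score = weighted_log_ratio(omega, pi1, pi0)
psi = classify_probability(omega, pi1, pi0, n1=50, n0=50)
```

Variables with $\omega_{j}$ near zero contribute almost nothing to the score, so the rule effectively compares branching paths only along the selected variables.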

Our approach to computing the marginal log-likelihood of the data is very similar to that described by Holmes et al. (2015). The computational cost of the algorithm is of order $O(np)$ under the settings justified in Appendix A. This comes from traversing $p$ Pólya trees, each truncated at layer $\lfloor\log_{2}(n)\rfloor$.
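The traversal for a single variable can be sketched as follows. The sketch assumes quantile-based partitions of a Gaussian centring distribution and tabulates the subset counts layer by layer down to the truncation depth; `subset_counts` and the toy data are hypothetical.

```python
import math
from statistics import NormalDist


def subset_counts(values, centre, depth):
    """Map each binary path epsilon (a tuple) to the number of values in B_epsilon."""
    counts = {}
    for x in values:
        u = centre.cdf(x)            # quantile position of x
        lo, hi = 0.0, 1.0
        eps = ()
        for _ in range(depth):       # one partition subset visited per layer
            mid = (lo + hi) / 2.0
            if u < mid:
                eps += (0,)
                hi = mid
            else:
                eps += (1,)
                lo = mid
            counts[eps] = counts.get(eps, 0) + 1
    return counts


# Toy data for one variable from one response group.
values = [-1.2, -0.3, 0.1, 0.4, 0.9, 1.5, 2.2, -2.0]
depth = math.floor(math.log2(len(values)))    # truncation layer
counts = subset_counts(values, NormalDist(0.0, 1.0), depth)
```

Each observation visits one subset per layer, so only the subsets that actually contain data are ever stored.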

4 The smoothing parameter $c_{j}$

The choice of the $c_{j}$’s is crucial as it affects both the variable selection and classification rules. Unfortunately, most existing options in the literature, such as assigning a hyper-prior and empirical estimation, have been proposed in the context of low-dimensional problems only and are not easily scalable to high-dimensional settings. Hanson and Johnson (2002), Zhao and Hanson (2011), and Cipolli et al. (2016) dealt with a limited number of smoothing parameters in their respective models by assigning them a hyper-prior. Although these methods yielded good numerical results in their papers, putting hyper-priors on the $c_{j}$’s would compound the computational difficulties we already have to tackle. More specifically, a hyper-prior on each $c_{j}$ leads to an extra $2p$ update steps in our coordinate ascent algorithm (see Table 2). These extra steps may be bypassed by collapsing our model over the $c_{j}$’s, but this leads to difficulties with the computation of the $q$-densities for $\gamma_{j}$ and $y_{n+r}$. Holmes et al. (2015) found that any value of a single smoothing parameter $c$ between 1 and 10 works well in practice. However, a generalisation to multiple smoothing parameters has not been discussed. Among the empirical estimation methods (Berger and Guglielmi, 2001; Chen and Hanson, 2014; Holmes et al., 2015; Cipolli and Hanson, 2018), the only algorithm that has been applied to a model with multiple $c$’s is the one proposed in Chen and Hanson (2014). However, this approach involves a $p$-dimensional grid search, which is infeasible in our high-dimensional context.

Since our context is the analysis of high-dimensional data, we suggest a novel heuristic approach of choosing $c_{j}$ based on an a priori analysis that can be executed efficiently. More specifically, from a list of candidate values we “infer” a likely value of $c_{j}$ that has generated our observed data under each of the two possible hypotheses $\{H_{0j}:\gamma_{j}=0\}$ and $\{H_{1j}:\gamma_{j}=1\}$. Under $H_{0j}$, the value of $c_{j}$ that generated our data is likely to be large if the empirical distribution of $F_{j}$ is close to $G_{j}$, whereas under $H_{1j}$, the value of $c_{j}$ that generated our data is likely to be large if the Euclidean distance between the empirical distributions of $F_{j1}$ and $F_{j0}$ is small. To make the implementation more computationally feasible, the variables are grouped into clusters such that variables within each cluster are assumed to have equal values of $c_{j}$. The best setting for the $c_{j}$’s is selected to minimise the resubstitution classification error. This choice of objective function aligns with the classification objective of our proposed model and differs from the log marginal likelihood objective used in Chen and Hanson (2014). Detailed steps of this proposed a priori analysis can be found in Table 4.

5 Numerical results

In this section, we examine the performance of our proposed method in 6 simulation settings and on 2 publicly available gene expression datasets. We fitted our proposed VNPDA model with the publicly available R package at https://github.com/weichangyu10/VaDA and compared its performance with the following classifiers: variational linear discriminant analysis (VLDA, Yu et al., 2018), variational quadratic discriminant analysis (VQDA, Yu et al., 2018), penalised LDA (penLDA, Witten and Tibshirani, 2011), nearest shrunken centroid (NSC, Tibshirani et al., 2003) and naïve Bayes kernel discriminant analysis (naiveBayesKernel, Strbenac et al., 2015). Both VLDA and VQDA are Bayesian analogues of naïve Bayes discriminant analysis (McLachlan, 1992) that have exhibited competitive classification errors. The penLDA classifier is a penalised version of Fisher’s discriminant analysis that performs well when the true signal (difference in true group-conditional means divided by standard deviation) is sparse, whereas the NSC classifier has been chosen as a competing classifier due to its popularity in the bioinformatics literature. Lastly, naiveBayesKernel is a two-stage nonparametric classifier that performs variable selection with the Kolmogorov-Smirnov test before fitting the selected variables to a naïve Bayes discriminant analysis model that estimates the group-conditional distributions with kernel density estimates. Further details of the comparison methods are provided in Section 1 of the Supplementary Material.

5.1 Simulation Study

The models are trained with $n=100$ observations and are used to classify $m=1000$ new observations in each simulation setting, for $50$ repetitions. At each repetition, the simulated dataset consists of $50$ truly discriminative variables that follow various non-Gaussian distributions (simulation 2 being the exception), and 450 non-discriminative variables. Details of their distributions are provided below and in Figure 2.

Simulation 1

We compare the models’ performances in discriminating a trimodal distribution from a kurtotic unimodal distribution. The distributions have equal means. These two distributions are mentioned in Marron and Wand (1992).

Group 1: $(9/20){\mathcal{N}}(-6/5,(3/5)^{2})+(9/20){\mathcal{N}}(6/5,(3/5)^{2})+(1/10){\mathcal{N}}(0,(0.25)^{2})$ .

Group 0: $(2/3){\mathcal{N}}(0,1)+(1/3){\mathcal{N}}(0,0.1^{2})$ .
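As an illustration of how such group-conditional distributions may be simulated, the sketch below draws from the two normal mixtures of Simulation 1 using only the standard library. The helper `sample_mixture`, the seed, and the sample size are hypothetical choices and do not reproduce the simulation code used in the study.

```python
import random


def sample_mixture(components, size, rng):
    """Draw `size` values from a normal mixture given as (weight, mean, sd) triples."""
    weights = [w for w, _, _ in components]
    draws = []
    for _ in range(size):
        # Pick a component by its weight, then draw from that normal component.
        _, mean, sd = rng.choices(components, weights=weights, k=1)[0]
        draws.append(rng.gauss(mean, sd))
    return draws


# Mixture specifications copied from Simulation 1.
GROUP1 = [(9 / 20, -6 / 5, 3 / 5), (9 / 20, 6 / 5, 3 / 5), (1 / 10, 0.0, 0.25)]
GROUP0 = [(2 / 3, 0.0, 1.0), (1 / 3, 0.0, 0.1)]

rng = random.Random(2019)      # arbitrary seed for reproducibility
x1 = sample_mixture(GROUP1, 20000, rng)
x0 = sample_mixture(GROUP0, 20000, rng)
mean1 = sum(x1) / len(x1)
mean0 = sum(x0) / len(x0)
```

Both sample means are close to zero, in line with the statement that the two distributions have equal means, so a classifier relying on mean separation alone has little signal in this setting.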

Simulation 2

Here, we assess the loss incurred by nonparametric classification when the Gaussian assumption holds.

Group 1: ${\mathcal{N}}(0.7,1)$ .

Group 0: ${\mathcal{N}}(0,1)$ .

Simulation 3

We examine the models’ ability to discriminate distributions that differ by a density spike at $x=0.5$ .

Group 1: $0.5\,{\mathcal{N}}(0,1)+0.5\,{\mathcal{N}}(0.5,0.001^{2})$ .

Group 0: ${\mathcal{N}}(0,1)$ .

Simulation 4

We assess the models’ performances when the group-conditional distributions differ largely by tail thickness.

Group 1: ${\mathcal{N}}(0,1)$ .

Group 0: Cauchy distribution with location [math] and scale $3$ .

Simulation 5

Here, we compare the models’ performances in a challenging classification scenario when the group-conditional distributions differ by an additional minor mode between two major modes. These two distributions are mentioned in Marron and Wand (1992).

Group 1: $(9/20){\mathcal{N}}(-6/5,(3/5)^{2})+(9/20){\mathcal{N}}(6/5,(3/5)^{2})+(1/10){\mathcal{N}}(0,(0.25)^{2})$ .

Group 0: $0.5{\mathcal{N}}(-1,(2/3)^{2})+0.5{\mathcal{N}}(1,(2/3)^{2})$ .

Simulation 6

We assess the models’ performances when discriminating two Exponential distributions of different rates.

Group 1: Exp $(6)$ .

Group 0: Exp $(2)$ .

The non-discriminative variables are partitioned into 9 groups of 50 variables. Within each group, the variables are independent and identically distributed. The distributions of the groups are: $t_{1}$, $\mbox{Cauchy}(0,2)$, $\mbox{Gamma}(2,2)$, $\mbox{Exp}(1)$, ${\mathcal{N}}(0,5^{2})$, ${\mathcal{N}}(0,1)$, $0.1{\mathcal{N}}(0,1)+0.9{\mathcal{N}}(0,0.1^{2})$ (zero-inflated model), $\sum_{\ell=0}^{7}(1/8){\mathcal{N}}(3\{(2/3)^{\ell}-1\},(2/3)^{2\ell})$ (multiple modes), and $0.5{\mathcal{N}}(-1.5,0.5^{2})+0.5{\mathcal{N}}(1.5,0.5^{2})$ (bi-normal).

Results of the simulation study are summarised in Figure 3, Figure 4, and Table 4. The median computation time of VNPDA for one repetition is approximately 55s (an acceptable time) on a 1.6GHz Intel Core i5 processor.

In simulations 1 and 4, three models VNPDA, naiveBayesKernel and VQDA have high selection rates among the discriminative variables. In simulation 1, the group-conditional distributions have well-separated major modes ([math] vs. $\pm 6/5$ ), while the two distributions have differing tail thickness in simulation 4. However, VQDA did not perform as well in variable selection accuracy as it is unable to distinguish noise generated by non-discriminative variables with thick-tailed distributions such as $t_{1}$ and Cauchy $(0,2)$ . This directly leads to VQDA’s poor classification performance. VNPDA performed excellently in simulation 3 and this is evidence of its superiority in detecting density spikes. However, it did not perform better than the other models in simulations 2 and 6. In simulation 2, we expected the nonparametric classifiers naiveBayesKernel and VNPDA to exhibit poorer performance than the other models as the Gaussian assumption holds. In simulation 6, VLDA and penLDA exhibited a slight edge over VNPDA as the nonparametric option did not perform as well in identifying the discriminative variables. This is due to the lack of separation between the modes of the group-conditional distributions.

We repeated the simulation study for various training sample sizes ( $n=30$ and $n=500$ ) and the results are displayed in Section 2 of the Supplementary Material. Classification and variable selection performances generally improve across all models for Simulations 2, 3 and 6 as $n$ increases and remain similar for other simulation settings. The classification errors for VNPDA appear to have stabilised at $n=100$ for all simulation settings except for Simulation 2. When the Gaussian assumption is indeed true, nonparametric DA models require a much larger training sample size to achieve similar classification error with Gaussian DA models. No significant changes in the performance ranking have been observed as sample size changes. These findings suggest that the methods are sample-efficient in Simulations 2, 3 and 6 and that the results hold for a wide range of sample sizes.

Overall, the conditions which appear to be favourable to VNPDA are: (i) differing tail thickness; (ii) well-separated major modes.

5.2 Application to gene expression datasets

Melanoma dataset

The melanoma dataset has been analysed by Mann et al. (2013). The data underwent preprocessing to remove underexpressed genes, i.e. those with median $\leq 7$. Patients whose survival time is less than 1 year and who died of the disease are labelled as poor prognosis; patients who are still alive and free from melanoma after 4 years are labelled as good prognosis. Finally, we standardise the dataset to obtain $z_{ij}=(x_{ij}-\overline{x}_{j})/s_{j}$, where $x_{ij}$ is the gene $j$ reading for observation $i$. The processed dataset that has been analysed consists of $n=47$ observations by $p=12881$ DNA microarray readings.
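The filtering and standardisation steps can be sketched as follows. The sketch assumes that $s_{j}$ is the sample standard deviation (denominator $n-1$) and that filtering precedes standardisation; the function `preprocess` and the toy matrix are hypothetical.

```python
from statistics import mean, median, stdev


def preprocess(rows, median_cutoff=7.0):
    """Drop columns with median <= cutoff, then standardise the remaining columns.

    rows: list of observations, each a list of gene readings.
    Returns the kept column indices and the standardised rows.
    Columns with zero standard deviation are not handled in this sketch.
    """
    columns = list(zip(*rows))
    kept = [j for j, col in enumerate(columns) if median(col) > median_cutoff]
    standardised_cols = []
    for j in kept:
        col = columns[j]
        centre, scale = mean(col), stdev(col)   # stdev uses the n - 1 denominator
        standardised_cols.append([(v - centre) / scale for v in col])
    return kept, [list(r) for r in zip(*standardised_cols)]


# Toy matrix: 3 observations by 3 genes; the middle gene is underexpressed.
toy = [[8, 1, 10], [9, 2, 12], [10, 3, 14]]
kept, z = preprocess(toy)
```

After this step every retained gene has sample mean zero and unit sample variance, so the Gaussian centring distributions $G_{j}$ are on a common scale across genes.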

Sarcoma dataset

The sarcoma dataset was uploaded by Colaprico et al. (2016) and is publicly available on Bioconductor. The data underwent preprocessing to retain only variables with variance $>0.1$. Patients whose survival time is less than the 20th percentile are labelled as poor prognosis; patients whose survival time is greater than the 80th percentile are labelled as good prognosis. The dataset is standardised in a similar manner to the melanoma dataset. The processed dataset that has been analysed consists of $n=74$ observations by $p=20449$ RNA readings.

The classification errors are summarised in Figure 5. The median computation times for one CV iteration are approximately 80s and 200s for the melanoma and sarcoma datasets respectively on a 1.6GHz Intel Core i5 processor. This is slower than the other methods but is believed to be within an acceptable range for most users. In terms of classification errors, VNPDA outperformed the Gaussian DA models, including VLDA, in the melanoma dataset, but there is no significant difference in performance in the sarcoma dataset. This warrants a more detailed investigation into the reasons for this disparity. For simplicity, we focus on the performances of VLDA and VNPDA, as these classifiers exhibited the greatest contrast in performance between the two gene expression datasets. A visualisation of the genes selected by each classifier is provided in Section 3 of the Supplementary Material.

Based on the Venn diagrams in Figure 6, it is clear that the disparity is due to the number of frequently selected genes. In particular, we found that VNPDA has more frequently selected genes than VLDA in the melanoma dataset, whereas VLDA has more frequently selected genes than VNPDA in the sarcoma dataset.

Next, we identify the reason for this difference in the number of frequently selected genes. In the melanoma dataset, there are 925 genes selected by VNPDA but not by VLDA. Among these genes, a substantial proportion (18.8%) are not selected by VLDA because their group-conditional distributions exhibit a strong departure from normality (Shapiro-Wilk $p$-value $<0.05$). Figure 7 presents the Pólya predictive density plots (see Hanson and Johnson, 2002, for the definition) of $9$ of these genes, which show that their group-conditional distributions are favourable for VNPDA, i.e., they either differ in tail thickness or have well-separated major modes. For example, the group-conditional distribution of the poor prognosis group (red) for the C1ORF53 gene resembles a kurtotic unimodal distribution.

In contrast, more genes are selected by VLDA than by VNPDA in the sarcoma dataset. By plotting the Pólya predictive density of genes that are selected by VLDA but not by VNPDA, we observed that many of these genes have unimodal, right-skewed group-conditional distributions (see Figure 8). We notice that the positions of their group-conditional major modes nearly coincide with one another. Such conditions lead to a smaller Bayes factor for the VNPDA variable selection rule and hence weaker evidence for the variable to be discriminative. On the other hand, the VLDA variable selection rule performed better as it depends only on the separation between the group-conditional means, and these are indeed well separated among most of the genes.

6 Discussion

In this paper, we presented a novel Bayesian discriminant analysis model that performs both variable selection and classification without making assumptions about the parametric form of the unknown distributions. To deal with our uncertainty about these distributions, we assigned them with Pólya tree priors. Since the results using the Pólya tree prior are sensitive to the choice of the smoothing parameter, we suggested a data-driven approach based on an a priori inference that helps with the choice of this parameter. By adopting a CVB approximation for posterior inference, we arrive at a classification rule that carries a heuristic interpretation.

As Bayesian nonparametric methods are unpopular for analysing high-dimensional data due to the computational cost, we applied computational short-cuts to the CVB update equations. The approximations effectively isolate the updates of the variable selection probabilities from the classification probabilities. Thus, we compute the variable selection probabilities in an iterative loop before using the converged value for calculating the classification probabilities. This, in combination with an implementation in C++, led to a computation cost that we believe is within an acceptable range (80 to 200s) for most users in both the simulated and publicly available datasets we have examined.

The numerical results indicate that our proposed model performs reasonably well in most cases and is superior when the group-conditional distributions have either well-separated major modes or differing tail thickness. These findings are supported by our examination of the group-conditional distributions of the variables in two publicly available datasets.

Our proposed model structure also allows for several possible extensions. A possible extension that is beyond the scope of the motivating problem is to accommodate categorical variables in three scenarios. In the first scenario, in which all variables are binary, we may model each variable with a pair of Bernoulli-beta group-conditional distributions under the alternative hypothesis and a corresponding common Bernoulli-beta distribution under the null. In the second scenario, in which we have a mix of continuous and binary variables, the continuous variables may be modelled in accordance with equation (2), while binary variables are modelled with the Bernoulli-beta distributions. In the third scenario, in which we have a mix of continuous and nominal variables, each nominal variable may be modelled with a multinomial-Dirichlet distribution. If the variables are ordinal instead of nominal, we may utilise the structure of the categorical ordering by penalising the differences between adjacent categorical probabilities (see for example Gertheiss and Tut, 2009; Witten and Tibshirani, 2011). To reduce the number of parameters to be estimated, the models in all three scenarios may be marginalised over the probability parameters. When there is a need to account for correlation between variables, suitable strategies include re-representing each multi-category variable with a set of binary variables (Cipolli and Hanson, 2018) or specifying a location model (Krzanowski, 1975; Daudin and Bar-Hen, 1999; Mbina et al., 2018).

Another possible extension is to allow for $g>2$ response groups, which consequently involves a comparison of $O\{(g/\ln(g))^{g}\}$ possible hypotheses (Gian-Carlo, 1964). This drastically increases the computational burden of our algorithm and is therefore left for future research.
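The growth in the number of hypotheses can be illustrated with a short computation. The sketch assumes that each hypothesis corresponds to a partition of the $g$ response groups into sets that share a common distribution, so that the count is the $g$th Bell number; this reading of the cited partition count is an assumption.

```python
def bell_number(g):
    """Number of partitions of a set of g >= 1 elements, via the Bell triangle."""
    row = [1]
    for _ in range(g - 1):
        # Each new row starts with the last entry of the previous row,
        # and every further entry adds the entry above-left to its left neighbour.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


hypothesis_counts = [bell_number(g) for g in range(1, 8)]
```

For $g=2$ the count is 2, matching the pair $\{H_{0j},H_{1j}\}$ used throughout, while for $g=7$ there are already 877 candidate hypotheses per variable.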

Taking these into account, we believe that VNPDA has great potential for analysing many high-dimensional datasets when the normality assumption is questionable.

Funding information and conflict of interest

This research was partially supported by an Australian Postgraduate Award (Weichang Yu); and the Australian Research Council Discovery Project grant DP170100654 (John T. Ormerod).

The authors declare that they have no conflict of interest.

Appendix A Derivation for algorithm in Table 2

We shall present the derivation in the case where we have a single new observation $({\bf x}_{n+1},y_{n+1})$ and describe the generalisation to multiple new observations towards the end. Following (8), the updates for the parameter $\omega_{j}$ may be computed as

[TABLE]

The expression involves an expectation of the log Bayes factor

[TABLE]

where bin $(\ell)$ is the set of all binary representations of length $\ell$ , $I(\cdot)$ is the indicator function, ${\mathcal{B}}(a,b)=\Gamma(a)\Gamma(b)/\Gamma(a+b)$ is the Beta function, $n_{j,\epsilon}^{(k)}$ is the number of group $k$ observations in the partition-subset $B_{j,\epsilon}$ and $n_{j,\epsilon}=n_{j,\epsilon}^{(1)}+n_{j,\epsilon}^{(0)}$ .
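A minimal sketch of how one tree node contributes to such a log Bayes factor is given below. It follows the general form of the two-sample Pólya tree test of Holmes et al. (2015), in which Beta functions of the group-specific child counts are compared with those of the pooled counts; the symmetric prior parameter `alpha` and the function names are assumptions, and the sketch omits the layer-specific prior parameters and indicator terms of the full expression.

```python
import math


def log_beta(a, b):
    """ln B(a, b) = ln Gamma(a) + ln Gamma(b) - ln Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def node_log_bf(alpha, n1_left, n1_right, n0_left, n0_right):
    """Log Bayes factor contribution of one node (separate versus pooled branching).

    alpha: assumed symmetric Beta prior parameter at this node.
    n1_*, n0_*: counts of group 1 and group 0 observations in the two children.
    """
    separate = (log_beta(alpha + n1_left, alpha + n1_right)
                + log_beta(alpha + n0_left, alpha + n0_right))
    pooled = (log_beta(alpha + n1_left + n0_left, alpha + n1_right + n0_right)
              + log_beta(alpha, alpha))
    return separate - pooled


opposite = node_log_bf(1.0, 10, 0, 0, 10)   # groups branch in opposite directions
similar = node_log_bf(1.0, 5, 5, 5, 5)      # groups branch identically
empty = node_log_bf(1.0, 0, 0, 4, 6)        # no group 1 observations at this node
```

A node at which one group has no observations contributes exactly zero, which is why the sum over layers can be truncated once the groups no longer share a subset.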

Clearly, the expectation in the update for $\omega_{j}$ above involves nonlinear functions of ${\boldsymbol{\gamma}}_{-j}$ and $y_{n+1}$. Therefore, we utilise some approximations to get around this. We may use Taylor’s expansion about ${\bf 1}^{T}{\boldsymbol{\omega}}_{-j}$ to approximate

[TABLE]

and, for small $1/n$ , a Stirling’s approximation to approximate the Beta functions

[TABLE]

Hence, the update equation for $\omega_{j}$ may be approximated as

[TABLE]

where

[TABLE]

The sum to infinity in the above expression is computationally tractable as the subset counts decrease as we go further down the layers of the tree. In particular, there exists a constant $M_{j}$ such that either $n_{j,\epsilon}^{(1)}=0$ or $n_{j,\epsilon}^{(0)}=0$ for all $\epsilon\in\bigcup_{\ell>M_{j}}$ bin $(\ell)$ and $k\in\{0,1\}$. Hence, we may rewrite the log Bayes factor as

[TABLE]

Similarly, we use (8) to derive the update for $y_{n+1}$ as

[TABLE]

and if $n$ is large, then

[TABLE]

where the number of observations in groups 1 and 0 are $n_{1}$ and $n_{0}$ respectively, the vector ${\boldsymbol{\pi}}^{(k)}$ of size $p$ is such that the j-th element is

[TABLE]

the $\ln$ prefix of a vector denotes element-wise $\ln$ , the binary representation $\epsilon_{j}(\ell)$ denotes the first $\ell$ branching directions taken by $x_{n+1,j}$ and the number $N_{j}$ is such that $n_{j,\epsilon_{j}(\ell)}^{(k)}=0$ for all $\ell>N_{j}$ and $k\in\{0,1\}$ . Note that the approximation in equation (A) follows from Taylor’s approximation.

We may extend the classification rule in equation (A) to $m>1$ new observations by simply replacing each ${\boldsymbol{\pi}}^{(k)}$ with ${\boldsymbol{\pi}}_{r}^{(k)}$ , where the j-th element of ${\boldsymbol{\pi}}_{r}^{(k)}$ is

[TABLE]

the number $N_{j}$ is such that $n_{j,\epsilon_{rj}(\ell)}^{(k)}=0$ for all $\ell>N_{j}$ and $1\leq r\leq m$ and the rest of the notations used have been explained in Section 3.

Remark: In our numerical examples, we truncate the Pólya tree priors at layers $N_{j}=M_{j}=\log_{2}(n)$ for all $1\leq j\leq p$. This allows us to account for the details of the group-conditional distributions at higher resolution as $n$ increases (Hanson and Johnson, 2002).

References

Abramowitz and Stegun (2002)

Abramowitz, M., Stegun, I., 2002.

Handbook of mathematical functions. Wiley.

Ahdesmäki and Strimmer (2010)

Ahdesmäki, M., Strimmer, K., 2010.

Feature selection in omics prediction problems using CAT score and false discovery rate control.

The Annals of Applied Statistics 4, 503–519.

Benjamini and Speed (2012)

Benjamini, Y., Speed, T.P., 2012.

Summarizing and correcting the GC content bias in high-throughput sequencing.

Nucleic Acids Research 40.

doi:10.1093/nar/gks001.

Berger and Guglielmi (2001)

Berger, J.O., Guglielmi, A., 2001.

Bayesian and conditional frequentist testing of a parametric model versus nonparametric alternatives.

Journal of the American Statistical Association 96, 174–184.

Bickel and Levina (2004)

Bickel, P.J., Levina, E., 2004.

Some Theory for Fisher’s linear discriminant function, ‘Naive Bayes’ and some alternatives when there are many more variables than observations.

Bernoulli 10, 989–1010.

Blackwell (1973)

Blackwell, D., 1973.

Discreteness of Ferguson selections.

The Annals of Statistics 1, 356–358.

Blei et al. (2017)

Blei, D., Kucukelbir, A., McAuliffe, J.D., 2017.

Variational inference: a review for statisticians.

Journal of the American Statistical Association 112, 859–877.

Box and Cox (1964)

Box, G.E.P., Cox, D.R., 1964.

An analysis of transformations.

Journal of the Royal Statistical Society Series B 26, 211–252.

Castillo et al. (2015)

Castillo, I., Schmidt-Hieber, J., van der Vaart, A., 2015.

Bayesian linear regression with sparse priors.

The Annals of Statistics 43, 1986–2018.

Celeux (2006)

Celeux, G., 2006.

Advances in data analysis. Springer. chapter Mixture models for classification.

pp. 3–14.

Chen and Hanson (2014)

Chen, Y., Hanson, T.E., 2014.

Bayesian nonparametric k-sample tests for censored and uncensored data.

Computational Statistics and Data Analysis 71, 335–346.

Cipolli and Hanson (2018)

Cipolli, W., Hanson, T.E., 2018.

Supervised learning via smoothed Polya trees.

Advances in Data Analysis and Classification , 1–28.

Cipolli et al. (2016)

Cipolli, W., Hanson, T.E., McLain, A., 2016.

Bayesian nonparametric multiple testing.

Computational Statistics and Data Analysis 101, 64–79.

Colaprico et al. (2016)

Colaprico, A., Silva, T., Olsen, C., Garofano, L., Cava, C., Garolini, D., Sabedot, T., Malta, T., Pagnotta, S., Castiglioni, I., Ceccarelli, M., Bontempi, G., Noushmehr, H., 2016.

TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data.

Nucleic acids research 44, e71.

Daudin and Bar-Hen (1999)

Daudin, J.J., Bar-Hen, A., 1999.

Selection in discriminant analysis with continuous and discrete variables.

Computational Statistics and Data Analysis 32, 161–175.

Fan and Fan (2008)

Fan, J., Fan, Y., 2008.

High-dimensional classification using features annealed independence rules.

The Annals of Statistics 36, 2605–2637.

Fan and Lv (2010)

Fan, J., Lv, J., 2010.

A selective overview of variable selection in high dimensional feature space.

Statistica Sinica 20, 101–148.

Filippi and Holmes (2017)

Filippi, S., Holmes, C.C., 2017.

A Bayesian nonparametric approach to testing for dependence between random variables.

Bayesian Analysis 12, 919–938.

Fisher (1936)

Fisher, R.A., 1936.

The use of multiple measurements in taxonomic problems.

Annals of Eugenics 7, 179–188.

Fraley and Raftery (2002)

Fraley, C., Raftery, A.E., 2002.

Model-based clustering, discriminant analysis, and density estimation.

Journal of the American Statistical Association 97, 611–631.

Friedman (1989)

Friedman, J.H., 1989.

Regularized discriminant analysis.

Journal of the American Statistical Association 84, 165–175.

Fuentes-García et al. (2010)

Fuentes-García, R., Mena, R.H., Walker, S.G., 2010.

A probability for classification based on Dirichlet process mixture model.

Journal of Classification 27, 389–403.

Gertheiss and Tut (2009)

Gertheiss, J., Tut, G., 2009.

Penalized regression with ordinal predictors.

International statistical review 77, 345–365.

Ghosh and Chaudhuri (2004)

Ghosh, A.K., Chaudhuri, P., 2004.

Optimal smoothing in kernel discriminant analysis.

Statistica Sinica 14, 457–483.

Ghosh et al. (2006)

Ghosh, A.K., Chaudhuri, P., Sengupta, D., 2006.

Classification using kernel density estimates: multiscale analysis and visualization.

Technometrics 48, 120–132.

Gian-Carlo (1964)

Gian-Carlo, R., 1964.

The number of partitions of a set.

American Mathematical Monthly 71, 498–504.

Gutiérrez et al. (2014)

Gutiérrez, L., Gutiérrez-Peña, E., Mena, R.H., 2014.

Bayesian nonparametric classification for spectroscopy data.

Computational Statistics and Data Analysis 78, 56–68.

Hall and Wand (1988)

Hall, P., Wand, M.P., 1988.

On nonparametric discrimination using density differences.

Biometrika 75, 541–547.

Hanson (2006)

Hanson, T.E., 2006.

Inference for mixtures of finite Pólya tree models.

Journal of the American Statistical Association 101, 1548–1565.

Hanson and Johnson (2002)

Hanson, T.E., Johnson, W.O., 2002.

Modeling regression error with a mixture of Pólya trees.

Journal of the American Statistical Association 97, 1020–1033.

Hastie and Tibshirani (1993)

Hastie, T., Tibshirani, R., 1993.

Discriminant analysis by Gaussian mixtures.

Journal of the Royal Statistical Society Series B 18, 87–95.

Holmes et al. (2015)

Holmes, C.C., Francois, C., Griffin, J.E., Stephens, D.A., 2015.

Two-sample Bayesian nonparametric hypothesis testing.

Bayesian Analysis 10, 297–320.

Krzanowski (1975)

Krzanowski, W.J., 1975.

Discrimination and classification using both binary and continuous variables.

Journal of the American Statistical Association 70, 782–790.

Lavine (1992)

Lavine, M., 1992.

Some aspects of Pólya tree distributions for statistical modelling.

The Annals of Statistics 20, 1222–1235.

Ma and Wong (2011)

Ma, L., Wong, W.H., 2011.

Coupling optional Pólya trees and the two sample problem.

Journal of the American Statistical Association 106, 1553–1565.

Mann et al. (2013)

Mann, G.J., Pupo, G.M., Campain, A.E., Carter, C.D., Schramm, S.J., Pianova, S., Gerega, S.K., DeSilva, C., Lai, K., Wilmott, J.S.e.a., 2013.

BRAF mutations, NRAS mutation, and absence of an immune-related expressed gene profile predict poor outcome in patients with stage III melanoma.

Journal of Investigative Dermatology 133, 509–517.

Marron and Wand (1992)

Marron, J.S., Wand, M.P., 1992.

Exact Mean Integrated Squared Error.

The Annals of Statistics 20, 712–736.

Mauldin et al. (1992)

Mauldin, R.D., Sudderth, W.D., Williams, S.C., 1992.

Pólya trees and random distributions.

The Annals of Statistics 20, 1203–1221.

Mbina et al. (2018)

Mbina, A.M., Nkiet, G.M., Obiang, F.E., 2018.

Variable selection in discriminant analysis for mixed continuous-binary variables and several groups.

Advances in Data Analysis and Classification , 1–23.

McLachlan (1992)

McLachlan, G.J., 1992.

Discriminant analysis and statistical pattern recognition. Wiley.

Ouyang and Liang (2017)

Ouyang, Y., Liang, F., 2017.

An empirical Bayes approach for high dimensional classification.

arXiv.

Strbenac et al. (2015)

Strbenac, D., Mann, G.J., Ormerod, J.T., Yang, J.Y.H., 2015.

ClassifyR: an R package for performance assessment of classification with applications to transcriptomics.

Bioinformatics 31, 1851–1853.

Sugiyama (2007)

Sugiyama, M., 2007.

Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis.

Journal of Machine Learning Research 8, 1027–1061.

Teh et al. (2007)

Teh, Y.W., Newman, D., Welling, M., 2007.

A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation, in: Advances in Neural Information Processing Systems. MIT Press. volume 19, pp. 1353–1360.

Tibshirani et al. (2003)

Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G., 2003.

Class Prediction by Nearest Shrunken Centroids, with Applications to DNA Microarrays.

Statistical Science 18, 104–117.

Witten and Tibshirani (2011)

Witten, D., Tibshirani, R., 2011.

Penalized classification using Fisher’s linear discriminant.

Journal of the Royal Statistical Society Series B 73, 754–772.

Witten (2011)

Witten, D.M., 2011.

Classification and clustering of sequencing data using a Poisson model.

The Annals of Applied Statistics 5, 2493–2518.

Yu et al. (2018)

Yu, W., Ormerod, J.T., Stewart, M., 2018.

Variational discriminant analysis with variable selection.

Submitted to Statistics and Computing.

Zhao and Hanson (2011)

Zhao, L., Hanson, T.E., 2011.

Spatially dependent Polya tree modeling for survival data.

Biometrics 67, 391–403.

