A score function for Bayesian cluster analysis

John Noble; Łukasz Rajkowski

arXiv:1905.10209·stat.OT·May 27, 2019

TL;DR

This paper introduces a parameter-free score function for Bayesian clustering that balances within-cluster variance and between-cluster entropy, aiding in selecting the optimal number of clusters.

Contribution

The proposed score function is a novel, parameter-free tool for Bayesian clustering that improves cluster number selection in existing methods.

Findings

01 Effective in choosing the number of clusters

02 Balances variance and entropy considerations

03 Applicable to hierarchical and K-means clustering

Abstract

We propose a score function for Bayesian clustering. The function is parameter free and captures the interplay between the within cluster variance and the between cluster entropy of a clustering. It can be used to choose the number of clusters in well-established clustering methods such as hierarchical clustering or the $K$-means algorithm.

Equations

\mathcal{D}(\mathbf{x},\mathcal{I}):=-\frac{1}{2}\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\det\Big{(}\frac{\hat{\mathbf{V}}_{\mathbf{x}}}{|I|}+\hat{\mathbf{V}}_{\mathbf{x}}(I)\Big{)}+\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\frac{|I|}{n}.

\mathcal{D}_{\Sigma}(\mathbf{x},\mathcal{I}):=-\frac{1}{2}\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\Big{|}\frac{\Sigma}{|I|}+\hat{\mathbf{V}}_{\mathbf{x}}(I)\Big{|}+\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\frac{|I|}{n}.

v^{t}\left(\sum_{i\in I}(x_{i}-\overline{\mathbf{x}_{I}})(x_{i}-\overline{\mathbf{x}_{I}})^{t}\right)v=\sum_{i\in I}\big(v^{t}(x_{i}-\overline{\mathbf{x}_{I}})\big)^{2}\geq 0

\begin{split}\mathcal{D}\big{(}L(\mathbf{x}),\mathcal{I}\big{)}&=-\frac{1}{2}\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\Big{|}\frac{1}{|I|}A\hat{\mathbf{V}}_{\mathbf{x}}A^{t}+\frac{1}{|I|}\sum_{i\in I}A(x_{i}-\overline{\mathbf{x}_{I}})(x_{i}-\overline{\mathbf{x}_{I}})^{t}A^{t}\Big{|}+\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\frac{|I|}{n}=\\ &=-\frac{1}{2}\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\Big{|}A\big{(}\frac{1}{|I|}\hat{\mathbf{V}}_{\mathbf{x}}+\frac{1}{|I|}\sum_{i\in I}(x_{i}-\overline{\mathbf{x}_{I}})(x_{i}-\overline{\mathbf{x}_{I}})^{t}\big{)}A^{t}\Big{|}+\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\frac{|I|}{n}=\\ &=-\frac{1}{2}\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\Big{(}|A|\Big{|}\frac{1}{|I|}\hat{\mathbf{V}}_{\mathbf{x}}+\frac{1}{|I|}\sum_{i\in I}(x_{i}-\overline{\mathbf{x}_{I}})(x_{i}-\overline{\mathbf{x}_{I}})^{t}\Big{|}|A^{t}|\Big{)}+\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\frac{|I|}{n}=\\ &=\mathcal{D}(\mathbf{x},\mathcal{I})-\ln|A|,\end{split}

-\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\Big{|}\frac{\Sigma}{|I|}\Big{|}=d\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\frac{|I|}{n}+d\ln n-\ln|\Sigma|.

\begin{array}[]{rcl}\bm{p}=(p_{i})_{i=1}^{m}&\sim&\nu\\ \bm{\theta}=(\theta_{i})_{i=1}^{m}&\stackrel{{\scriptstyle\textrm{iid}}}{{\sim}}&\pi\\ \mathbf{x}=(x_{1},\ldots,x_{n})\,|\,\bm{p},\bm{\theta}&\stackrel{{\scriptstyle\textrm{iid}}}{{\sim}}&\sum_{i=1}^{m}p_{i}G_{\theta_{i}}.\end{array}

\begin{array}[]{rcl}\Lambda&\sim&\mathcal{W}^{-1}(\eta_{0}+d+1,\eta_{0}\Sigma_{0})\\ \mu\,|\,\Lambda&\sim&\mathcal{N}(\mu_{0},\Lambda/\kappa_{0})\end{array}

\begin{array}[]{rl}\mathbb{E}\,\Lambda&=\Sigma_{0},\\ \mathbf{V}(\mu)&=\mathbb{E}\,\mathbf{V}(\mu\,|\,\Lambda)+\mathbf{V}\mathbb{E}\,(\mu\,|\,\Lambda)=\mathbb{E}\,\Lambda/\kappa_{0}+\mathbf{V}(\mu_{0})=\Sigma_{0}/\kappa_{0},\end{array}

\begin{array}[]{rcl}\bm{p}=(p_{i})_{i=1}^{m}&\sim&\nu\\ \bm{\theta}=(\theta_{i})_{i=1}^{m}&\stackrel{{\scriptstyle\textrm{iid}}}{{\sim}}&\pi\\ \bm{\phi}=(\phi_{1},\ldots,\phi_{n})\,|\,\bm{p},\bm{\theta}&\stackrel{{\scriptstyle\textrm{iid}}}{{\sim}}&\sum_{i=1}^{m}p_{i}\delta_{\theta_{i}}\\ x_{i}\,|\,\bm{p},\bm{\theta},\bm{\phi}&\sim&G_{\phi_{i}}\quad\textrm{ independently for all $i\leq n$.}\end{array}

\mathcal{P}_{\nu,n}(\mathcal{I})=\frac{\alpha^{|\mathcal{I}|}}{\alpha^{(n)}}\prod_{I\in\mathcal{I}}(|I|-1)!,

\mathcal{P}_{\nu,n}(\mathcal{I})=\int_{\Delta^{m}}\sum_{\psi\colon\mathcal{I}\stackrel{1-1}{\to}[m]}\prod_{I\in\mathcal{I}}p_{\psi(I)}^{|I|}\,\mathrm{d}\nu(\bm{p})

\begin{array}[]{rcl}\mathcal{I}&\sim&\mathcal{P}_{\nu,n}\\ \mathbf{x}_{I}:=(x_{i})_{i\in I}\,|\,\mathcal{I}&\sim&f_{|I|}\quad\textrm{ independently for all $I\in\mathcal{I}$}\end{array}

f_{k}(u_{1},\ldots,u_{k}):=\int_{\Theta}\pi(\theta)\prod_{i=1}^{k}g_{\theta}(u_{i})\,\mathrm{d}\theta.

f(\mathbf{x}\,|\,\mathcal{I}):=\prod_{I\in\mathcal{I}}f_{|I|}(\mathbf{x}_{I}).

\begin{array}[]{rcl}\mathcal{I}&\sim&\mathcal{P}_{\nu,n}\\ \mathbf{x}\,|\,\mathcal{I}&\sim&f(\cdot\,|\,\mathcal{I}).\end{array}

f_{k}(\mathbf{u})=\frac{|\eta_{0}\Sigma_{0}|^{\nu_{0}/2}\kappa_{0}^{1/2}\Gamma_{d}\big(\frac{\nu_{k}}{2}\big)}{\pi^{dk/2}\kappa_{k}^{1/2}\Gamma_{d}\big(\frac{\nu_{0}}{2}\big)}\cdot\det\left(\Sigma(\mathbf{u})\right)^{-\nu_{k}/2},

\nu_{k}=\eta_{0}+d+1+k,\ \kappa_{k}=\kappa_{0}+k\quad\textrm{and}

\Sigma(\mathbf{u})=\eta_{0}\Sigma_{0}+\sum_{i=1}^{k}(u_{i}-\overline{\mathbf{u}})(u_{i}-\overline{\mathbf{u}})^{t}+\frac{\kappa_{0}k}{\kappa_{k}}(\overline{\mathbf{u}}-\mu_{0})(\overline{\mathbf{u}}-\mu_{0})^{t}.

\mathcal{V}_{P}(\mathcal{A})=\sum_{A\in\mathcal{A}}P(A)\ln|\mathbf{V}_{P}(A)|,\qquad\mathcal{H}_{P}(\mathcal{A})=-\sum_{A\in\mathcal{A}}P(A)\ln P(A),

\overline{\Delta}_{P}(\mathcal{A})=-\frac{1}{2}\mathcal{V}_{P}(\mathcal{A})-\mathcal{H}_{P}(\mathcal{A})

\overline{\Delta}_{P}(\mathcal{A})=-\frac{1}{2}\sum_{A\in\mathcal{A}}P(A)\ln|\mathbf{V}_{P}(A)|+\sum_{A\in\mathcal{A}}P(A)\ln P(A)

(2\alpha n^{a})^{-\frac{1}{n^{b}}}<a_{n}^{-\frac{1}{n^{b}}}<a_{n}^{b_{n}-\beta}<a_{n}^{\frac{1}{n^{b}}}<(2\alpha n^{a})^{\frac{1}{n^{c}}}

\sqrt[n]{\Gamma(x_{n})}\approx\Big(2\pi x_{n}\big(\frac{x_{n}}{e}\big)^{x_{n}}\Big)^{1/n}=(2\pi x_{n})^{1/n}\big(\frac{x_{n}}{e}\big)^{x_{n}/n}\approx\big(\lambda\frac{n}{e}\big)^{\lambda}

\sqrt[n]{\Gamma_{d}(x_{n})}=\sqrt[n]{\pi^{d(d-1)/4}}\prod_{j=1}^{d}\sqrt[n]{\Gamma\big(x_{n}-\frac{j-1}{2}\big)}\approx\big(\lambda\frac{n}{e}\big)^{\lambda d}.

\sqrt[n]{\Gamma_{d}\Big(\frac{|J^{A}_{n}|+n_{0}}{2}\Big)}\stackrel{\textnormal{a.s.}}{\approx}\Big(\frac{P(A)}{2}\cdot\frac{n}{e}\Big)^{P(A)d/2}.

\sqrt[n]{\prod_{A\in\mathcal{A}}\Gamma_{d}\Big(\frac{|J^{A}_{n}|+n_{0}}{2}\Big)}\stackrel{\textnormal{a.s.}}{\approx}\Big(\prod_{A\in\mathcal{A}}P(A)^{P(A)}\Big)^{d/2}\Big(\frac{n}{2e}\Big)^{d/2}.

\sum_{i\in J^{A}_{n}}(x_{i}-\overline{\mathbf{x}_{A}})(x_{i}-\overline{\mathbf{x}_{A}})^{t}/|J^{A}_{n}|\stackrel{\textnormal{a.s.}}{\approx}\mathbf{V}_{P}(A)\quad\textrm{for }A\in\mathcal{A}

\begin{split}\big{|}\Sigma(\mathbf{X}_{J^{\mathcal{A}}_{n}})\big{|}/|J^{A}_{n}|^{d}&=\Big{|}\Sigma_{0}/|J^{A}_{n}|+\sum_{i\in J^{A}_{n}}(x_{i}-\overline{\mathbf{x}_{A}})(x_{i}-\overline{\mathbf{x}_{A}})^{t}/|J^{A}_{n}|+\frac{k_{0}}{k_{0}+{|J^{A}_{n}|}}(\overline{\mathbf{x}_{A}}-\mu_{0})(\overline{\mathbf{x}_{A}}-\mu_{0})^{t}\Big{|}\stackrel{{\scriptstyle\textnormal{a.s.}}}{{\approx}}\\ &\stackrel{{\scriptstyle\textnormal{a.s.}}}{{\approx}}\Big{|}\sum_{i\in J^{A}_{n}}(x_{i}-\overline{\mathbf{x}_{A}})(x_{i}-\overline{\mathbf{x}_{A}})^{t}/|J^{A}_{n}|\Big{|}\stackrel{{\scriptstyle\textnormal{a.s.}}}{{\approx}}|\mathbf{V}_{P}(A)|\\ \end{split}

\sqrt[n]{\big|\Sigma(\mathbf{X}_{J^{A}_{n}})\big|^{-(|J^{A}_{n}|+n_{0})/2}}\stackrel{\textnormal{a.s.}}{\approx}\big(P(A)^{P(A)}\big)^{-d/2}n^{-dP(A)/2}|\mathbf{V}_{P}(A)|^{-P(A)/2}

\sqrt[n]{\prod_{A\in\mathcal{A}}\big|\Sigma(\mathbf{X}_{J^{A}_{n}})\big|^{-(|J^{A}_{n}|+n_{0})/2}}\stackrel{\textnormal{a.s.}}{\approx}\Big(\prod_{A\in\mathcal{A}}P(A)^{P(A)}\Big)^{-d/2}n^{-d/2}\prod_{A\in\mathcal{A}}|\mathbf{V}_{P}(A)|^{-P(A)/2}

Taxonomy

Topics: Bayesian Methods and Mixture Models · Advanced Clustering Algorithms Research · Data Management and Algorithms

Full text

1 Introduction

Many clustering methods generate a family of clusterings that depends on user-defined parameters. The most prominent example is the $K$-means algorithm, where the investigator has to specify the number of clusters. Similarly, hierarchical clustering produces a whole family of clusterings, starting from the finest partition into singletons and ending with the coarsest clustering, i.e. a single cluster. Again, the investigator chooses the number of clusters, this time based on the dendrogram.

All these methods come with a variety of suggestions on how to choose the optimal number of clusters. Some of these are rather heuristic in nature, while others have deep theoretical foundations. For the $K$-means algorithm these include the elbow method and the average silhouette method (Rousseeuw (1987)). Another solution is to use a score statistic (a function intended to measure the quality of a clustering) and, among the clusterings proposed by a given method, to choose the one that maximises it. Constructing score statistics is not a trivial task; one of the most popular choices is the gap statistic (Tibshirani et al. (2001)).

In this article we propose a new score statistic. It is derived as a limit of the first order approximation to the posterior probability (up to the norming constant) in a Nonparametric Bayesian Mixture Model with the inverse Wishart distribution as a base measure for the within group covariance matrices and the Gaussian distribution as a base measure for the cluster means and the component measure. In order to derive the limit we assume that the data is an independent sample from some ‘input’ probability distribution on the observation space; this gives a method of assessing the compatibility of the partitions of the observation space to the input distribution. The score function is obtained by taking the empirical measure as the input distribution and tweaking it slightly so that it is well defined on all possible data clusterings.

1.1 Contribution and Results

Our main contribution is the formulation of a novel score function for clusterings, which is motivated theoretically and performs well on analysed datasets. Suppose that we have a sequence of observations $x_{1},\ldots,x_{n}\in\mathbb{R}^{d}$ and we believe that it consists of several groups, and that within every group the data is distributed according to some Gaussian distribution (with unknown mean and covariance matrix). The goal is to construct a simple function that measures how well a given clustering of the dataset agrees with the assumption that the data are Gaussian within clusters. Our proposal is the following: for $I\subset[n]$ we define $\overline{\mathbf{x}_{I}}=\frac{1}{|I|}\sum_{i\in I}x_{i}$ and $\hat{\mathbf{V}}_{\mathbf{x}}(I)=\frac{1}{|I|}\sum_{i\in I}(x_{i}-\overline{\mathbf{x}_{I}})(x_{i}-\overline{\mathbf{x}_{I}})^{t}$, and for notational simplicity we write $\hat{\mathbf{V}}_{\mathbf{x}}:=\hat{\mathbf{V}}_{\mathbf{x}}([n])$. For $\mathbf{x}=(x_{1},\ldots,x_{n})$ and a partition $\mathcal{I}$ of $[n]=\{1,2,\ldots,n\}$ let

$$\mathcal{D}(\mathbf{x},\mathcal{I}):=-\frac{1}{2}\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\det\Big(\frac{\hat{\mathbf{V}}_{\mathbf{x}}}{|I|}+\hat{\mathbf{V}}_{\mathbf{x}}(I)\Big)+\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\frac{|I|}{n}. \tag{1.1}$$

It should be noted that if $\mathbf{x}$ is a realisation of an independent random sample $X_{1},\ldots,X_{n}$ from some distribution $P$ on $\mathcal{X}$, then the components of formula (1.1) can be treated as empirical estimates of the relevant probabilities and conditional covariance matrices. This is actually how (1.1) is obtained; we investigate the details in Section 3. This remark may also be convenient when dealing with large datasets, where the exact computation of (1.1) could be time consuming. In such a case we can approximate the variance components of (1.1) using random samples from the clusters.
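Formula (1.1) can be evaluated directly with a few lines of NumPy. The sketch below is an illustration only, not the authors' code: the names `cov_about_mean` and `d_score` are hypothetical, a partition is represented as a list of index lists, and the within-cluster covariance is taken about the cluster mean $\overline{\mathbf{x}_{I}}$, as in the proofs of Section 2.2.

```python
import numpy as np

def cov_about_mean(points):
    """Biased covariance (1/|I|) * sum (x_i - mean)(x_i - mean)^t of the rows."""
    centered = points - points.mean(axis=0)
    return centered.T @ centered / len(points)

def d_score(x, partition):
    """Score D(x, I) of formula (1.1).

    x         : array of shape (n, d), one observation per row
    partition : list of index lists that together cover range(n)
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    v_all = cov_about_mean(x)              # \hat V_x, covariance of the full sample
    variance_term = 0.0
    entropy_term = 0.0
    for block in partition:
        size = len(block)
        weight = size / n
        # slogdet is used instead of log(det(...)) for numerical stability
        _, logdet = np.linalg.slogdet(v_all / size + cov_about_mean(x[block]))
        variance_term += weight * logdet
        entropy_term += weight * np.log(weight)
    return -0.5 * variance_term + entropy_term

# Usage: two well-separated Gaussian blobs (synthetic data, fixed seed).
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
print(d_score(data, [list(range(50)), list(range(50, 100))]))  # the two blobs
print(d_score(data, [list(range(100))]))                        # one cluster
```

On such data the two-blob partition should score higher than both the single cluster and the partition into singletons; to choose $K$ one would evaluate `d_score` on each candidate clustering and keep the maximiser.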

2 Score functions and the main formula

2.1 Basic definitions

We start our presentation with a formal definition of a score function, intended to measure the quality of the data clustering.

Notation.

For $n\in\mathbb{N}$ let $[n]=\{1,\ldots,n\}$ and let $\Pi_{n}$ be the set of all partitions of $[n]$ .

Let $\mathcal{X}=\mathbb{R}^{d}$ be the observation space. Let $\mathcal{O}=\bigcup_{n=1}^{\infty}\mathcal{X}^{n}\times\Pi_{n}$ be the set of all possible finite sequences of observations and their partitions and let $\overline{\mathbb{R}}=\mathbb{R}\cup\{-\infty,\infty\}$ .

Definition.

A clustering score function is any function $\mathcal{S}\colon\mathcal{O}\to\overline{\mathbb{R}}$ .

Definition.

Let $\mathcal{S}$ be a score function and let $\mathcal{F}$ be a family of functions from $\mathcal{X}$ to $\mathcal{X}$ . We say that $\mathcal{S}$ is robust to $\mathcal{F}$ if for every $\mathbf{x}=(x_{1},\ldots,x_{n})\in\mathcal{X}^{n}$ and $\mathcal{I},\mathcal{J}\in\Pi_{n}$ and every $f\in\mathcal{F}$ we have $\mathcal{S}(\mathbf{x},\mathcal{I})\leq\mathcal{S}(\mathbf{x},\mathcal{J})$ if and only if $\mathcal{S}(f(\mathbf{x}),\mathcal{I})\leq\mathcal{S}(f(\mathbf{x}),\mathcal{J})$ , where $f(\mathbf{x})=\big{(}f(x_{1}),\ldots,f(x_{n})\big{)}$ .

Hence robustness to $\mathcal{F}$ means that if we apply any function $f\in\mathcal{F}$ to all observations, the optimal clustering indicated by the score function does not change. If no prior knowledge about the clustering structure is available, it is natural to require a score function to be robust to linear isomorphisms of $\mathcal{X}$. In particular, it should be robust to scaling of the axes, since it would be strange if the result of applying the score function depended on the units used to measure the observations. For similar reasons, we expect a good score function to be robust to translations.

Note that, on the other hand, robustness to all linear transformations would be undesirable: in particular, mapping all points to the origin is a linear transformation, and we do not expect any clusters to be visible after applying it.

Notation.

Let $\mathcal{A}$ and $\mathcal{B}$ be two partitions of the same set. We say that $\mathcal{A}$ is finer than $\mathcal{B}$ if for every $A\in\mathcal{A}$ there exists $B\in\mathcal{B}$ such that $A\subset B$. Equivalently, we say that $\mathcal{B}$ is coarser than $\mathcal{A}$ and we write $\mathcal{A}\preceq\mathcal{B}$.

Definition.

Let $\mathcal{S}$ be a clustering score function. We say that it is non-decreasing if for every $\mathbf{x}\in\mathcal{X}^{n}$ and $\mathcal{I},\mathcal{J}\in\Pi_{n}$ such that $\mathcal{I}\preceq\mathcal{J}$ we have $\mathcal{S}(\mathbf{x},\mathcal{I})\leq\mathcal{S}(\mathbf{x},\mathcal{J})$. If $-\mathcal{S}$ is non-decreasing then $\mathcal{S}$ is non-increasing.

Clearly, no non-decreasing score function would be good for clustering purposes, as it would assign the highest score to the clustering into one full cluster, regardless of the data. Similarly, a non-increasing function gives the highest score to the partition into singletons. It seems desirable for these two tendencies to interplay, and it is theoretically appealing to find increasing and decreasing parts in a given score function.

2.2 Properties of the $\mathcal{D}$ score function

Notation.

To facilitate the notation in the remaining part of the text we use $|\Sigma|$ to denote the determinant of a square matrix $\Sigma$ .

Definition.

With the notation presented in Section 1.1 we define

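$$\mathcal{D}_{\Sigma}(\mathbf{x},\mathcal{I}):=-\frac{1}{2}\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\Big|\frac{\Sigma}{|I|}+\hat{\mathbf{V}}_{\mathbf{x}}(I)\Big|+\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\frac{|I|}{n} \tag{2.1}$$

and then $\mathcal{D}(\mathbf{x},\mathcal{I})=\mathcal{D}_{\hat{\mathbf{V}}_{\mathbf{x}}}(\mathbf{x},\mathcal{I})$ (which is equivalent to (1.1)). Moreover, we use $\mathcal{D}_{0}$ to denote $\mathcal{D}_{\Sigma}$ with $\Sigma$ being a matrix of zeroes.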

Property 1.

Let $x_{1},\ldots,x_{n}\in\mathcal{X}$ such that $x_{1},\ldots,x_{n}$ span $\mathcal{X}$ . Let $\mathbf{x}=(x_{1},\ldots,x_{n})$ . Then $|\mathcal{D}(\mathbf{x},\mathcal{I})|<\infty$ for any $\mathcal{I}\in\Pi_{n}$ .

Proof.

For any $v\in\mathbb{R}^{d}$

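$$v^{t}\Big(\sum_{i\in I}(x_{i}-\overline{\mathbf{x}_{I}})(x_{i}-\overline{\mathbf{x}_{I}})^{t}\Big)v=\sum_{i\in I}\big(v^{t}(x_{i}-\overline{\mathbf{x}_{I}})\big)^{2}\geq 0$$

and hence $\sum_{i\in I}(x_{i}-\overline{\mathbf{x}_{I}})(x_{i}-\overline{\mathbf{x}_{I}})^{t}$ is non-negative definite. Moreover, it follows from the assumptions that $\hat{\mathbf{V}}_{\mathbf{x}}$ is positive definite. The sum of a non-negative definite matrix and a positive definite matrix is positive definite, so its determinant is positive. Therefore all the summands in (1.1) are finite and the proof follows. ∎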

Property 2.

The score function $\mathcal{D}$ is robust to translations and linear isomorphisms.

Proof.

It is easy to check that for any $\mathbf{x}\in\mathcal{X}^{n}$ , $\mathcal{I}\in\Pi_{n}$ and any translation $T$ we have $\mathcal{D}(\mathbf{x},\mathcal{I})=\mathcal{D}\big{(}T(\mathbf{x}),\mathcal{I}\big{)}$ and hence robustness to translations.

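Let $L\colon\mathcal{X}\to\mathcal{X}$ be a linear automorphism, defined by $L(x)=Ax$, where $A$ is a $d\times d$ invertible matrix. Then

$$\begin{split}\mathcal{D}\big(L(\mathbf{x}),\mathcal{I}\big)&=-\frac{1}{2}\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\Big|\frac{1}{|I|}A\hat{\mathbf{V}}_{\mathbf{x}}A^{t}+\frac{1}{|I|}\sum_{i\in I}A(x_{i}-\overline{\mathbf{x}_{I}})(x_{i}-\overline{\mathbf{x}_{I}})^{t}A^{t}\Big|+\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\frac{|I|}{n}\\ &=-\frac{1}{2}\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\Big|A\Big(\frac{1}{|I|}\hat{\mathbf{V}}_{\mathbf{x}}+\frac{1}{|I|}\sum_{i\in I}(x_{i}-\overline{\mathbf{x}_{I}})(x_{i}-\overline{\mathbf{x}_{I}})^{t}\Big)A^{t}\Big|+\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\frac{|I|}{n}\\ &=-\frac{1}{2}\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\Big(|A|\,\Big|\frac{1}{|I|}\hat{\mathbf{V}}_{\mathbf{x}}+\frac{1}{|I|}\sum_{i\in I}(x_{i}-\overline{\mathbf{x}_{I}})(x_{i}-\overline{\mathbf{x}_{I}})^{t}\Big|\,|A^{t}|\Big)+\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\frac{|I|}{n}\\ &=\mathcal{D}(\mathbf{x},\mathcal{I})-\ln|A|,\end{split}$$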

which clearly implies robustness to linear isomorphisms. ∎
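Property 2 can be checked numerically. The sketch below is a sanity check under stated assumptions rather than part of the proof: `d_score` is a hypothetical helper implementing (1.1) with covariances taken about the cluster means, and `affine_residual` (also a hypothetical name) returns $\mathcal{D}(A\mathbf{x}+b,\mathcal{I})-\mathcal{D}(\mathbf{x},\mathcal{I})+\ln|\det A|$, which should vanish up to rounding error for any invertible $A$ and shift $b$.

```python
import numpy as np

def d_score(x, partition):
    """Score D(x, I) of formula (1.1); partition is a list of index lists."""
    n = len(x)

    def cov(points):
        centered = points - points.mean(axis=0)
        return centered.T @ centered / len(points)

    v_all = cov(x)
    total = 0.0
    for block in partition:
        weight = len(block) / n
        _, logdet = np.linalg.slogdet(v_all / len(block) + cov(x[block]))
        total += -0.5 * weight * logdet + weight * np.log(weight)
    return total

def affine_residual(x, partition, a, shift):
    """D(Ax + b, I) - D(x, I) + ln|det A|; Property 2 predicts zero."""
    transformed = x @ a.T + shift          # applies x_i -> A x_i + b to every row
    log_abs_det = np.log(abs(np.linalg.det(a)))
    return d_score(transformed, partition) - d_score(x, partition) + log_abs_det

# Usage on synthetic data with an arbitrary, well-conditioned matrix A.
rng = np.random.default_rng(1)
x = rng.normal(size=(30, 3))
partition = [list(range(0, 10)), list(range(10, 18)), list(range(18, 30))]
a = rng.normal(size=(3, 3)) + 3 * np.eye(3)
print(affine_residual(x, partition, a, np.array([1.0, -2.0, 0.5])))
```

Because the offset $-\ln|\det A|$ is the same for every partition, the ranking of partitions is unchanged, which is exactly the robustness required by the definition.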

Property 3.

(a) $\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\frac{|I|}{n}$ is increasing;

(b) $-\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\Big|\hat{\mathbf{V}}_{\mathbf{x}}(I)\Big|$ is decreasing;

(c) $-\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\Big|\frac{\Sigma}{|I|}\Big|$ is increasing.

Proof.

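The proofs of parts (a) and (b) follow from 6 by taking the empirical measure instead of $P$. Part (c) follows from (a) because

$$-\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\Big|\frac{\Sigma}{|I|}\Big|=d\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\frac{|I|}{n}+d\ln n-\ln|\Sigma|.$$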

∎
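Parts (a) and (c) of Property 3 depend only on the cluster sizes, so they are easy to illustrate. In the sketch below (hypothetical helper names `neg_entropy` and `sigma_term`), a partition is represented by its list of block sizes, and $\ln|\Sigma|$ is passed in as a number; the code uses $\ln|\Sigma/k|=\ln|\Sigma|-d\ln k$ for a $d\times d$ matrix.

```python
import math

def neg_entropy(sizes):
    """sum_I (|I|/n) ln(|I|/n) for a partition with the given block sizes."""
    n = sum(sizes)
    return sum(s / n * math.log(s / n) for s in sizes)

def sigma_term(sizes, d, log_det_sigma):
    """-sum_I (|I|/n) ln|Sigma/|I|| for a d x d matrix Sigma with ln|Sigma| given."""
    n = sum(sizes)
    return -sum(s / n * (log_det_sigma - d * math.log(s)) for s in sizes)

# Coarsening chain: {3, 2, 5} -> {5, 5} -> {10}; the negative entropy goes up.
for sizes in ([3, 2, 5], [5, 5], [10]):
    print(sizes, neg_entropy(sizes))

# Identity used in the proof of part (c), for d = 2 and ln|Sigma| = 0.7.
sizes, d, log_det = [3, 2, 5], 2, 0.7
lhs = sigma_term(sizes, d, log_det)
rhs = d * neg_entropy(sizes) + d * math.log(sum(sizes)) - log_det
print(lhs, rhs)
```

The two printed values in the last line agree because the right-hand side differs from $d$ times the negative entropy only by terms that do not depend on the partition.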

3 The derivation

In this section we give the theoretical foundations for considering the function $\mathcal{D}$ as clustering score function. We present a general formulation of a Bayesian Mixture Model and then we concentrate on the case where the data within clusters are distributed as Gaussians.

We analyse the asymptotics of the formula for the (unnormalised) posterior in this model. In this way we concentrate on scoring the partitions of the observation space rather than the data themselves. However, it is easy to switch to the score statistic by considering an empirical counterpart of $P$ instead of $P$ ; this yields $\mathcal{D}_{0}$ (cf. (2.1)). The general form of (2.1) is constructed to prevent the function $\mathcal{D}_{0}$ from assigning an infinite score to clusterings with very small clusters (of size less than the dimension of the observation space); on the other hand when the clusters are large enough, then $\mathcal{D}$ approximates $\mathcal{D}_{0}$ .

3.1 Bayesian Mixture Models

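Let $\Theta\subset\mathbb{R}^{p}$ be the parameter space and $\{G_{\theta}\colon\theta\in\Theta\}$ be a family of probability measures on the observation space $\mathbb{R}^{d}$. Consider a prior distribution $\pi$ on $\Theta$. Let $\nu$ be a probability distribution on the $m$-dimensional simplex $\Delta^{m}=\{\bm{p}=(p_{i})_{i=1}^{m}\colon\sum_{i=1}^{m}p_{i}=1\textrm{ and }p_{i}\geq 0\textrm{ for }i\leq m\}$ (where $m\in\mathbb{N}\cup\{\infty\}$). Let

$$\begin{array}{rcl}\bm{p}=(p_{i})_{i=1}^{m}&\sim&\nu\\ \bm{\theta}=(\theta_{i})_{i=1}^{m}&\stackrel{\textrm{iid}}{\sim}&\pi\\ \mathbf{x}=(x_{1},\ldots,x_{n})\,|\,\bm{p},\bm{\theta}&\stackrel{\textrm{iid}}{\sim}&\sum_{i=1}^{m}p_{i}G_{\theta_{i}}.\end{array} \tag{3.1}$$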

This is a Bayesian Mixture Model. If $G_{\theta}$ is a Gaussian distribution for all $\theta\in\Theta$, we say that (3.1) defines a Bayesian Mixture of Gaussians. In this case a convenient choice of the parameter space is $\Theta=\mathbb{R}^{d}\times\mathcal{S}^{+}_{d}$, where $\mathcal{S}^{+}_{d}$ is the space of positive definite $d\times d$ matrices. Then for $\theta=(\mu,\Lambda)$ the distribution $G_{\theta}$ is the multivariate normal distribution $\mathcal{N}(\mu,\Lambda)$. A conjugate prior distribution $\pi$ on $\Theta$ is the Normal-inverse-Wishart distribution, which is given by

$$\begin{array}{rcl}\Lambda&\sim&\mathcal{W}^{-1}(\eta_{0}+d+1,\eta_{0}\Sigma_{0})\\ \mu\,|\,\Lambda&\sim&\mathcal{N}(\mu_{0},\Lambda/\kappa_{0})\end{array} \tag{3.2}$$

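Here $\mathcal{W}^{-1}$ denotes the inverse Wishart distribution and the hyperparameters are $\kappa_{0},\eta_{0}>0$, $\mu_{0}\in\mathbb{R}^{d}$ and $\Sigma_{0}\in\mathcal{S}^{+}_{d}$. This prior is listed in Gelman et al. (2013) with slightly different hyperparameters, but we made this modification to obtain

$$\begin{array}{rl}\mathbb{E}\,\Lambda&=\Sigma_{0},\\ \mathbf{V}(\mu)&=\mathbb{E}\,\mathbf{V}(\mu\,|\,\Lambda)+\mathbf{V}\,\mathbb{E}\,(\mu\,|\,\Lambda)=\mathbb{E}\,\Lambda/\kappa_{0}+\mathbf{V}(\mu_{0})=\Sigma_{0}/\kappa_{0},\end{array}$$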

which gives a nice interpretation of the hyperparameters.

Formula (3.1) can model data clustering; clusters are defined by deciding which $G_{\theta_{i}}$ generated a given data point. In order to formally define the clusters, we need to rewrite (3.1) as

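$$\begin{array}{rcl}\bm{p}=(p_{i})_{i=1}^{m}&\sim&\nu\\ \bm{\theta}=(\theta_{i})_{i=1}^{m}&\stackrel{\textrm{iid}}{\sim}&\pi\\ \bm{\phi}=(\phi_{1},\ldots,\phi_{n})\,|\,\bm{p},\bm{\theta}&\stackrel{\textrm{iid}}{\sim}&\sum_{i=1}^{m}p_{i}\delta_{\theta_{i}}\\ x_{i}\,|\,\bm{p},\bm{\theta},\bm{\phi}&\sim&G_{\phi_{i}}\quad\textrm{independently for all }i\leq n.\end{array} \tag{3.4}$$

Then the clusters are the equivalence classes of the relation $i\sim j\iff\phi_{i}=\phi_{j}$. In this way the distribution $\nu$ on the $m$-dimensional simplex generates a probability distribution $\mathcal{P}_{\nu,n}$ on the partitions of the set $[n]$ into at most $m$ subsets.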

Example 3.1.

Let $V_{1},V_{2},\ldots\stackrel{\textrm{iid}}{\sim}\textnormal{Beta}(1,\alpha)$, $p_{1}=V_{1}$, $p_{k}=V_{k}\prod_{i=1}^{k-1}(1-V_{i})$ for $k>1$. Let $\nu$ be the distribution of $\bm{p}=(p_{1},p_{2},\ldots)$. The probability on the space of partitions of $[n]$ that $\nu$ generates is the Generalized Polya Urn Scheme (Blackwell et al. (1973)), also known as the Chinese Restaurant Process (Aldous (1985)), with the probability weight given by

$$\mathcal{P}_{\nu,n}(\mathcal{I})=\frac{\alpha^{|\mathcal{I}|}}{\alpha^{(n)}}\prod_{I\in\mathcal{I}}(|I|-1)!,$$

where $\alpha^{(n)}=\alpha(\alpha+1)\ldots(\alpha+n-1)$ .
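The Chinese Restaurant Process weight of Example 3.1 depends on a partition only through its block sizes, and it defines a probability distribution on the partitions of $[n]$. The sketch below (all function names are hypothetical) computes the weight and enumerates every partition of a small set so that the weights can be summed.

```python
from math import factorial

def rising_factorial(alpha, n):
    """alpha^{(n)} = alpha (alpha + 1) ... (alpha + n - 1)."""
    result = 1.0
    for j in range(n):
        result *= alpha + j
    return result

def crp_probability(block_sizes, alpha):
    """CRP weight alpha^{|I|} / alpha^{(n)} * prod_I (|I| - 1)! of a partition."""
    n = sum(block_sizes)
    weight = alpha ** len(block_sizes)
    for size in block_sizes:
        weight *= factorial(size - 1)
    return weight / rising_factorial(alpha, n)

def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        # put `first` into each existing block in turn ...
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        # ... or into a new block of its own
        yield [[first]] + smaller

# Usage: the weights of all partitions of {0, ..., 4} add up to one.
alpha = 1.5
total = sum(crp_probability([len(b) for b in p], alpha)
            for p in set_partitions(list(range(5))))
print(total)
```

Exhaustive enumeration is only feasible for small $n$, since the number of partitions grows as the Bell numbers; it is used here purely as a consistency check.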

Lemma 3.2.

Let $\nu$ be a probability distribution on $\Delta^{m}$ that generates a probability $\mathcal{P}_{\nu,n}$ on the partitions of $[n]$ . Then for every partition $\mathcal{I}$ of $[n]$

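$$\mathcal{P}_{\nu,n}(\mathcal{I})=\int_{\Delta^{m}}\sum_{\psi\colon\mathcal{I}\stackrel{1-1}{\to}[m]}\prod_{I\in\mathcal{I}}p_{\psi(I)}^{|I|}\,\mathrm{d}\nu(\bm{p}) \tag{3.6}$$

where the “middle sum” ranges over all injective functions from $\mathcal{I}$ to $[m]$ (with the convention $[\infty]=\mathbb{N}$).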

Proof.

If $|\mathcal{I}|>m$ then both sides of (3.6) are 0. We now assume that $|\mathcal{I}|\leq m$. Let us go back to (3.4) and suppose that the weights $\bm{p}=(p_{i})_{i=1}^{m}$ and the atoms $\bm{\theta}=(\theta_{i})_{i=1}^{m}$ are fixed. We need the probability that $\bm{\phi}=(\phi_{1},\ldots,\phi_{n})\,|\,\bm{p},\bm{\theta}\stackrel{\textrm{iid}}{\sim}\sum_{i=1}^{m}p_{i}\delta_{\theta_{i}}$ induces the partition $\mathcal{I}$. This happens when for every $I\in\mathcal{I}$ all the values $\phi_{i}$ for $i\in I$ are equal to $\theta_{j}$ for some $j\leq m$; let $j=\psi(I)$. The values $\psi(I)$ must be different for different $I\in\mathcal{I}$, otherwise $\mathcal{I}$ would not be generated. The probability of the sequence $(\phi_{1},\ldots,\phi_{n})$ where $\phi_{i}=\theta_{\psi(I)}$ for $i\in I$ is equal to $\prod_{I\in\mathcal{I}}p_{\psi(I)}^{|I|}$. Since any injective assignment of clusters to atoms is valid, for fixed $\bm{p}$ the probability of $\mathcal{I}$ is equal to $\sum_{\psi\colon\mathcal{I}\stackrel{1-1}{\to}[m]}\prod_{I\in\mathcal{I}}p_{\psi(I)}^{|I|}$. Since $\bm{p}\sim\nu$ is random, we have to integrate it out and (3.6) follows. ∎
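The counting argument in this proof can be verified by brute force for a fixed weight vector, i.e. when $\nu$ is a point mass at $\bm{p}$ (an assumption made only to avoid the integral). In the sketch below the function names are hypothetical and atoms are identified with their indices, which presumes that the atoms $\theta_{i}$ are pairwise distinct.

```python
from itertools import permutations, product
from math import prod

def partition_prob_formula(partition, p):
    """Sum over injective maps psi (blocks -> atoms) of prod_I p[psi(I)]^{|I|}."""
    sizes = [len(block) for block in partition]
    # permutations(range(m), k) lists exactly the injective maps from k blocks
    # to m atoms; it is empty when k > m, so the sum is then 0.
    return sum(
        prod(p[atom] ** size for atom, size in zip(atoms, sizes))
        for atoms in permutations(range(len(p)), len(sizes))
    )

def partition_prob_bruteforce(partition, p):
    """Add up the probabilities of all label sequences phi inducing `partition`."""
    n = sum(len(block) for block in partition)
    target = {frozenset(block) for block in partition}
    total = 0.0
    for labels in product(range(len(p)), repeat=n):
        induced = {}
        for i, atom in enumerate(labels):
            induced.setdefault(atom, set()).add(i)
        if {frozenset(block) for block in induced.values()} == target:
            total += prod(p[atom] for atom in labels)
    return total

# Usage: three atoms, four observations, partition {{0, 2}, {1}, {3}}.
weights = [0.5, 0.3, 0.2]
blocks = [[0, 2], [1], [3]]
print(partition_prob_formula(blocks, weights))
print(partition_prob_bruteforce(blocks, weights))
```

The brute-force version visits all $m^{n}$ label sequences and is meant only for tiny examples; the formula needs at most $m!/(m-|\mathcal{I}|)!$ terms.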

Let $\mathcal{P}_{\nu,n}$ be the probability distribution on the space of partitions generated by $\nu$ . We can formulate (3.1) as follows: firstly we generate the partition of observations into clusters, and then for every cluster we sample actual observations from the relevant marginal distribution. Formally, (3.1) is equivalent to

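$$\begin{array}{rcl}\mathcal{I}&\sim&\mathcal{P}_{\nu,n}\\ \mathbf{x}_{I}:=(x_{i})_{i\in I}\,|\,\mathcal{I}&\sim&f_{|I|}\quad\textrm{independently for all }I\in\mathcal{I}\end{array} \tag{3.7}$$

where for $\theta\sim\pi$, $k\in\mathbb{N}$ and $\mathbf{u}=(u_{1},\ldots,u_{k})\,|\,\theta\stackrel{\textrm{iid}}{\sim}G_{\theta}$, $f_{k}$ is the marginal density of $\mathbf{u}$, i.e.

$$f_{k}(u_{1},\ldots,u_{k}):=\int_{\Theta}\pi(\theta)\prod_{i=1}^{k}g_{\theta}(u_{i})\,\mathrm{d}\theta$$

($g_{\theta}$ is the density of $G_{\theta}$). We stress the fact that the independent sampling on the ‘lower’ level of (3.7) relates to the independence between clusters (conditioned on the random partition); within one cluster the observations are (marginally) dependent. To make the notation more concise we define

$$f(\mathbf{x}\,|\,\mathcal{I}):=\prod_{I\in\mathcal{I}}f_{|I|}(\mathbf{x}_{I}).$$

Then (3.7) becomes

$$\begin{array}{rcl}\mathcal{I}&\sim&\mathcal{P}_{\nu,n}\\ \mathbf{x}\,|\,\mathcal{I}&\sim&f(\cdot\,|\,\mathcal{I}).\end{array}$$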

The further analysis requires the exact formula for $f_{k}$ ; in our case it is straightforward to compute since $\pi$ and $G_{\theta}$ are conjugate. We state the result here for the reader’s convenience.

Proposition 1.

Let $\theta=(\mu,\Lambda)$ have the distribution given by (3.2) and let $\mathbf{u}=(u_{1},\ldots,u_{k})\,|\,\theta\stackrel{{\scriptstyle\textrm{iid}}}{{\sim}}\mathcal{N}(\mu,\Lambda)$ . Then the marginal distribution of $\mathbf{u}$ is given by

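$$f_{k}(\mathbf{u})=\frac{|\eta_{0}\Sigma_{0}|^{\nu_{0}/2}\kappa_{0}^{1/2}\Gamma_{d}\big(\frac{\nu_{k}}{2}\big)}{\pi^{dk/2}\kappa_{k}^{1/2}\Gamma_{d}\big(\frac{\nu_{0}}{2}\big)}\cdot\det\left(\Sigma(\mathbf{u})\right)^{-\nu_{k}/2},$$

where $\Gamma_{d}$ is the multivariate Gamma function and

$$\nu_{k}=\eta_{0}+d+1+k,\quad\kappa_{k}=\kappa_{0}+k\quad\textrm{and}$$

$$\Sigma(\mathbf{u})=\eta_{0}\Sigma_{0}+\sum_{i=1}^{k}(u_{i}-\overline{\mathbf{u}})(u_{i}-\overline{\mathbf{u}})^{t}+\frac{\kappa_{0}k}{\kappa_{k}}(\overline{\mathbf{u}}-\mu_{0})(\overline{\mathbf{u}}-\mu_{0})^{t}.$$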

Proof.

The proof follows from Murphy (2007), equation (266). ∎

3.2 The Induced Partition

Throughout this section $P$ is some fixed probability distribution on $\mathbb{R}^{d}$ .

Definition 3.3.

We say that a family $\mathcal{A}$ of $P$ -measurable subsets of $\mathbb{R}^{d}$ is a $P$ -partition if

•

$P\left(\bigcup_{A\in\mathcal{A}}A\right)=1$

•

$P(A_{1}\cap A_{2})=0$ for all $A_{1},A_{2}\in\mathcal{A}$ , $A_{1}\neq A_{2}$ .

Notation.

Let $\mathcal{A}$ be a $P$ -partition of the observation space. Let $X_{1},X_{2},\ldots\stackrel{{\scriptstyle\textrm{iid}}}{{\sim}}P$ and for $n\in\mathbb{N}$ let $\mathcal{I}^{\mathcal{A}}_{n}=\{J^{A}_{n}\colon A\in\mathcal{A}\}$ where $J^{A}_{n}=\{i\leq n\colon X_{i}\in A\}$ (if $J^{A}_{n}=\emptyset$ , we do not include it in $\mathcal{I}^{\mathcal{A}}_{n}$ ). We say that $\mathcal{I}^{\mathcal{A}}_{n}$ is induced by $\mathcal{A}$ .

Proposition 2.

Let $\mathcal{A}$ be a $P$ -partition of the observation space. Then $\mathcal{I}^{\mathcal{A}}_{n}$ is almost surely a partition of $[n]$ .

Proof.

The proof is straightforward and therefore omitted. ∎
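As an illustration of the notation, the induced partition can be computed directly from a sample. This is a minimal sketch in which the sets $A\in\mathcal{A}$ are represented by membership predicates; the three intervals and the sample values are hypothetical.

```python
def induce_partition(sample, cells):
    """Return the blocks J^A_n = {i <= n : X_i in A}, dropping empty ones.

    `cells` holds one membership predicate per set A of the P-partition;
    indices are 1-based to match the text.
    """
    blocks = []
    for contains in cells:
        block = {i for i, x in enumerate(sample, start=1) if contains(x)}
        if block:                      # an empty J^A_n is not included
            blocks.append(block)
    return blocks

def is_partition(blocks, n):
    """Check that the blocks are pairwise disjoint and cover {1,...,n}."""
    covered = sorted(i for block in blocks for i in block)
    return covered == list(range(1, n + 1))

# Hypothetical P-partition of the real line into three intervals.
cells = [lambda x: x < 0, lambda x: 0 <= x < 1, lambda x: x >= 1]
sample = [0.5, -1.2, 2.0, -0.3, 0.9]
blocks = induce_partition(sample, cells)
```

Here the cells are genuinely disjoint and cover the line, so the result is a partition of $[n]$ for every sample; for a general $P$-partition this holds only almost surely.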

Let $E_{P}(A)=\mathbb{E}\,_{P}(X\,|\,X\in A)$ and $\mathbf{V}_{P}(A)=\textnormal{Var}_{P}(X\,|\,X\in A)$ , where $X\sim P$ . That means $E_{P}(A)$ is the conditional expected value and $\mathbf{V}_{P}(A)$ is the conditional covariance matrix of $X$ conditioned on the event $X\in A$ . For a family $\mathcal{A}$ of sets with positive $P$ measure let

[TABLE]

where $|\cdot|$ means determinant. Let

[TABLE]

It turns out that (3.15) is, up to an additive constant, the first-order approximation to the logarithm of the posterior probability, in the Bayesian Mixture Model, of the data clustering defined by $\mathcal{A}$ when the data are an iid sample from $P$.

Proposition 3.

$\sqrt[n]{\mathcal{P}_{\nu,n}(\mathcal{I}^{\mathcal{A}}_{n})\cdot f(X_{1:n}\,|\,\mathcal{I}^{\mathcal{A}}_{n})}\approx n\exp\{\overline{\Delta}_{P}(\mathcal{A})\}$ , where

[TABLE]

Proof.

The result follows from Propositions 4 and 5. ∎

It should be noted that Proposition 3 does not depend on the form of the prior on probability measures. This prior is responsible for the ‘entropy’ part of (3.16).

The final goal is not to score the partitions of the observation space but clusterings of the data. A natural idea is to replace the distribution $P$ in (3.15) by its empirical counterpart. Let $\hat{P}_{n}=\frac{1}{n}\sum_{i\leq n}\delta_{x_{i}}$ be the empirical probability of $\mathbf{x}$ . This is how $\mathcal{D}_{0}$ is obtained.

The function $\mathcal{D}_{0}$ would not be a good score statistic, because if $\mathcal{J}$ contains a cluster $J$ of size less than $d$ then $\sum_{j\in J}(x_{j}-\overline{x_{J}})(x_{j}-\overline{x_{J}})^{t}$ is singular and hence $\hat{\Delta}_{\mathbf{x}}(\mathcal{J})=\infty$. To circumvent this, one could add some positive definite matrix to the within-group covariance matrix; in this way the relevant determinant is always greater than zero. Since we would like to avoid any arbitrary constants in the score function, a natural idea is to use the covariance matrix of the whole dataset, $\hat{\mathbf{V}}_{\mathbf{x}}=\sum_{i\leq n}(x_{i}-\overline{x})(x_{i}-\overline{x})^{t}$. This operation is also motivated by the adaptive model, in which the strength of the prior distribution increases linearly with the number of observations. The details of this approach are given in Section 4. On the other hand, we do not want this modification to affect $\hat{\Delta}_{\mathbf{x}}$ significantly when the clusters are large and the empirical covariance matrices are good estimates of the theoretical ones. Therefore we let the importance of the modification decrease linearly with the cluster size. This gives (1.1), which is a well-defined score statistic.
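To make the construction concrete, here is a one-dimensional sketch of the score (1.1), where the determinant reduces to a scalar. It assumes that $\hat{\mathbf{V}}_{\mathbf{x}}$ and $\hat{\mathbf{V}}_{\mathbf{x}}(I)$ are the empirical variances normalised by $n$ and $|I|$ respectively; the exact normalisation is fixed in Section 2 and is not reproduced here, so this is an assumption of the sketch. The data and the function names are illustrative.

```python
from math import log

def variance(values):
    """Empirical variance with 1/len normalisation (an assumption of this sketch)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def score(data, clusters):
    """One-dimensional version of (1.1); `clusters` is a list of index lists."""
    n = len(data)
    total_var = variance(data)
    value = 0.0
    for idx in clusters:
        size = len(idx)
        within = variance([data[i] for i in idx])
        # regularised within-cluster variance: whole-data variance scaled by 1/|I|
        value += -0.5 * (size / n) * log(total_var / size + within)
        # entropy-type term
        value += (size / n) * log(size / n)
    return value

# Two well-separated synthetic groups of 100 points each.
data = [0.01 * k for k in range(-50, 50)] + [10 + 0.01 * k for k in range(-50, 50)]
one = [list(range(200))]
two = [list(range(100)), list(range(100, 200))]
four = [list(range(0, 50)), list(range(50, 100)),
        list(range(100, 150)), list(range(150, 200))]
scores = {"one": score(data, one), "two": score(data, two), "four": score(data, four)}
```

On this synthetic data the two-cluster partition receives the highest of the three scores. Singleton clusters cause no difficulty, since the regularising term keeps the argument of the logarithm positive.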

3.2.1 Auxiliary propositions

Proposition 4.

Let $P$ be a probability distribution on $\mathbb{R}^{d}$ and let $\mathcal{A}$ be a finite $P$ -partition of the observation space. Then $\lim_{n\to\infty}\sqrt[n]{f(X_{1:n}\,|\,\mathcal{I}^{\mathcal{A}}_{n})}\stackrel{{\scriptstyle\textnormal{a.s.}}}{{=}}\prod_{A\in\mathcal{A}}|\mathbf{V}_{P}(A)|^{P(A)}$

Before we present the proof of Proposition 4, we formulate an auxiliary lemma that concerns the asymptotics of the function $\Gamma_{d}$.

Notation.

If $(a_{n})_{n=1}^{\infty}$ and $(b_{n})_{n=1}^{\infty}$ are real sequences, we write $a_{n}\approx b_{n}$ if $\lim_{n\to\infty}\frac{a_{n}}{b_{n}}=1$ . We write $a_{n}=o(b_{n})$ if $\lim_{n\to\infty}\frac{a_{n}}{b_{n}}=0$ . Similarly, if $a,b\colon\mathbb{R}\to\mathbb{R}$ are real functions, we write $a(x)\approx b(x)$ if $\lim_{x\to\infty}\frac{a(x)}{b(x)}=1$ and $a(x)=o\big{(}b(x)\big{)}$ if $\lim_{x\to\infty}\frac{a(x)}{b(x)}=0$ .

Lemma 3.4.

Let $\alpha,\beta,a,b>0$. If $a_{n}\approx\alpha n^{a}$ and $b_{n}-\beta=o\left(\frac{1}{n^{b}}\right)$ then $a_{n}^{b_{n}}\approx(\alpha n^{a})^{\beta}$.

Proof.

For sufficiently large $n$ we have $1<a_{n}<2\alpha n^{a}$ and $-\frac{1}{n^{b}}<b_{n}-\beta<\frac{1}{n^{b}}$, hence

[TABLE]

The left- and right-hand sides of (3.17) converge to 1, so $\lim_{n\to\infty}a_{n}^{b_{n}-\beta}=1$. The proof follows from $\frac{a_{n}^{b_{n}}}{(\alpha n^{a})^{\beta}}=\left(\frac{a_{n}}{\alpha n^{a}}\right)^{\beta}a_{n}^{b_{n}-\beta}$.

∎
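A quick numerical check of the factorisation used in the last line of the proof: for one admissible pair of sequences, the ratio $a_{n}^{b_{n}}/(\alpha n^{a})^{\beta}$ should approach 1. The particular sequences and constants below are arbitrary choices made for illustration.

```python
def ratio(n, alpha=2.0, a=1.5, beta=0.7, b=1.0):
    """a_n^{b_n} / (alpha * n^a)^beta for one admissible choice of sequences."""
    a_n = alpha * n ** a + n          # a_n / (alpha n^a) -> 1
    b_n = beta + 1.0 / n ** (b + 1)   # n^b (b_n - beta) = 1/n -> 0
    return a_n ** b_n / (alpha * n ** a) ** beta

# Ratios along n = 10, 100, ..., 10^6; they should drift towards 1.
ratios = [ratio(10 ** k) for k in range(1, 7)]
```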

Lemma 3.5.

If $x_{n}\approx\lambda n$ and $x_{n}/n-\lambda=o\big{(}\frac{1}{n^{a}}\big{)}$ for some $a>0$ then $\sqrt[n]{\Gamma_{d}\left({x_{n}}\right)}\approx(\lambda\frac{n}{e})^{\lambda d}$ .

Proof.

Recall Stirling’s formula: $\Gamma(x)\approx\sqrt{2\pi/x}\,(\frac{x}{e})^{x}$. It follows from Lemma 3.4 that

[TABLE]

since $n^{1/n^{a}}\approx 1$ . Note that for fixed $t>0$ we have $(x_{n}-t)\approx\lambda n$ and as a result

[TABLE]

∎
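The statement of Lemma 3.5 can be checked numerically on the logarithmic scale, where it says that $\frac{1}{n}\ln\Gamma_{d}(x_{n})-\lambda d\ln(\lambda n/e)\to 0$. The sketch below takes $x_{n}=\lambda n$ and assumes the standard product formula $\Gamma_{d}(x)=\pi^{d(d-1)/4}\prod_{j=1}^{d}\Gamma\big(x-\tfrac{j-1}{2}\big)$ for the multivariate Gamma function; the values of $\lambda$ and $d$ are arbitrary.

```python
from math import lgamma, log, pi

def log_multigamma(x, d):
    """log Gamma_d(x) via the standard product formula (assumed definition)."""
    return d * (d - 1) / 4 * log(pi) + sum(
        lgamma(x - (j - 1) / 2) for j in range(1, d + 1)
    )

def log_gap(n, lam=0.3, d=3):
    """(1/n) log Gamma_d(lam * n) - lam * d * log(lam * n / e); should tend to 0."""
    x_n = lam * n
    return log_multigamma(x_n, d) / n - lam * d * (log(lam * n) - 1)

# Gaps along n = 10^2, ..., 10^6.
gaps = [log_gap(10 ** k) for k in range(2, 7)]
```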

Proof of Proposition 4.

Note that $|J^{A}_{n}|$ is a random variable with distribution $\textnormal{Bin}(n,P(A))$ for all $A\in\mathcal{A}$. By the law of the iterated logarithm, almost surely $\big{(}|J^{A}_{n}|/n-P(A)\big{)}=o(n^{-1/2+\varepsilon})$ for any $\varepsilon>0$, and hence the assumptions of Lemma 3.5 are almost surely satisfied, so

[TABLE]

Because $\mathcal{A}$ is finite and $\sum_{A\in\mathcal{A}}P(A)=1$ , it means that

[TABLE]

By the strong law of large numbers we have that

[TABLE]

and hence, by (3.13), for $A\in\mathcal{A}$

[TABLE]

Hence $|\Sigma(\mathbf{X}_{J^{A}_{n}})|\stackrel{{\scriptstyle\textnormal{a.s.}}}{{\approx}}|J^{A}_{n}|^{d}|\mathbf{V}_{P}(A)|\stackrel{{\scriptstyle\textnormal{a.s.}}}{{\approx}}n^{d}P(A)^{d}|\mathbf{V}_{P}(A)|$. Using the law of the iterated logarithm and Lemma 3.4 again we get

[TABLE]

which means

[TABLE]

and therefore

[TABLE]

∎

Proposition 5.

Let $P$ be a probability distribution on $\mathbb{R}^{d}$ and let $\mathcal{A}$ be a finite $P$ -partition of the observation space. Let $\mathcal{P}_{\nu,n}$ be a probability distribution on the partitions of $[n]$ , generated by the probability distribution $\nu$ on $\Delta^{\infty}$ . Then $\lim_{n\to\infty}\sqrt[n]{\mathcal{P}_{\nu,n}(\mathcal{I}^{\mathcal{A}}_{n})}\stackrel{{\scriptstyle\textnormal{a.s.}}}{{=}}\prod_{A\in\mathcal{A}}P(A)^{P(A)}$ .

Proof.

The proof is a direct consequence of the Law of Large Numbers and Theorem 3.8. ∎
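For one concrete prior the limit in Proposition 5 can be checked numerically. The sketch below uses the Chinese restaurant process, whose partition probability is $\theta^{K}\prod_{k}(n_{k}-1)!\,/\,\big(\theta(\theta+1)\cdots(\theta+n-1)\big)$; choosing this prior, the concentration $\theta=1$ and the block proportions is an assumption made for illustration only.

```python
from math import lgamma, log

def log_crp_prob(block_sizes, theta=1.0):
    """log-probability of a partition under the Chinese restaurant process."""
    n = sum(block_sizes)
    log_num = len(block_sizes) * log(theta) + sum(lgamma(s) for s in block_sizes)
    log_den = lgamma(theta + n) - lgamma(theta)   # log of the rising factorial
    return log_num - log_den

def log_limit(proportions):
    """log of prod_A P(A)^{P(A)}."""
    return sum(q * log(q) for q in proportions)

sizes = [200_000, 300_000, 500_000]
n = sum(sizes)
per_observation = log_crp_prob(sizes) / n     # log of the n-th root
target = log_limit([s / n for s in sizes])
```

The two numbers agree up to a correction of order $\ln n/n$.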

By definition, $\overline{\Delta}_{P}$ consists of two components: $\mathcal{V}_{P}$ and $\mathcal{H}_{P}$. These two behave differently when two clusters are joined: the variance component does not decrease, whereas the entropy component does not increase.

Proposition 6.

Let $\mathcal{A}$ be a partition of $\mathbb{R}^{d}$ and let $A,B\in\mathcal{A}$ . Let $\mathcal{C}$ be a partition obtained from $\mathcal{A}$ by joining $A$ and $B$ , i.e. $\mathcal{C}=\mathcal{A}\cup\{A\cup B\}\setminus\{A,B\}$ . Then

(a)

$\mathcal{H}_{P}(\mathcal{A})\geq\mathcal{H}_{P}(\mathcal{C})$,

(b)

$\mathcal{V}_{P}(\mathcal{A})\leq\mathcal{V}_{P}(\mathcal{C})$.

Proof.

Let $C=A\sqcup B$ .

Part (a):

[TABLE]

and the proof follows. The last inequality in (3.27) comes from $P(A),P(B)\leq P(C)$ .
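Part (a) is easy to confirm numerically. The sketch below assumes that $\mathcal{H}_{P}(\mathcal{A})$ is the Shannon entropy $-\sum_{A\in\mathcal{A}}P(A)\ln P(A)$ of the cell masses, which is an assumption about the sign convention of the definition; the masses are arbitrary.

```python
from math import log

def entropy(masses):
    """Shannon entropy -sum p log p of the cell masses (assumed form of H_P)."""
    return -sum(p * log(p) for p in masses if p > 0)

def merge(masses, i, j):
    """Masses after joining cells i and j into a single cell."""
    rest = [p for k, p in enumerate(masses) if k not in (i, j)]
    return rest + [masses[i] + masses[j]]

masses = [0.1, 0.2, 0.3, 0.4]
# Entropy lost by joining the cells of mass 0.1 and 0.3.
drop = entropy(masses) - entropy(merge(masses, 0, 2))
```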

Lemma 3.6.

Let $A\cap B=\emptyset,C:=A\cup B$ . Then

[TABLE]

where $\preceq$ is the Löwner partial order, i.e. $M_{1}\preceq M_{2}$ iff $M_{2}-M_{1}$ is non-negative definite.

Proof.

Let $e_{1}(A)=\mathbb{E}\,X{\bf 1}_{A}(X)$ and $e_{2}(A)=\mathbb{E}\,XX^{t}{\bf 1}_{A}(X)$ where $X\sim P$ . Then

[TABLE]

Note that the functions $P,e_{1},e_{2}$ are additive, hence

[TABLE]

The last matrix in (3.30) is clearly non-negative definite and the proof follows. ∎

Theorem 3.7.

(Theorem 2.4.4 in Horn et al. (1990)) The function $\ln\det(\cdot)$ is concave on the space of positive definite matrices.

Proof of part (b):

[TABLE]

and the proof follows. ∎

Theorem 3.8.

Let $\mathcal{P}_{\nu,n}$ be a probability distribution on the partitions of $[n]$ , generated by the probability distribution $\nu$ on $\Delta^{\infty}$ . Fix $K\in\mathbb{N}$ and consider a sequence of partitions $(\mathcal{I}_{n})_{n\in\mathbb{N}}$ , where $\mathcal{I}_{n}=\{I_{n,1},\ldots,I_{n,K}\}$ is a partition of $[n]$ (it is possible that $I_{n,i}=\emptyset$ for some $i\leq K$ ). Assume that $|I_{n,k}|/n\to\alpha_{k}>0$ for $k\leq K$ . Then

[TABLE]

Proof.

Firstly note that for sufficiently large $n$ we have $|I_{n,k}|\geq 1$ for all $k\leq K$. Then in (3.6) we sum functions that depend on exactly $K$ coordinates of $\bm{p}$. Hence we can rewrite it as an integral over the $K$-dimensional set ${\blacktriangle}^{K}=\{(p_{1},\ldots,p_{K})\colon\sum_{k=1}^{K}p_{k}\leq 1,\forall_{k\leq K}\,p_{k}\in(0,1)\}$ as

[TABLE]

where $\nu_{K}$ is a measure on ${\blacktriangle}^{K}$ defined by

[TABLE]

for $A\subset{\blacktriangle}^{K}$ , where $[K]=\{1,2,\ldots,K\}$ . Hence

[TABLE]

where $g_{n}(p_{1},\ldots,p_{K})=\prod_{k=1}^{K}p_{k}^{|I_{n,k}|/n}$ and $\|\cdot\|_{n}$ is the norm of the space $L^{n}({\blacktriangle}^{K},\nu_{K})$.

Since $\nu_{K}$ is not a finite measure on ${\blacktriangle}^{K}$ , in the remaining part of the proof we will have to be careful that the functions we are considering belong to the space $L^{n}({\blacktriangle}^{K},\nu_{K})$ for sufficiently large $n$ .

Let $g(p_{1},\ldots,p_{K})=\prod_{k=1}^{K}p_{k}^{\alpha_{k}}$ and let $h(p_{1},\ldots,p_{K})=\prod_{k=1}^{K}p_{k}$ . Note that

[TABLE]

Moreover, for $n>1/\min_{i}\alpha_{i}$ we have $g^{n}(\bm{p})\leq h(\bm{p})$ and therefore $g\in L^{n}({\blacktriangle}^{K},\nu_{K})$ for $n>1/\min_{i}\alpha_{i}$. Because $g$ is bounded by 1 we get

[TABLE]

(the fact that $\|g\|_{\infty}=\sup_{{\blacktriangle}^{K}}g=\prod_{k\leq K}\alpha_{k}^{\alpha_{k}}$ follows from the method of Lagrange multipliers; see Lemma 3.9).
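The convergence of the $L^{n}$ norm to the supremum can be seen numerically in a simplified setting. The sketch below takes $K=2$ with $p_{2}=1-p_{1}$ and replaces $\nu_{K}$ by Lebesgue measure on $(0,1)$, which is a stand-in chosen only because the integral is then a Beta function; it does not reproduce the measure used in the theorem.

```python
from math import lgamma, log

def log_norm(n, alpha):
    """log of the L^n norm of g(p) = p^alpha (1-p)^(1-alpha) on (0, 1) under
    Lebesgue measure, via the Beta integral B(n*alpha + 1, n*(1-alpha) + 1)."""
    a, b = n * alpha + 1, n * (1 - alpha) + 1
    log_beta = lgamma(a) + lgamma(b) - lgamma(a + b)
    return log_beta / n

def log_sup(alpha):
    """log of sup g = alpha^alpha (1-alpha)^(1-alpha), attained at p = alpha."""
    return alpha * log(alpha) + (1 - alpha) * log(1 - alpha)

alpha = 0.3
# Differences log ||g||_n - log ||g||_inf along n = 10, ..., 10^6.
gaps = [log_norm(10 ** k, alpha) - log_sup(alpha) for k in range(1, 7)]
```

The gaps are negative, because the measure has total mass 1, and they shrink towards 0 as $n$ grows.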

We now prove that $\|g_{n}-g\|_{n}\to 0$ . It is not a straightforward consequence of the pointwise convergence of $g_{n}$ to $g$ since $\nu_{K}$ is not a finite measure on ${\blacktriangle}^{K}$ .

Clearly, $(|I_{n,k}|/n-\alpha_{k}/2)\to\alpha_{k}/2>0$ and hence $\|g_{n}g^{-1/2}-g^{1/2}\|_{\infty}\to 0$ on ${\blacktriangle}^{K}$.

Let $N\in\mathbb{N}$ be chosen so that for $n>N$ we have $\|g_{n}g^{-1/2}-g^{1/2}\|_{\infty}<\varepsilon$ and $n\alpha_{k}\geq 2$ for $k\leq K$ . Then for $n>N$

[TABLE]

hence $\|g_{n}-g\|_{n}\to 0$ . The result follows from the triangle inequality

[TABLE]

∎

Lemma 3.9.

Let $\alpha_{i}>0$ for $i\leq K$ and $\sum_{i=1}^{K}\alpha_{i}=1$ . Let $g(p_{1},\ldots,p_{K})=\prod_{k=1}^{K}p_{k}^{\alpha_{k}}$ . Then $\sup_{{\blacktriangle}^{K}}g=\prod_{k\leq K}\alpha_{k}^{\alpha_{k}}$ .

Proof.

As $\alpha_{i}>0$ for $i\leq K$, the function $g$ is continuous and, because ${\blacktriangle}^{K}$ is compact in $\mathbb{R}^{K}$, it achieves its extreme values. Let $\hat{\bm{p}}=(\hat{p}_{1},\ldots,\hat{p}_{K})\in{\blacktriangle}^{K}$ satisfy $g(\hat{\bm{p}})=\sup_{{\blacktriangle}^{K}}g$. Clearly, $\hat{\bm{p}}\in\Delta^{K}$. Indeed, otherwise $s=\sum_{i=1}^{K}\hat{p}_{i}<1$, $\hat{\bm{p}}/s\in{\blacktriangle}^{K}$ and $g(\hat{\bm{p}}/s)=g(\hat{\bm{p}})/s>g(\hat{\bm{p}})$, which contradicts the definition of $\hat{\bm{p}}$. Since $g$ is nonnegative on $\Delta^{K}$ and equal to 0 on the boundary of $\Delta^{K}$, we know that $\hat{\bm{p}}$ is in the interior of $\Delta^{K}$. The function $g$ is positive on the interior of $\Delta^{K}$, so by considering the function $\ln(g)$ and using Lagrange multipliers, we get that $\hat{\bm{p}}$ satisfies

[TABLE]

for $i\leq K$ and some $\lambda\in\mathbb{R}$. Hence the $\hat{p}_{i}$’s are proportional to the $\alpha_{i}$’s, and because $\sum_{i=1}^{K}\alpha_{i}=1$, we get that $\hat{p}_{i}=\alpha_{i}$ and the proof follows. ∎
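The conclusion of Lemma 3.9 can be sanity-checked by random search over the simplex $\Delta^{K}$: no sampled point should give a larger value of $g$ than $\bm{p}=\bm{\alpha}$. The exponents, the sampling scheme and the seed below are arbitrary choices for this sketch.

```python
import random
from math import prod

def g(p, alpha):
    """g(p) = prod_k p_k^{alpha_k}."""
    return prod(pk ** ak for pk, ak in zip(p, alpha))

def random_simplex_point(k, rng):
    """A random point with positive coordinates summing to 1."""
    weights = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]

alpha = [0.2, 0.3, 0.5]
rng = random.Random(0)
best_sampled = max(g(random_simplex_point(3, rng), alpha) for _ in range(10_000))
claimed_sup = g(alpha, alpha)      # prod_k alpha_k^{alpha_k}
```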

4 Adaptive model

We now allow the parameters of the model (3.2) to change with the number of observations. More precisely, we substitute $\eta_{0}\mapsto\lambda n=:\eta_{n}$, so that the expected value of the within-group precision matrix is fixed and the prior is increasingly concentrated on $\Sigma_{0}$. We investigate the limit formula for the posterior as $n$ goes to infinity. Note that in this case $\Sigma_{|J^{A}_{n}|}/n\to\lambda\Sigma_{0}+\mathbf{V}_{P}(A)$.

[TABLE]

Proposition 7.

Let $P$ be a probability distribution on $\mathbb{R}^{d}$ and let $\mathcal{A}$ be a finite $P$ -partition of the observation space. Then

[TABLE]

Proof.

Note that $|J^{A}_{n}|$ is a random variable with distribution $\textnormal{Bin}(n,P(A))$ for all $A\in\mathcal{A}$. By the law of the iterated logarithm, almost surely $\big{(}|J^{A}_{n}|/n-P(A)\big{)}=o(n^{-1/2+\varepsilon})$ for any $\varepsilon>0$, and hence the assumptions of Lemma 3.5 are almost surely satisfied, so

[TABLE]

Because $\mathcal{A}$ is finite and $\sum_{A\in\mathcal{A}}P(A)=1$ , it means that

[TABLE]

By the strong law of large numbers we have that

[TABLE]

and hence, by (3.13), for $A\in\mathcal{A}$

[TABLE]

Hence $|\Sigma(\mathbf{X}_{J^{A}_{n}})|\stackrel{{\scriptstyle\textnormal{a.s.}}}{{\approx}}n^{d}\big{(}P(A)+\lambda\big{)}^{d}\left|\frac{\lambda}{P(A)+\lambda}\Sigma_{0}+\frac{P(A)}{P(A)+\lambda}\mathbf{V}_{P}(A)\right|$. Using the law of the iterated logarithm and Lemma 3.4 again we get

[TABLE]

and (4.2) follows. ∎

5 Discussion

In this article we proposed a score function that can be used for choosing the number of clusters in popular clustering methods. It is derived as a limit in a Bayesian Mixture Model of Gaussians. We derived some of its properties, though there are some questions that remain unanswered. For example, it is interesting to ask what assumptions on $P$ should be made to ensure that the supremum of possible values of the $\overline{\Delta}$ function is finite.

Bibliography


1. Aldous [1985] David J. Aldous. Exchangeability and related topics. In École d’Été de Probabilités de Saint-Flour XIII–1983, pages 1–198. Springer, 1985.
2. Blackwell et al. [1973] David Blackwell and James B. MacQueen. Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1(2):353–355, 1973.
3. Gelman et al. [2013] Andrew Gelman, Hal S. Stern, John B. Carlin, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.
4. Horn et al. [1990] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1990.
5. Murphy [2007] Kevin P. Murphy. Conjugate Bayesian analysis of the Gaussian distribution. Technical report, 2007.
6. Rousseeuw [1987] Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
7. Tibshirani et al. [2001] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423, 2001.
