Classification with the matrix-variate-$t$ distribution

Geoffrey Z. Thompson; Ranjan Maitra; William Q. Meeker and; Ashraf Bastawros

arXiv:1907.09565·stat.ME·December 24, 2019

Classification with the matrix-variate-$t$ distribution

Geoffrey Z. Thompson, Ranjan Maitra, William Q. Meeker and, Ashraf Bastawros

PDF

TL;DR

This paper introduces an EM algorithm for classification using matrix-variate-$t$ distributions, effectively modeling dependencies in matrix-valued data for applications like imaging and forensic analysis.

Contribution

It develops a novel EM-based discriminant analysis method specifically for matrix-variate-$t$ distributions, addressing dependence structures in complex data.

Findings

01

Effective on simulated datasets

02

Useful for forensic surface matching

03

Applicable to various imaging modalities

Abstract

Matrix-variate distributions can intuitively model the dependence structure of matrix-valued observations that arise in applications with multivariate time series, spatio-temporal or repeated measures. This paper develops an Expectation-Maximization algorithm for discriminant analysis and classification with matrix-variate $t$ -distributions. The methodology shows promise on simulated datasets or when applied to the forensic matching of fractured surfaces or the classification of functional Magnetic Resonance, satellite or hand gestures images.

Equations57

f (X; M, Σ, Ω)

f (X; M, Σ, Ω)

f (X; ν, M, Σ, Ω) = \frac{Γ _{p} ( \frac{ν + p + q - 1}{2} )}{( π ) ^{\frac{pq}{2}} Γ _{p} ( \frac{ν + p - 1}{2} )} ∣ Ω ∣^{- \frac{p}{2}} ∣ Σ ∣^{- \frac{q}{2}} I_{p} + Σ^{- 1} (X - M) Ω^{- 1} (X - M)^{T}^{- \frac{ν + p + q - 1}{2}} .

f (X; ν, M, Σ, Ω) = \frac{Γ _{p} ( \frac{ν + p + q - 1}{2} )}{( π ) ^{\frac{pq}{2}} Γ _{p} ( \frac{ν + p - 1}{2} )} ∣ Ω ∣^{- \frac{p}{2}} ∣ Σ ∣^{- \frac{q}{2}} I_{p} + Σ^{- 1} (X - M) Ω^{- 1} (X - M)^{T}^{- \frac{ν + p + q - 1}{2}} .

\begin{split}\boldsymbol{S}_{i}^{(t+1)}\doteq\mathbb{E}_{\Theta^{(t)}}(\boldsymbol{S}_{i}|\boldsymbol{X}_{i})&=(\nu^{(t)}+p+q-1)[(\boldsymbol{X}_{i}-\mathbf{M}^{(t)})\boldsymbol{\Omega}^{(t)^{-1}}(\boldsymbol{X}_{i}-\mathbf{M}^{(t)})^{T}+\boldsymbol{\Sigma}^{(t)}]^{-1}\\ \mathbb{E}_{\Theta^{(t)}}\left(\log|\boldsymbol{S}_{i}|\big{|}\boldsymbol{X}_{i}\right)&=\psi_{p}\left({\frac{\nu^{(t)}+p+q-1}{2}}\right)+p\log 2+\log\left|\frac{\boldsymbol{S}_{i}^{(t+1)}}{\nu^{(t)}+p+q-1}\right|\end{split}

\begin{split}\boldsymbol{S}_{i}^{(t+1)}\doteq\mathbb{E}_{\Theta^{(t)}}(\boldsymbol{S}_{i}|\boldsymbol{X}_{i})&=(\nu^{(t)}+p+q-1)[(\boldsymbol{X}_{i}-\mathbf{M}^{(t)})\boldsymbol{\Omega}^{(t)^{-1}}(\boldsymbol{X}_{i}-\mathbf{M}^{(t)})^{T}+\boldsymbol{\Sigma}^{(t)}]^{-1}\\ \mathbb{E}_{\Theta^{(t)}}\left(\log|\boldsymbol{S}_{i}|\big{|}\boldsymbol{X}_{i}\right)&=\psi_{p}\left({\frac{\nu^{(t)}+p+q-1}{2}}\right)+p\log 2+\log\left|\frac{\boldsymbol{S}_{i}^{(t+1)}}{\nu^{(t)}+p+q-1}\right|\end{split}

\begin{split}\boldsymbol{S}_{S}^{(t+1)}&\doteq\sum_{i=1}^{n}\boldsymbol{S}_{i}^{(t+1)},\\ \boldsymbol{S}_{SX}^{(t+1)}&\doteq\sum_{i=1}^{n}\mathbb{E}_{\Theta^{(t)}}(\boldsymbol{S}_{i}\boldsymbol{X}_{i}|\boldsymbol{X}_{i})=\sum_{i=1}^{n}\boldsymbol{S}_{i}^{(t+1)}\boldsymbol{X}_{i},\\ \boldsymbol{S}_{XSX}^{(t+1)}&\doteq\sum_{i=1}^{n}\mathbb{E}_{\Theta^{(t)}}(\boldsymbol{X}_{i}^{T}\boldsymbol{S}_{i}\boldsymbol{X}_{i}|\boldsymbol{X}_{i})=\sum_{i=1}^{n}\boldsymbol{X}_{i}^{T}\boldsymbol{S}_{i}^{(t+1)}\boldsymbol{X}_{i},\\ \boldsymbol{S}_{|S|}^{(t+1)}&\doteq\mathbb{E}_{\Theta^{(t)}}\left[\sum_{i=1}^{n}\log|\boldsymbol{S}_{i}|\bigg{|}\boldsymbol{X}_{i}\right].\end{split}

\begin{split}\boldsymbol{S}_{S}^{(t+1)}&\doteq\sum_{i=1}^{n}\boldsymbol{S}_{i}^{(t+1)},\\ \boldsymbol{S}_{SX}^{(t+1)}&\doteq\sum_{i=1}^{n}\mathbb{E}_{\Theta^{(t)}}(\boldsymbol{S}_{i}\boldsymbol{X}_{i}|\boldsymbol{X}_{i})=\sum_{i=1}^{n}\boldsymbol{S}_{i}^{(t+1)}\boldsymbol{X}_{i},\\ \boldsymbol{S}_{XSX}^{(t+1)}&\doteq\sum_{i=1}^{n}\mathbb{E}_{\Theta^{(t)}}(\boldsymbol{X}_{i}^{T}\boldsymbol{S}_{i}\boldsymbol{X}_{i}|\boldsymbol{X}_{i})=\sum_{i=1}^{n}\boldsymbol{X}_{i}^{T}\boldsymbol{S}_{i}^{(t+1)}\boldsymbol{X}_{i},\\ \boldsymbol{S}_{|S|}^{(t+1)}&\doteq\mathbb{E}_{\Theta^{(t)}}\left[\sum_{i=1}^{n}\log|\boldsymbol{S}_{i}|\bigg{|}\boldsymbol{X}_{i}\right].\end{split}

M^{(t + 1)}

M^{(t + 1)}

Ω^{(t + 1)}

Σ_{(t + 1)}^{- 1}

n \frac{d}{d ν} lo g Γ_{p} ((ν + p - 1) /2) - \frac{1}{2} (S_{∣ S ∣} - n p lo g 2 + n lo g ∣ Σ ∣) = 0.

n \frac{d}{d ν} lo g Γ_{p} ((ν + p - 1) /2) - \frac{1}{2} (S_{∣ S ∣} - n p lo g 2 + n lo g ∣ Σ ∣) = 0.

0 = n ψ_{p} ((ν + p - 1) /2) - {n ψ_{p} (\frac{κ}{2}) + i = 1 \sum n lo g \frac{S _{i}^{(t + 1)}}{κ} - n lo g \frac{S _{S}^{(t + 1)}}{n ( ν + p - 1 )}} = ψ_{p} ((ν + p - 1) /2) - {ψ_{p} (\frac{κ}{2}) + \frac{1}{n} i = 1 \sum n lo g Z_{i}^{(t + 1)} + p lo g \frac{n ( ν + p - 1 )}{κ} - lo g Z_{S}^{(t + 1)}},

0 = n ψ_{p} ((ν + p - 1) /2) - {n ψ_{p} (\frac{κ}{2}) + i = 1 \sum n lo g \frac{S _{i}^{(t + 1)}}{κ} - n lo g \frac{S _{S}^{(t + 1)}}{n ( ν + p - 1 )}} = ψ_{p} ((ν + p - 1) /2) - {ψ_{p} (\frac{κ}{2}) + \frac{1}{n} i = 1 \sum n lo g Z_{i}^{(t + 1)} + p lo g \frac{n ( ν + p - 1 )}{κ} - lo g Z_{S}^{(t + 1)}},

R_{i} (X) = trace {- \frac{1}{2} (Ω_{i}^{- 1} X^{T} Σ_{i}^{- 1} X) + Ω_{i}^{- 1} M_{i}^{T} Σ_{i}^{- 1} X - \frac{1}{2} Ω_{i}^{- 1} M_{i}^{T} Σ_{i}^{- 1} M_{i}} - \frac{1}{2} (p lo g ∣ Σ_{i} ∣ + q lo g ∣ Ω_{i} ∣) .

R_{i} (X) = trace {- \frac{1}{2} (Ω_{i}^{- 1} X^{T} Σ_{i}^{- 1} X) + Ω_{i}^{- 1} M_{i}^{T} Σ_{i}^{- 1} X - \frac{1}{2} Ω_{i}^{- 1} M_{i}^{T} Σ_{i}^{- 1} M_{i}} - \frac{1}{2} (p lo g ∣ Σ_{i} ∣ + q lo g ∣ Ω_{i} ∣) .

X_{i} ∣ M, Σ, Ω, ν, S_{i} S_{i} ∣ M, Ω, Σ, ν \sim N_{p, q} (M, S_{i}^{- 1}, Ω) \sim W_{p} (ν + p - 1, Σ^{- 1}), for i = 1, 2, \dots, n .

X_{i} ∣ M, Σ, Ω, ν, S_{i} S_{i} ∣ M, Ω, Σ, ν \sim N_{p, q} (M, S_{i}^{- 1}, Ω) \sim W_{p} (ν + p - 1, Σ^{- 1}), for i = 1, 2, \dots, n .

ℓ_{c} (M, Σ, Ω, ν; X, S)

ℓ_{c} (M, Σ, Ω, ν; X, S)

ℓ_{N} (M, S^{- 1}, Ω; X ∣ S) = - \frac{n p}{2} lo g ∣ Ω ∣ + \frac{q}{2} i = 1 \sum n lo g ∣ S_{i} ∣ - \frac{1}{2} tr [i = 1 \sum n S_{i} X_{i} Ω^{- 1} X_{i}^{T} + (i = 1 \sum n S_{i}) M Ω^{- 1} M^{T} - 2 (i = 1 \sum n S_{i} X_{i}) Ω^{- 1} M^{T}]

ℓ_{N} (M, S^{- 1}, Ω; X ∣ S) = - \frac{n p}{2} lo g ∣ Ω ∣ + \frac{q}{2} i = 1 \sum n lo g ∣ S_{i} ∣ - \frac{1}{2} tr [i = 1 \sum n S_{i} X_{i} Ω^{- 1} X_{i}^{T} + (i = 1 \sum n S_{i}) M Ω^{- 1} M^{T} - 2 (i = 1 \sum n S_{i} X_{i}) Ω^{- 1} M^{T}]

ℓ_{W} (ν, Σ; S) = (ν - 2) /2 i = 1 \sum n lo g ∣ S_{i} ∣ - i = 1 \sum n tr (Σ S_{i}) /2 - n ν p /2 lo g 2 + n (ν + p - 1) /2 lo g ∣ Σ ∣ - n lo g Γ_{p} ((ν + p - 1) /2) .

ℓ_{W} (ν, Σ; S) = (ν - 2) /2 i = 1 \sum n lo g ∣ S_{i} ∣ - i = 1 \sum n tr (Σ S_{i}) /2 - n ν p /2 lo g 2 + n (ν + p - 1) /2 lo g ∣ Σ ∣ - n lo g Γ_{p} ((ν + p - 1) /2) .

S_{S X} = i = 1 \sum n S_{i} X_{i}; S_{S} = i = 1 \sum n S_{i}; S_{X S X} = i = 1 \sum n X_{i}^{T} S_{i} X_{i} S_{∣ S ∣} = i = 1 \sum n lo g ∣ S_{i} ∣.

S_{S X} = i = 1 \sum n S_{i} X_{i}; S_{S} = i = 1 \sum n S_{i}; S_{X S X} = i = 1 \sum n X_{i}^{T} S_{i} X_{i} S_{∣ S ∣} = i = 1 \sum n lo g ∣ S_{i} ∣.

M

M

Ω

Σ^{- 1}

n ψ_{p} ((ν + p - 1) /2) - (S_{∣ S ∣} - n p lo g 2 + n lo g ∣ Σ ∣) = 0

n ψ_{p} ((ν + p - 1) /2) - (S_{∣ S ∣} - n p lo g 2 + n lo g ∣ Σ ∣) = 0

\begin{split}Q(\Theta;\Theta^{(t)})&=-\frac{np}{2}\log|\boldsymbol{\Omega}|-\frac{n\nu p\log 2}{2}-n\log\Gamma_{p}\left(\frac{\nu+p-1}{2}\right)+n\frac{\nu+p-1}{2}\log|\boldsymbol{\Sigma}|\\ &\quad+\mathbb{E}_{\Theta^{(t)}}\left\{\left[-\frac{1}{2}\mathrm{tr}\left(\sum_{i=1}^{n}{\boldsymbol{S}_{i}\boldsymbol{X}_{i}\boldsymbol{\Omega}^{-1}\boldsymbol{X}_{i}^{T}}+\sum_{i=1}^{n}{\boldsymbol{S}_{i}}\mathbf{M}\boldsymbol{\Omega}^{-1}\mathbf{M}^{T}-2\sum_{i=1}^{n}{\boldsymbol{S}_{i}\boldsymbol{X}_{i}}\boldsymbol{\Omega}^{-1}\mathbf{M}^{T}\right)\right.\right.\\ &\qquad\qquad\left.\left.+\frac{\nu-2}{2}\sum_{i=1}^{n}\log|\boldsymbol{S}_{i}|-\frac{1}{2}\sum_{i=1}^{n}\mathrm{tr}(\boldsymbol{\Sigma}\boldsymbol{S}_{i})+\frac{q}{2}\sum_{i=1}^{n}\log|\boldsymbol{S}_{i}|\bigg{|}\boldsymbol{X}_{1},\boldsymbol{X}_{2},\ldots,\boldsymbol{X}_{n}\right]\right\}.\end{split}

\begin{split}Q(\Theta;\Theta^{(t)})&=-\frac{np}{2}\log|\boldsymbol{\Omega}|-\frac{n\nu p\log 2}{2}-n\log\Gamma_{p}\left(\frac{\nu+p-1}{2}\right)+n\frac{\nu+p-1}{2}\log|\boldsymbol{\Sigma}|\\ &\quad+\mathbb{E}_{\Theta^{(t)}}\left\{\left[-\frac{1}{2}\mathrm{tr}\left(\sum_{i=1}^{n}{\boldsymbol{S}_{i}\boldsymbol{X}_{i}\boldsymbol{\Omega}^{-1}\boldsymbol{X}_{i}^{T}}+\sum_{i=1}^{n}{\boldsymbol{S}_{i}}\mathbf{M}\boldsymbol{\Omega}^{-1}\mathbf{M}^{T}-2\sum_{i=1}^{n}{\boldsymbol{S}_{i}\boldsymbol{X}_{i}}\boldsymbol{\Omega}^{-1}\mathbf{M}^{T}\right)\right.\right.\\ &\qquad\qquad\left.\left.+\frac{\nu-2}{2}\sum_{i=1}^{n}\log|\boldsymbol{S}_{i}|-\frac{1}{2}\sum_{i=1}^{n}\mathrm{tr}(\boldsymbol{\Sigma}\boldsymbol{S}_{i})+\frac{q}{2}\sum_{i=1}^{n}\log|\boldsymbol{S}_{i}|\bigg{|}\boldsymbol{X}_{1},\boldsymbol{X}_{2},\ldots,\boldsymbol{X}_{n}\right]\right\}.\end{split}

\begin{split}\boldsymbol{S}_{i}^{(t+1)}\doteq\mathbb{E}_{\Theta^{(t)}}(\boldsymbol{S}_{i}|\boldsymbol{X}_{i})&=(\nu^{(t)}+p+q-1)[(\boldsymbol{X}_{i}-\mathbf{M}^{(t)})\boldsymbol{\Omega}^{(t)^{-1}}(\boldsymbol{X}_{i}-\mathbf{M}^{(t)})^{T}+\boldsymbol{\Sigma}^{(t)}]^{-1},\\ \mathbb{E}_{\Theta^{(t)}}\left(\log|\boldsymbol{S}_{i}|\big{|}\boldsymbol{X}_{i}\right)&=\psi_{p}\left({\frac{\nu^{(t)}+p+q-1}{2}}\right)+p\log 2+\log\left|\frac{\boldsymbol{S}_{i}^{(t+1)}}{\nu^{(t)}+p+q-1}\right|,\end{split}

\begin{split}\boldsymbol{S}_{i}^{(t+1)}\doteq\mathbb{E}_{\Theta^{(t)}}(\boldsymbol{S}_{i}|\boldsymbol{X}_{i})&=(\nu^{(t)}+p+q-1)[(\boldsymbol{X}_{i}-\mathbf{M}^{(t)})\boldsymbol{\Omega}^{(t)^{-1}}(\boldsymbol{X}_{i}-\mathbf{M}^{(t)})^{T}+\boldsymbol{\Sigma}^{(t)}]^{-1},\\ \mathbb{E}_{\Theta^{(t)}}\left(\log|\boldsymbol{S}_{i}|\big{|}\boldsymbol{X}_{i}\right)&=\psi_{p}\left({\frac{\nu^{(t)}+p+q-1}{2}}\right)+p\log 2+\log\left|\frac{\boldsymbol{S}_{i}^{(t+1)}}{\nu^{(t)}+p+q-1}\right|,\end{split}

\begin{split}\boldsymbol{S}_{S}^{(t+1)}&\doteq\sum_{i=1}^{n}\boldsymbol{S}_{i}^{(t+1)},\\ \boldsymbol{S}_{SX}^{(t+1)}&\doteq\sum_{i=1}^{n}\mathbb{E}_{\Theta^{(t)}}(\boldsymbol{S}_{i}\boldsymbol{X}_{i}|\boldsymbol{X}_{i})=\sum_{i=1}^{n}\boldsymbol{S}_{i}^{(t+1)}\boldsymbol{X}_{i},\\ \boldsymbol{S}_{XSX}^{(t+1)}&\doteq\sum_{i=1}^{n}\mathbb{E}_{\Theta^{(t)}}(\boldsymbol{X}_{i}^{T}\boldsymbol{S}_{i}\boldsymbol{X}_{i}|\boldsymbol{X}_{i})=\sum_{i=1}^{n}\boldsymbol{X}_{i}^{T}\boldsymbol{S}_{i}^{(t+1)}\boldsymbol{X}_{i},\\ \boldsymbol{S}_{|S|}^{(t+1)}&\doteq\mathbb{E}_{\Theta^{(t)}}\left[\sum_{i=1}^{n}\log|\boldsymbol{S}_{i}|\bigg{|}\boldsymbol{X}_{i}\right],\end{split}

\begin{split}\boldsymbol{S}_{S}^{(t+1)}&\doteq\sum_{i=1}^{n}\boldsymbol{S}_{i}^{(t+1)},\\ \boldsymbol{S}_{SX}^{(t+1)}&\doteq\sum_{i=1}^{n}\mathbb{E}_{\Theta^{(t)}}(\boldsymbol{S}_{i}\boldsymbol{X}_{i}|\boldsymbol{X}_{i})=\sum_{i=1}^{n}\boldsymbol{S}_{i}^{(t+1)}\boldsymbol{X}_{i},\\ \boldsymbol{S}_{XSX}^{(t+1)}&\doteq\sum_{i=1}^{n}\mathbb{E}_{\Theta^{(t)}}(\boldsymbol{X}_{i}^{T}\boldsymbol{S}_{i}\boldsymbol{X}_{i}|\boldsymbol{X}_{i})=\sum_{i=1}^{n}\boldsymbol{X}_{i}^{T}\boldsymbol{S}_{i}^{(t+1)}\boldsymbol{X}_{i},\\ \boldsymbol{S}_{|S|}^{(t+1)}&\doteq\mathbb{E}_{\Theta^{(t)}}\left[\sum_{i=1}^{n}\log|\boldsymbol{S}_{i}|\bigg{|}\boldsymbol{X}_{i}\right],\end{split}

M

M

Ω

= \frac{1}{n p} (S_{X S X}^{(t + 1)} - S_{S X}^{(t + 1)}^{T} S_{S}^{(t + 1)}^{- 1} S_{S X}^{(t + 1)}),

Σ^{- 1}

n \frac{d}{d ν} lo g Γ_{p} ((ν + p - 1) /2) - \frac{1}{2} (S_{∣ S ∣} - n p lo g 2 + n lo g ∣ Σ ∣) = 0.

n \frac{d}{d ν} lo g Γ_{p} ((ν + p - 1) /2) - \frac{1}{2} (S_{∣ S ∣} - n p lo g 2 + n lo g ∣ Σ ∣) = 0.

0

0

= ψ_{p} ((ν + p - 1) /2) - (ψ_{p} (\frac{κ ^{(t)}}{2}) + \frac{1}{n} i = 1 \sum n lo g Z_{i}^{(t + 1)} + p lo g \frac{n ( ν ^{(t)} + p - 1 )}{κ ^{(t)}} - lo g Z_{S}^{(t + 1)})

0

0

= ψ_{p} ((ν + p - 1) /2) - {ψ_{p} (\frac{κ}{2}) + \frac{1}{n} i = 1 \sum n lo g Z_{i}^{(t + 1)} + p lo g \frac{n ( ν + p - 1 )}{κ} - lo g Z_{S}^{(t + 1)}} .

M

M

M

M

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Classification with the matrix-variate- $t$ distribution

Geoffrey Z. Thompson, Ranjan Maitra, William Q. Meeker and Ashraf Bastawros The authors are with Iowa State University. Email: {gzt,maitra,wqmeeker,bastaw}@iastate.edu.This research was supported in part by the National Institute of Justice (NIJ) under Grants No. 2015-DN-BX-K056 and 2018-R2-CX-0034. The research of the second author was also supported in part by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health (NIH) under Grant R21EB016212, and the United States Department of Agriculture (USDA) National Institute of Food and Agriculture (NIFA) Hatch project IOW03617. The content of this paper is however solely the responsibility of the authors and does not represent the official views of the NIJ, the NIBIB, the NIH, the NIFA or the USDA.

Abstract

Matrix-variate distributions can intuitively model the dependence structure of matrix-valued observations that arise in applications with multivariate time series, spatio-temporal or repeated measures. This paper develops an Expectation-Maximization algorithm for discriminant analysis and classification with matrix-variate $t$ -distributions. The methodology shows promise on simulated datasets or when applied to the forensic matching of fractured surfaces or the classification of functional Magnetic Resonance, satellite or hand gestures images.

Index Terms:

BIC, ECME, fMRI, fracture mechanics, LANDSAT, supervised learning

I Introduction

Matrix-variate distributions [1] can conveniently model matrix-valued observations that arise, for instance, with multivariate time series or spatial datasets or when we observe $p$ -variate responses at $q$ different settings, yielding a $p\times q$ matrix of responses for each unit of observation. The matrix-variate normal distribution (abbreviated in this paper by MxVN) is also helpful for inference, but is sometimes inadequate for modeling populations where the matrix-variate- $t$ distribution (henceforth MxV $t$ ) may be a better fit.

There exist discriminant analysis and classification methods for MxVN [2, 3] mixtures but for many applications, the MxV $t$ distribution may model each group better. However, parameter estimation for the MxV $t$ distribution requires special care because, unlike in the normal case, it can not be viewed as simply a rearrangement of its vector-multivariate cousin [4] for which several variants of the Expectation-Maximization (EM) algorithm exist [5, 6, 7].

This paper develops, in Section II, methodology for parameter estimation in the MxV $t$ distribution and extends it to discriminant analysis and classification using MxV $t$ mixtures. Our methods are evaluated on simulated and real-life datasets in Section III. This paper concludes with some discussion. An online supplement explicitly detailing the derivations of our algorithm, with sections referenced using the prefix “S-”, and an R [8] package MixMatrix [9] that implements the methodology are also included.

II Methodology

II-A Background and Preliminary Development

II-A1 The Matrix-variate Normal Distribution

Definition 1.

A random matrix $\boldsymbol{X}$ of $p$ rows and $q$ columns has the MxVN distribution with parameters $\mathbf{M},\boldsymbol{\Sigma}$ and $\boldsymbol{\Omega}$ if it has the probability density function (PDF)

[TABLE]

where $|\cdot|$ denotes the determinant, $\mathbf{M}$ is a $p\times q$ matrix that is the mean of $\boldsymbol{X}$ , and $\boldsymbol{\Sigma}$ and $\boldsymbol{\Omega}>0$ describing the covariances between, respectively, each of the $p$ rows and the $q$ columns of $\boldsymbol{X}$ . We write $\boldsymbol{X}\sim\mathcal{N}_{p,q}(\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ . For identifiability, we set the first element of $\boldsymbol{\Sigma}$ to be unity.

The MxVN distribution can be considered, after rearranging into a vector (denoted by vec( $\boldsymbol{X}$ )), to be from a multivariate normal (MVN) distribution with a Kronecker product covariance structure [1]. So, if $\boldsymbol{X}\sim\mathcal{N}_{p,q}(\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ , then $\mathrm{vec}(\boldsymbol{X})\sim\mathcal{N}_{pq}(\mathrm{vec}(\mathbf{M}),\boldsymbol{\Omega}\otimes\boldsymbol{\Sigma}).$ This reformulation allows us to readily obtain the maximum likelihood (ML) estimates. If $\boldsymbol{X}_{i}$ , $i=1,2,\ldots,n$ are independent identically distributed (IID) random matrices from the $\mathcal{N}_{p,q}(\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ , then $\mathbf{M},\boldsymbol{\Sigma},$ and $\boldsymbol{\Omega}$ have ML estimators $\widehat{\mathbf{M}}={n}^{-1}\sum_{i=1}^{n}\boldsymbol{X}_{i},\widehat{\boldsymbol{\Sigma}}=(np)^{-1}\sum_{i=1}^{n}(\boldsymbol{X}-\widehat{\mathbf{M}})\widehat{\boldsymbol{\Omega}}^{-1}(\boldsymbol{X}-\widehat{\mathbf{M}})^{T}$ , and $\widehat{\boldsymbol{\Omega}}=(nq)^{-1}\sum_{i=1}^{n}(\boldsymbol{X}-\widehat{\mathbf{M}})^{T}\widehat{\boldsymbol{\Sigma}}^{-1}(\boldsymbol{X}-\widehat{\mathbf{M}})$ under the constraint for identifiability that the $(1,1)$ th element of $\widehat{\boldsymbol{\Sigma}}$ is set to unity. The matrices $\widehat{\boldsymbol{\Sigma}}$ and $\widehat{\boldsymbol{\Omega}}$ are obtained iteratively after initialization with any positive definite matrices, by default using the identity matrix $\boldsymbol{I}$ . The ML estimates exist and are unique almost surely if $n>p/q+q/p+2$ [10].

II-A2 The Matrix-variate $t$ -distribution

Definition 2.

A random $p\times q$ matrix $\boldsymbol{X}$ has a MxV $t$ distribution with parameters $(\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega}$ ) of similar order as in Definition 1 (with $\boldsymbol{\Sigma}$ and $\boldsymbol{\Omega}>0$ ) and degrees of freedom (df) $\nu\geq 1$ if its PDF is

[TABLE]

We use the notation $\boldsymbol{X}\sim t_{p,q}(\nu,\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ to indicate that $\boldsymbol{X}$ has this density.

Properties:

We mention some properties of the MxVt distribution relevant to this paper.

For $p=1$ and $\boldsymbol{\Sigma}\equiv\nu$ (or $q=1$ and $\boldsymbol{\Omega}\equiv\nu$ ), the MxV $t$ distribution reduces to its vector-multivariate $t$ (MVT) cousin. However, this reduction does not generally hold so additional development is needed for inference. We provide methods to do so in the next section. 2. 2.

Let the random matrix $\boldsymbol{S}\sim{\mathcal{W}}_{p}(\nu+p-1,\boldsymbol{\Sigma}^{-1})$ , where ${\mathcal{W}}_{p}(\kappa,\boldsymbol{\Psi})$ is the $p\times p$ -dimensional Wishart distribution with d.f. $\kappa$ and scale matrix $\boldsymbol{\Psi}$ . If $\boldsymbol{X}\mid\boldsymbol{S}\sim{\mathcal{N}}_{p,q}(\mathbf{M},\boldsymbol{S}^{-1},\boldsymbol{\Omega})$ , then $\boldsymbol{X}\sim t_{p,q}(\nu,\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ [[, see]p. 135]gupta1999matrix. Further, $\boldsymbol{S}\mid\boldsymbol{X}\sim{\mathcal{W}}_{p}(\nu+p+q-1,[(\boldsymbol{X}-\mathbf{M})\boldsymbol{\Omega}^{-1}(\boldsymbol{X}-\mathbf{M})^{T}+\boldsymbol{\Sigma}]^{-1})$ [11].

II-B ML Estimation of the MxV $t$ parameters

The MxV $t$ distribution does not have closed-form ML estimators so we provide an Expectation/Conditional Maximization Either (ECME) algorithm [7] in a manner that is similar to that used to find ML parameter estimates in the MVT distribution, with the main contribution being the extension to the matrix variate case by deriving the estimates in terms of a matrix variate normal mixture with a Wishart distribution rather than a multivariate normal mixture with a chi-squared distribution.

Let $\boldsymbol{X}_{i},i=1,2,\ldots,n$ be IID realizations from $t_{p,q}(\nu,\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ . Write $\boldsymbol{\Theta}\equiv\{\nu,\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega}\}$ . For each $i=1,2,\ldots,n$ , let $\boldsymbol{S}_{i}$ be (unobserved) Wishart-distributed random matrices that are as per Property 2.

From the detailed development and derivations provided in the supplement Section S-5.2, we get the expectation step (E-step) updates at the current value $\boldsymbol{\Theta}^{(t)}$ of $\boldsymbol{\Theta}$ by taking the expected values of the $\boldsymbol{S}_{i}$ given the current value of $\boldsymbol{\Theta}^{(t)}$ :

[TABLE]

where $\psi_{p}(\cdot)$ is the $p$ -variate digamma function – that is, $\psi_{p}(x)=\frac{d}{dx}\log\Gamma_{p}(x)$ . Further computational and notational reductions are possible by defining and storing the updates in terms of the expected sufficient statistics

[TABLE]

These statistics can be expressed with $(\nu^{(t)}+p+q-1)$ factored out, and for convenience may be computed and stored as such when $\nu$ needs to be estimated. These quantities can be computed in $\boldsymbol{O}(npq^{2})+\boldsymbol{O}(np^{2}q)+\boldsymbol{O}(np^{3})$ flops.

The M-step updates are split into two conditional maximization steps, one updating $(\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ and one updating $\nu$ . The first step is conceptually immediate and maximizes Equation (S-5) in the supplement with respect to $\boldsymbol{\Theta}$ to yield $\boldsymbol{\Theta}^{(t+1)}$ , with updates of $(\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ given $\nu^{(t)}$ as follows:

[TABLE]

These quantities can be computed in $\boldsymbol{O}(pq^{2})+\boldsymbol{O}(p^{2}q)+\boldsymbol{O}(p^{3})$ flops. As discussed in Section S-5.21, if the previous $\boldsymbol{\Sigma}^{(t)}$ and $\boldsymbol{\Omega}^{(t)}$ were positive definite, then the updates exist and $\widehat{\boldsymbol{\Sigma}}$ is positive definite, though the necessary or sufficient conditions for $\widehat{\boldsymbol{\Omega}}$ to be positive definite (a.s.) are not known.

The conditional maximization of $\nu$ given $(\mathbf{M}^{(t+1)},\boldsymbol{\Sigma}^{(t+1)},\boldsymbol{\Omega}^{(t+1)})$ can be sped up substantially by maximizing it instead over the observed log-likelihood function given $(\mathbf{M}^{(t+1)},\boldsymbol{\Sigma}^{(t+1)},\boldsymbol{\Omega}^{(t+1)})$ . We get the ML estimating equation (MLEE):

[TABLE]

Writing $\kappa=\nu+p+q-1$ for notational compactness, we have,

[TABLE]

where $\boldsymbol{Z}_{*}^{(t+1)}$ is the appropriate $\boldsymbol{S}_{*}^{(t+1)}$ statistic with $(\nu^{(t)}+p+q-1)$ factored out. The MLEE can be solved using a one-dimensional search, yielding an ECME algorithm with the steps:

E-step: Update $\boldsymbol{S}_{i}$ weights and statistics based on $\Theta^{(t)}$ and $\boldsymbol{X}_{i}$ . 2. 2.

CME-step: Update $\Theta_{1}^{(t+1)}=(\mathbf{M}^{(t+1)},\boldsymbol{\Sigma}^{(t+1)},\boldsymbol{\Omega}^{(t+1)})$ . 3. 3.

CME-step: Update $\Theta_{2}^{(t+1)}=\nu^{(t+1)}$ using the observed log-likelihood given $(\mathbf{M}^{(t+1)},\boldsymbol{\Sigma}^{(t+1)},\boldsymbol{\Omega}^{(t+1)})$ .

Repeat these steps until convergence.

As explained in Section S-5.21. each iteration of this algorithm takes $\boldsymbol{O}(npq^{2})+\boldsymbol{O}(np^{2}q)+\boldsymbol{O}(np^{3})$ flops in addition to the number of iterations required to estimate $\nu$ in the second CME step. This suggests that the orientation of the matrices should be chosen such that the row dimension $p<q$ , a suggestion that is also experimentally verified in Section III-A.

We conclude here by noting that restrictions on the parameter set [12], such as imposing an $m$ th-order auto-regressive structure (AR( $m$ )) on either $\boldsymbol{\Sigma}$ or $\boldsymbol{\Omega}$ or both as in our applications in Section III can be easily incorporated within our algorithm (see Section S-5.22).

II-C Discrimination and Classification

Linear (LDA) and Quadratic Discriminant Analysis (QDA) for matrix-variate populations follow a similar approach as for the multivariate case, with the MxVN (but not MxV $t$ ) cases affording substantial reductions in the computations. We provide here the general framework for matrix-variate distributions and then discuss reductions for the cases of the MxVN models.

Suppose that there are two populations $\pi_{1}$ and $\pi_{2}$ , with prior probabilities $\eta_{1}$ and $\eta_{2}$ for an observation belonging to either. Let $\mathbb{P}(1|2)$ be the probability of classifying a member of $\pi_{2}$ to $\pi_{1}$ (and vice versa). As usual, the total probability of misclassification (TPM) is defined to be $\mathbb{P}(2|1)\eta_{1}+\mathbb{P}(1|2)\eta_{2}$ . A Bayes optimal classification rule that minimizes the TPM assigns a matrix-valued observation $\boldsymbol{X}$ to $\pi_{1}$ if $\frac{f_{1}(\boldsymbol{X})}{f_{2}(\boldsymbol{X})}\geq\frac{\eta_{2}}{\eta_{1}}$ , where $f_{i}(\boldsymbol{X})$ is the PDF for group $\pi_{i}$ evaluated at $\boldsymbol{X}$ [13]. The classification rule can be easily extended to the case when there are $G$ groups $\pi_{1},\pi_{2},\ldots,\pi_{G}$ , each with prior probabilities of membership $\eta_{1},\eta_{2},\ldots,\eta_{G}$ and densities $f_{1},f_{2},\ldots,f_{g}$ . Then the Bayes optimal classification for a matrix-valued observation $\boldsymbol{X}$ is $\operatornamewithlimits{argmax}_{i\in\{1,2,\ldots,g\}}R_{i}(\boldsymbol{X})$ , where the cost function $R_{i}(\boldsymbol{X})$ is defined as $\log\eta_{i}f_{i}(\boldsymbol{X})$ .

Unlike for the MxV $t$ distributions, the MxVN case has closed-form solutions analogous to that of LDA or QDA in multivariate statistics. For the MxVN populations, the closed-form classification rule assigns $\boldsymbol{X}$ to the $g$ th group where $g=\operatornamewithlimits{argmax}_{i=1,2,\ldots,G}R_{i}(\boldsymbol{X})$ , with

[TABLE]

The first and last term disappear when the $G$ MxVN populations have common covariances, yielding a linear decision rule. Many adaptations [14, 15, 16, 17, 18, 19, 20, 21, 22, 23] of LDA exist for homogeneous MxVN populations, but our development provides a natural and direct approach that is also flexible enough to include a range of assumptions. Assuming homogeneity does not yield a linear rule for MxV $t$ populations where we still get a quadratic rule. Finally, in all cases, the parameters in $R(\boldsymbol{X})$ can be estimated using ML on the training set (with the ECME methodology of Section II-B for MxV $t$ populations) and incorporated into the decision rule.

III Performance Evaluations

This section evaluates performance of the ECME algorithm in recovering the MxV $t$ parameters and also classification performance of our methodology on some real-life datasets.

III-A Simulation Study

Our simulation study generated 200 datasets from the $t_{5,3}(\nu,\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ distribution with $\nu=5,10,20$ and $n\in\{35,50,100\}$ , with the smallest $n$ chosen to be larger than the number of parameters to be estimated, which was also large enough for all but one of the simulations to converge, and the larger sample sizes were chosen to give an idea of consistency of parameter estimation. The ECME algorithm in Section II-B, with unconstrained $(\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ was used to estimate the parameters. Figure 1 summarizes the estimated $\hat{\nu}$ over the 200 samples for each $\nu$ . (We constrain $\hat{\nu}$ to be in $(2,1000)$ in the Either step of the ECME algorithm.)

As expected, higher $n$ improves both accuracy and precision of the estimates. For all nine cases, the peak of the distribution of $\hat{\nu}$ was close to the true $\nu$ value. Lower values for $\nu$ were more easily estimated in the sense that for any $n$ , the $\hat{\nu}$ values are closer to the true $\nu$ (Figure 1). This may be because for larger true $\nu$ , the distributions are similar in a wider range (after all, as $\nu\rightarrow\infty$ , the distribution reduces to the MxVN). For one aberrant sample with $\nu=20$ and $n=35$ , the optimizer attained the upper bound and did not converge to an interior point. On the whole, however, our simulation results indicate good performance of the ECME algorithm in recovering the MxV $t$ parameters. Additional information about convergence and the recovery of the center and scatter parameters is contained in Section S-5.31.

In another simulation study, the speed of the algorithm was demonstrated for $p=5,\,25,\,100$ , $q=5,\,25,\,100$ , and $N=100,\,500$ with [math] mean and identity spread parameters for $\nu=5$ . Figure LABEL:fig:timing summarizes the results of the simulation. As the derivation in Section II-B suggests, the row dimension $p$ dominates when determining the speed of the computation. Increasing column dimension $q$ for a given row dimension $p$ seems to hasten convergence to some extent, likely because they act to effectively increase the sample size, which reduces the number of iterations the algorithm needs to run. This suggests the orientation should be chosen so that $p<q$ .

Section S-5.31 also details results from a a simulation study showing the performance of the method when the type of model (MxV $t$ or MxVN) or degrees of freedom $\nu$ are misspecified. We see that models that are fit with $\nu$ closer to the true value perform better than those that do not. Also, as expected, the performance using MxV $t$ more closely approaches that under the MxVN model as $\nu$ increases.

III-B Classification Examples

We evaluate MxV $t$ classification and discrimination on four different datasets.

III-B1 Matching Fractured Surfaces

Our first example is on the potential ability of our classification algorithm to distinguish between pairs of fractured surfaces into matches or non-matches, with implications in forensics to decide on, say, whether a knife blade fragment found at a crime scene is a match to something that visually appears to be the remainder of the blade. Because of the novelty of this application, we discuss it at some length here. Our investigation is a formal proof-of-concept conducted in the lab where a set of 38 stainless steel knives had their blades broken under similar conditions, resulting in each of them having a base and a tip. The cross-sectional fractured surfaces were then scanned using a standard non-contact 3D optical interferometer at 9 regularly-spaced locations to get 9 successive $1024\times 1024$ images (with 75% overlap, in order to get a reasonable number of replications while also imaging the entire length of the exposed surfaces.

Cross-correlations between matching knife base-tip image pairs in the 5-10 $\mu m^{-1}$ and 10-20 $\mu m^{-1}$ two-dimensional (2D) Fourier frequencies were computed, yielding, for each knife, a $2\times 9$ matrix of measurements describing the similarity of the base of the knife to the tip (2 measured cross-correlations per image and 9 images). Similar cross-correlations between all possible knife base-tip pairs (regardless of origin) yielded a sample from the population of similarity matrices coming from known matching (KM) and known non-matching (KNM) base-tip pairs.

Figure LABEL:fig:knifecorr shows the scatterplot of the Fisher’s $Z$ -transformed [24] cross-correlation data to be fairly elliptical. The two classes are almost but not completely separated when looking at individual image pairs. Classification using only one pair of images per surface rather than a set of multiple images is potentially ambiguous. We remove this potential ambiguity by considering multiple images on each surface. These multiple sets of images on each knife are not independent and have a natural multivariate repeated measures (i.e. matrix-variate) structure because of the 75% overlap between successive images so a model incorporating such structure may improve classification accuracy.

We model each match/non-match dataset in terms of the MxV $t$ distribution with group-specific mean matrix and matrix dispersion structures, with an AR(1) correlation structure for the Fourier domain correlations at the same frequency band between successive (overlapping) image pairs. The AR(1) structure is appropriate because of the overlap between successive images: this correlation structure also has the best Bayesian Information Criterion (BIC) among the correlation structures tested on the data [25]. The mean across the images for each frequency band was constrained to be constant. Because there are only 9-10 observations for the cases where the knife tip-base have the same origin, we forgo estimating $\nu$ and instead investigate classification with the MxV $t$ distribution with $\nu=5$ and $\nu=10$ (in addition to the MxVN).

Figure LABEL:fig:knifeclass displays the distribution of the log-odds of being a match for the models based on each of the four training sets. The models trained on each set were then tested on the data from all four sets of surfaces. In this figure, a positive log-odds indicates a higher probability of being a KM and a negative log-odds indicates a higher probability of being a KNM. With equal priors, there is a 0% false exclusion (false negative) rate and a 0.003% false identification (false positive) rate (1 FP). The only FP is from the MxVN model, which is also overly confident about the matches it produces. It predicts some surfaces being a match with log-odds greater than 200, which is extremely implausible. The MxV $t$ distribution accounts for uncertainty better and results in more plausible log-odds ratios. This is because the normal distribution is much more thin in its tails than the $t$ -distribution, and increasing the dimensionality as occurs in the matrix variate case multiplies this effect. This means the MxVN will penalize observations far from the center of the class more than the MxV $t$ . Perfect discrimination is attained with MxV $t$ for all four training sets, suggesting that the results generalize well to out-of-sample data despite the relatively small sample size. For comparison, we also obtained predictions using the penalized likelihood approach of [23] which works only when at least two sets of knives are used as training sets, two sets as a validation set for the tuning parameters, and the rest as the test set. We were able to obtain perfect classification for all permutations of the six sets when an appropriate grid of tuning parameters is used (two additional sets of images taken from one set of knives were used to make a total of six sets). However, the method forced the scatter matrices to be diagonal, which is unlikely to be reasonable given the 75% overlap between successive images.

III-B2 Finger-tapping Experiment

[26] provided 12 functional Magnetic Resonance Imaging (fMRI) scans of the brain of a right-hand dominant male subject during a right-hand finger-thumb opposition activity and 12 similar scans using the left-hand, with each pair of scan collected at regular intervals over a 2-month period. We restrict attention to the 20th slice of the image volume, with $128\times 128$ pixels, that previous work [27, 28] indicated as adequate to distinguish activation between the left- and right-hand finger-tapping. With only 12 observations per class, we are limited in the types of correlation matrices that we may consider, so we selected a $20\times 20$ section of the 20th slice having the left-topmost pixel at $(33,67)$ , which was the $20\times 20$ section of the slice displaying the highest average activation in the left-hand activation images as determined by [29]’s FAST-fMRI algorithm. We then trained and tested the classifiers using the leave-one-out method with an AR(1) covariance structure and a compound symmetry covariance structure in the MxVN and MxV $t$ distributions with $\nu=5$ or $\nu=10$ (for the MxV $t$ ). The BIC on the fitted models indicated that a compound symmetry covariance structure was the best model. In all cases, except that of the MxV $t$ distribution with $\nu=10$ and an AR(1) covariance structure, 23 out 24 images were correctly classified. The one mislabeled case was the same one that was previously identified by [28] as an outlier. Using the MxV $t$ distribution with $\nu=10$ had one more misclassification. The number of cases for this reduced dataset is not enough for MatrixLDA to estimate the correlation structure so we forgo that comparison here.

III-B3 Landsat Satellite Data

Multi-spectral satellite imagery allows for multiple observations over a spatial grid, yielding matrix-valued observations. We examine a set of satellite images [30] that are in two visible and two infrared bands. The subset of images under consideration [31] consists of a training and a test set of $3\times 3$ pixel segments labeled according to the terrain type (961 gray soil, 415 damp gray soil, and 470 soil with vegetation stubble segments, for 1846 total observations in the training set and 397, 311, 237, and 845 total in the test set) of their middle pixel. Each observation, then, is a 9-pixel segment with a label according to soil type, and the problem is to predict the soil type from the pixel values. Regarding the data as a $4\times 9$ matrix and with an MxVN classifier and unconstrained covariance matrices yielded an error rate of 0.116 [31], while MatrixLDA with tuning parameters selected by 5-fold CV [23] yielded a 0.118 error rate. Our MxVN and MxV $t$ models (the latter with $\nu=10$ and 20) with unconstrained covariance matrices and prior probabilities equal to the class representation in the training set yielded error rates of 0.126, 0.116, and 0.109, in line with previous results. BIC indicated that using unconstrained covariance matrices and means constrained to be equal within rows as a better model, with error rates of 0.123, 0.121, and 0.107.

III-B4 Cambridge Hand Gestures Data

We tested our method using leave-one-out cross-validation (LOOCV) on the set of 80 images extracted from the Cambridge hand gestures database [32] as processed by [23] into $80\times 60$ pixel gray-scale images. There are four classes in this problem: the images show a hand gesture in one of two shapes and one of two orientations: in each image, the hand is either in a flat or “V” shape and is located either in the center of the image or to the left side of the image. We fit models with an AR(1) structure on both dimensions, compound symmetric structure on both, and an unconstrained covariance structure, with 5 and 10 degrees of freedom for the MxV $t$ distribution. The AR(1) structure provided the best fit according to BIC, and by using it we were able to obtain a 100% classification rate using LOOCV on the dataset. [23] report a 90% correct classification rate using LOOCV on this dataset.

IV Conclusions

We have provided an ECME method for fitting the parameters of the MxV $t$ distribution that can be used on three-way data sets such as multivariate repeated measures, image or spatial data, and have demonstrated the method on simulation datasets and on classification and discrimination in four real-world applications where the new method using the MXV $t$ -distribution outperforms that using the MxVN. The ECME algorithm and the discriminant analysis are implemented in the R package MixMatrix. The package also includes functions for sampling from and computing the density of the MxVN and MxV $t$ distributions and includes the datasets used in this paper.

Our model can be extended beyond supervised learning to mixture model-based clustering and can be made to accommodate more specialized covariance structures such as those described in [33] and [34]. It may also be readily extended to cases with incomplete records. Determining the existence, convergence and uniqueness properties would also be desirable. For instance, we know how many observations are required to have unique ML estimates of the parameters in the MxVN distribution with unconstrained mean and covariance matrices but such results may be useful to develop for the MxV $t$ or the constrained MxVN. Nevertheless, the EM algorithm is guaranteed to converge to a local stationary point, provided it is initialized where the log-likelihood function is finite [35]. Finally, another area that could benefit from further development is the extension of MatrixLDA to include the MxV $t$ distribution, where we believe our development in this paper will be helpful.

Funding Information

This research was supported in part by the National Institute of Justice (NIJ) under Grants No. 2015-DN-BX-K056 and 2018-R2-CX-0034. The research of the second author was also supported in part by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health (NIH) under Grant R21EB016212, and the United States Department of Agriculture (USDA) National Institute of Food and Agriculture (NIFA) Hatch project IOW03617. The content of this paper is however solely the responsibility of the authors and does not represent the official views of the NIJ, the NIBIB, the NIH, the NIFA or the USDA.

Supplementary Materials

S-5 The EM algorithm for parameter estimation in the MxV $t$ distribution

As mentioned in the paper, the MxV $t$ distribution does not have closed-form ML estimators so we develop an EM algorithm by augmenting the data, in similar spirit as done for the vector-multivariate $t$ -distribution [36], and then present an ECME (Expectation/Conditional Maximization Either) algorithm [7] to improve the speed of convergence of the EM algorithm. Let $\boldsymbol{X}_{1},\boldsymbol{X}_{2},\ldots,\boldsymbol{X}_{n}$ be independent realizations from the $t_{p,q}(\nu,\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ density. Then each $\boldsymbol{X}_{i}$ can be augmented with latent Wishart-distributed weight matrices $\boldsymbol{S}_{i}$ as follows:

[TABLE]

To show the benefits of using the latent $\boldsymbol{S}_{i}$ s, we first derive ML estimators with the complete data and then use that to derive an EM algorithm using only the observed data. We then modify the EM algorithm to its more efficient ECME derivative.

S-5.1 ML Estimation of parameters with complete data

Suppose that we have $(\boldsymbol{X}_{i},\boldsymbol{S}_{i}),i=1,2,\ldots,n$ where each $\boldsymbol{S}_{i}\sim{\mathcal{W}}_{p}(\nu+p-1,\boldsymbol{\Sigma}^{-1})$ and $\boldsymbol{X}_{i}\mid\boldsymbol{S}_{i}\sim{\mathcal{N}}_{p,q}(\mathbf{M},\boldsymbol{S}^{-1},\boldsymbol{\Omega})$ for each $i=1,2,\ldots,n$ . Then the complete log-likelihood function $\ell_{c}$ of the parameters $(\mathbf{M},\boldsymbol{\Omega})$ given the data $(\boldsymbol{X}_{i},\boldsymbol{S}_{i}),i=1,2,\ldots,n$ can written as a sum of (conditional) MxVN log-likelihood functions $\ell_{N}$ and a sum of Wishart log-likelihood functions $\ell_{W}$ :

[TABLE]

From the definitions of the MxVN and Wishart distributions, we have, after ignoring additive constants,

[TABLE]

and

[TABLE]

To simplify computation of the ML estimators and their notation, we define the following complete data sufficient statistics for the parameters:

[TABLE]

Taking derivatives of log-likelihoods yields the ML estimates:

[TABLE]

The ML estimate of $\nu$ can be obtained by finding the root of the equation:

[TABLE]

with $\psi_{p}(\cdot)$ the $p$ -variate digamma function, defined as $\psi_{p}(x)=d\log\Gamma_{p}(x)/dx$ . The ML estimate of $\nu$ may be obtained numerically by a one-dimensional search algorithm. We now use the development in this section in our EM algorithm for a sample from the MxV $t$ distribution.

S-5.2 Estimating parameters from a MxV $t$ sample

S-5.21 The EM algorithm

Let $\boldsymbol{X}_{i},i=1,2,\ldots,n$ be independent identically distributed realizations from $t_{p,q}(\nu,\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ . As in the main article, we write $\boldsymbol{\Theta}\equiv\{\nu,\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega}\}$ . From the development in the introduction of this section, for each $i=1,2,\ldots,n$ , let $\boldsymbol{S}_{i}$ be (unobserved) random matrices as per Equation (S-4) and Property 2. Then the expected complete log-likelihood function is

[TABLE]

E-step

Using Property 2 and properties of the Wishart distribution, the expectation step (E-step) updates at the current value $\boldsymbol{\Theta}^{(t)}$ of $\boldsymbol{\Theta}$ are, by taking the expected values of the $\boldsymbol{S}_{i}$ given the current value of $\boldsymbol{\Theta}^{(t)}$ :

[TABLE]

with $\psi_{p}(\cdot)$ as the $p$ -variate digamma function. Note that the updates for $\boldsymbol{S}_{i}^{(t+1)}$ exist by construction if the $\boldsymbol{\Sigma}$ and $\boldsymbol{\Omega}$ are positive definite. We define and store the expected sufficient statistics to reduce computational calculations and for notational convenience:

[TABLE]

with the last expression needed only when we are also estimating $\nu$ . In that case, these statistics can be expressed with $(\nu^{(t)}+p+q-1)$ factored out, and for convenience may be computed and stored that way when $\nu$ needs to be estimated. These quantities can be computed in $\boldsymbol{O}(npq^{2})+\boldsymbol{O}(np^{2}q)+\boldsymbol{O}(np^{3})$ flops.

Maximization step

Based on the updated weight matrices $\boldsymbol{S}_{i}^{(t+1)}$ and statistics based on $\Theta^{(t)}$ and $\boldsymbol{X}$ , we get the updates:

[TABLE]

This can be computed in $\boldsymbol{O}(p^{2}q)+\boldsymbol{O}(pq^{2})$ flops, which is negligible compared to the E-step computations. Again, treating the set of $\boldsymbol{S}_{i}^{(t+1)}$ as observed, the MLE of $\nu$ can be obtained:

[TABLE]

Defining $\kappa^{(t)}=\nu^{(t)}+p+q-1$ for compactness:

[TABLE]

where $\boldsymbol{Z}_{*}$ is the appropriate $\boldsymbol{S}_{*}$ statistic with $(\nu^{(t)}+p+q-1)$ factored out and $\psi_{p}$ is the $p$ -dimensional digamma function. This can be solved using a 1-dimensional search.

Since each $\boldsymbol{S}_{i}$ is positive definite by construction if the previous $\boldsymbol{\Sigma}^{(t)}$ and $\boldsymbol{\Omega}^{(t)}$ were positive definite, the updates $\widehat{\boldsymbol{\Sigma}}$ and $\widehat{\mathbf{M}}$ exist. The conditions for the positive definiteness of the update $\widehat{\boldsymbol{\Omega}}$ are less clear: it is the sum of matrices only guaranteed to be positive semi-definite and we do not have a proof of the necessary or sufficient sample size to guarantee the update is positive definite (a.s.) as required for the method. A solution for $\hat{\nu}$ is guaranteed to exist as long as $\widehat{\boldsymbol{\Sigma}}$ and $\widehat{\boldsymbol{\Omega}}$ exist and are positive definite.

ML Estimation with the Expectation/Conditional Maximization Either (ECME) algorithm

First we note that, if $\nu$ is known, there is no need to partition the M-step into multiple constrained maximization steps. If $\nu$ is required to be estimated, there is no difference between a standard EM and a standard ECM (Expectation/Conditional Maximization) algorithm in this setting, since, as in the case of the multivariate $t$ distribution, the complete data likelihood function factorizes into $\Theta_{1}=(\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ and $\Theta_{2}=(\nu)$ . However, by partitioning it in this way, it is possible, similarly to the case of the multivariate $t$ , to find a more efficient method of maximization. This is desirable because the M-step for $\nu$ can be slow. Here we present an ECME (Expectation/Conditional Maximization Either) algorithm that first maximizes the expected log-likelihood for $(\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ and then maximizes the actual log-likelihood over $\nu$ given the current values $(\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ , similar to [7].

Given $\Theta_{1}=(\mathbf{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ , we can maximize for $\nu$ in Equation (1), yielding the set of equations provided in (S-5.21)

[TABLE]

The difference is that the solution for $\nu^{(t+1)}$ no longer depends on $\nu^{(t)}$ , Solving this equation is slightly more computationally complex than solving Equation (S-6) ( $\nu$ appears four times in the equation to be solved rather than once) but this converges in fewer total iterations. The ML estimating equation can be solved by a one-dimensional search, providing a ECME algorithm with the steps (as also provided in the main article):

E-step: Update $\boldsymbol{S}_{i}$ weights and statistics based on $\Theta^{(t)}$ and $\boldsymbol{X}$ . 2. 2.

CME-step: Update $\Theta_{1}^{(t+1)}=(\mathbf{M}^{(t+1)},\boldsymbol{\Sigma}^{(t+1)},\boldsymbol{\Omega}^{(t+1)})$ . 3. 3.

CME-step: Update $\Theta_{2}^{(t+1)}=\nu^{(t+1)}$ using the observed log-likelihood given the current values $(\mathbf{M}^{(t+1)},\boldsymbol{\Sigma}^{(t+1)},\boldsymbol{\Omega}^{(t+1)})$ by solving Equation (S-5.21).

Repeat these steps until convergence. Each iteration of this algorithm takes $\boldsymbol{O}(npq^{2})+\boldsymbol{O}(np^{2}q)+\boldsymbol{O}(np^{3})$ flops plus the number of iterations required by the second CME step.

S-5.22 Fitting with restrictions on the

parameters

In some settings, restrictions on the parametrization of the center or scatter matrices are appropriate. In this section, we derive solutions in the cases of center matrices that are constant across rows, columns, or the entire matrix. In [12] some results for restrictions on covariance matrices were derived and in this paper AR(1) covariance structures and compound symmetry (CS) variance structures were used; however, they were fit numerically as closed forms for the derivatives and determinants exist. Let $\mathbf{1}_{p,q}$ denote a $(p\times q)$ matrix consisting only of $1$ s. Then it can be shown that these are the appropriate M-step estimates for certain mean matrix constraints:

[TABLE]

which can be used to simplify the ECME algorithms further.

S-5.3 Performance Evaluations

S-5.31 Simulation Study

In the main paper, results pertaining to the recovery of the $\nu$ parameter were reported for a simulation study where 200 datasets were produced for $\nu=5,10,20$ and $n=35,50,100$ with a 0 mean matrix and identity scatter matrices. Here we report also the results for the recovery of the mean and covariance parameters. For $\boldsymbol{X}\sim t(\nu,\boldsymbol{M},\boldsymbol{\Sigma},\boldsymbol{\Omega})$ , we have the result that $\mathrm{cov}(\mathrm{vec}(\boldsymbol{X}))=\boldsymbol{\Sigma}\otimes\boldsymbol{\Omega}/(\nu-2)$ . To compare all nine sets of simulations on the same scale, we correct each by the appropriate scaling factor such that each has an identity covariance matrix and then report the root mean square difference between the actual and fitted $\hat{\boldsymbol{M}}$ and $\hat{\boldsymbol{\Sigma}}\otimes\hat{\boldsymbol{\Omega}}$ in Figure S-1. The figures indicate performance improves as the sample size increases and indicates good recovery of the parameters in every case.

We provide a second simulation study to address concerns about model misspecification, namely, what happens when a matrix $t$ distribution model is treated as a matrix normal or vice versa. Three datasets of size 100 with mean matrix 0 and parameters $\boldsymbol{\Sigma}$ a $5\times 5$ AR(1) matrix with $\rho=0.7$ and $\boldsymbol{\Omega}$ a draw from a standard Wishart distribution with $\nu=10$ and dimension 8, with one dataset from a MxV $t$ distribution with 6 degrees of freedom, one with 20 degrees of freedom, and one from a MxVN distribution.

In Figure S-2, we plot the log-likelihood, squared deviation from the mean, and the $L^{2}$ distance between the true and estimated covariance matrix. The top two rows indicate the results for the MxV $t$ with $\nu=6$ and $20$ and the bottom indicates the results for the MxVN, fitted to a MxVN and to MxV $t$ models with $\nu=3,4,\ldots,100$ . On the MxV $t$ with $\nu=6$ and $20$ , the MxVN performed poorly compared to the MxV $t$ with $\nu$ near the true parameter values. On the MxVN, the MxV $t$ performed poorly.

For all of the datasets, the MxVN has slightly worse recovery of the mean matrix than the MxV $t$ distributions while the MxVN had estimates of the covariance matrix that were comparable to the best MxV $t$ estimates. The $L^{2}$ norm of the covariance matrix was not accurate for low values of $\nu$ .

The behavior here is suggestive of what occurs in the results when the method fails to converge. Simulations that fail to converge slowly increase likelihood as $\nu$ increases until either the maximum number of iterations or the upper bound of $\nu$ is reached. This scenario occurs more frequently when simulating from distributions with large $\nu$ and small sample sizes or simulating from MxVN distributions with modest sample sizes (for larger sample sizes, even an MxVN will usually converge to some distribution with large $\nu$ ). As Figure S-2 indicates, the likelihood surface is very flat for a true MxVN across values of $\nu$ . With a small sample size and $\nu$ not small, this may occur there was well.

S-5.32 Matching Fractured Surfaces

The knife surfaces were scanned using a standard non-contact 3D optical interferometer in corresponding regions, then the 2D Fourier frequencies were computed and compared. In Figure LABEL:fig:knifeimg, we illustrate one pair of corresponding images (out of 9) from one of the knife base-tip pairs (out of 38). On the left are a visualization of the output of the 3D optical interferometer for the two surfaces. Note that the images are presented as-is - they should fit together when one is flipped over. The blue depressed region on the top corresponds to the red elevated region on the bottom. On the right is a visualization of the 2D Fourier transform with the frequency ranges used for comparison highlighted - the two bands between the “low frequency” and “high frequency” region.

Bibliography36

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] A. K. Gupta and D. K. Nagar, Matrix Variate Distributions . CRC Press, 1999, vol. 104.
2[2] C. Viroli, “Model based clustering for three-way data structures,” Bayesian Analysis , vol. 6, no. 4, pp. 573–602, 12 2011. [Online]. Available: https://doi.org/10.1214/11-BA 622 · doi ↗
3[3] L. Anderlucci and C. Viroli, “Covariance pattern mixture models for the analysis of multivariate heterogeneous longitudinal data,” The Annals of Applied Statistics , vol. 9, no. 2, pp. 777–800, 06 2015. [Online]. Available: https://doi.org/10.1214/15-AOAS 816 · doi ↗
4[4] J. M. Dickey, “Matricvariate generalizations of the multivariate t 𝑡 t distribution and the inverted multivariate t 𝑡 t distribution,” The Annals of Mathematical Statistics , vol. 38, no. 2, pp. 511–518, 04 1967. [Online]. Available: https://doi.org/10.1214/aoms/1177698967 · doi ↗
5[5] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society. Series B (methodological) , pp. 1–38, 1977.
6[6] X.-L. Meng and D. B. Rubin, “Maximum likelihood estimation via the ECM algorithm: A general framework,” Biometrika , vol. 80, no. 2, pp. 267–278, 1993. [Online]. Available: http://dx.doi.org/10.1093/biomet/80.2.267 · doi ↗
7[7] C. Liu and D. B. Rubin, “The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence,” Biometrika , vol. 81, no. 4, pp. 633–648, 1994. [Online]. Available: http://www.jstor.org/stable/2337067
8[8] R Core Team, R: A Language and Environment for Statistical Computing , R Foundation for Statistical Computing, Vienna, Austria, 2019. [Online]. Available: https://www.R-project.org/

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Classification with the matrix-variate-ttt distribution

Abstract

Index Terms:

I Introduction

II Methodology

II-A Background and Preliminary Development

II-A1 The Matrix-variate Normal Distribution

Definition 1**.**

II-A2 The Matrix-variate ttt-distribution

Definition 2**.**

Properties:

II-B ML Estimation of the MxVttt parameters

II-C Discrimination and Classification

III Performance Evaluations

III-A Simulation Study

III-B Classification Examples

III-B1 Matching Fractured Surfaces

III-B2 Finger-tapping Experiment

III-B3 Landsat Satellite Data

III-B4 Cambridge Hand Gestures Data

IV Conclusions

Funding Information

Supplementary Materials

S-5 The EM algorithm for parameter estimation in the MxVttt distribution

S-5.1 ML Estimation of parameters with complete data

S-5.2 Estimating parameters from a MxVttt sample

S-5.21 The EM algorithm

E-step

Maximization step

ML Estimation with the Expectation/Conditional Maximization Either (ECME) algorithm

S-5.22 Fitting with restrictions on the

S-5.3 Performance Evaluations

S-5.31 Simulation Study

S-5.32 Matching Fractured Surfaces

Classification with the matrix-variate- $t$ distribution

Definition 1.

II-A2 The Matrix-variate $t$ -distribution

Definition 2.

II-B ML Estimation of the MxV $t$ parameters

S-5 The EM algorithm for parameter estimation in the MxV $t$ distribution

S-5.2 Estimating parameters from a MxV $t$ sample