Online Variational Bayesian Subspace Filtering with Applications

Charul; Uttkarsha Bhatt; Pravesh Biyani; Ketan Rajawat

arXiv:1906.09920·eess.SP·June 25, 2019

Online Variational Bayesian Subspace Filtering with Applications

Charul, Uttkarsha Bhatt, Pravesh Biyani, Ketan Rajawat

PDF

Open Access

TL;DR

This paper introduces a Bayesian subspace filtering method for dynamic data recovery that automatically determines model complexity and outperforms existing matrix completion techniques in real-world applications.

Contribution

It develops a variational Bayesian approach for time-varying subspace estimation with automatic relevance determination, reducing parameter tuning and computational complexity.

Findings

01

Outperforms state-of-the-art matrix/tensor completion algorithms

02

Effective in imputing missing data and rejecting outliers

03

Demonstrates superior temporal prediction on real datasets

Abstract

Matrix completion and robust principal component analysis have been widely used for the recovery of data suffering from missing entries or outliers. In many real-world applications however, the data is also time-varying, and the naive approach of per-snapshot recovery is both expensive and sub-optimal. This paper develops generative Bayesian models that fit sequential multivariate measurements arising from a low-dimensional time-varying subspace. A variational Bayesian subspace filtering approach is proposed that learns the underlying subspace and its state-transition matrix. Different from the plethora of deterministic counterparts, the proposed approach utilizes automatic relevance determination priors that obviate the need to tune key parameters such as rank and noise power. We also propose a forward-backward algorithm that allows the updates to be carried out at low complexity.…

Tables6

Table 1. TABLE I : Performance comparison for real time traffic estimation

	$p = 0.25$	$p = 0.50$	$p = 0.75$
	MRE	MRE	MRE
VBSF	0.1439	0.11277	0.09336
GROUSE	0.372	0.3446	0.3085
LRTC	0.1921	0.1418	0.09578
Mean	0.2083	0.2083	0.2083

Table 2. TABLE II : Comparison of running time for different algorithms 1

	$p = 0.25$	$p = 0.50$	$p = 0.75$
	time( $s e c$ )	time( $s e c$ )	time( $s e c$ )
VBSF	0.7001	0.8685	0.9675
GROUSE	0.7935	0.85324	0.923960
LRTC	2.92	4.32	6.23

Table 3. TABLE III : Performance comparison for traffic prediction

	$p = 0.50$	$p = 0.50$
	$15 m i n s$	$30 m i n s$
VBSF	0.15362	0.17434
LRTC	0.15843	0.1812
Mean	0.2082	0.2073

Table 4. TABLE IV : RVBSF: overall performance

	$c = 0.75$	$c = 0.75$	$c = 1.5$
	$p_{o} = 5$ %	$p_{o} = 2$ %	$p_{o} = 2$ %
VBSF	0.09462	0.09457	0.09434
VBSF_outlier	0.13406	0.11643	0.15318
RVBSF	0.11741	0.1127	0.10912

Table 5. TABLE V : Performance Comparison for Robust Traffic Estimation

	$c = 0.75$	$c = 0.75$	$c = 1.5$
	$p_{o} = 5$ %	$p_{o} = 2$ %	$p_{o} = 2$ %
OP-RPCA	0.2594	0.2298	0.2165
ROSETA	0.1859	0.1819	0.1723
GRASTA	0.1493	0.1507	0.1492
RVBSF	0.11741	0.1127	0.10912

Table 6. TABLE VI : Electricity real time load prediction

	$p = 0.25$ %	$p = 0.5$ %	$p = 0.75$ %
MRE	0.1789	0.101	0.0987
MAE(kW)	96.95	66.67	53.95

Equations107

p (y_{i τ} ∣ a_{i \cdot}, b_{τ}, β)

p (y_{i τ} ∣ a_{i \cdot}, b_{τ}, β)

p (b_{τ} ∣ J, b_{τ - 1})

p (b_{τ} ∣ J, b_{τ - 1})

p (B ∣ J) = N (b_{1}; μ_{1}, Λ_{1}) τ = 2 \prod t N (b_{τ} ∣ J b_{τ - 1}, I_{r}) .

p (B ∣ J) = N (b_{1}; μ_{1}, Λ_{1}) τ = 2 \prod t N (b_{τ} ∣ J b_{τ - 1}, I_{r}) .

p (A ∣ γ)

p (A ∣ γ)

p (J ∣ υ)

p (β)

p (β)

p (y_{Ω}, H)

p (y_{Ω}, H)

= τ = 1 \prod t i \in Ω_{τ} \prod N (y_{i τ} ∣ b_{τ}^{T} a_{i \cdot}, β^{- 1})

\times i = 1 \prod r [N (a_{i} ∣ 0, γ_{i}^{- 1} I_{m}) N (j_{i} ∣ 0, υ_{i}^{- 1} I_{r})]

\times N (b_{1}; μ_{1}, Λ_{1}) τ = 2 \prod t N (b_{τ} ∣ J b_{τ - 1}, I_{r}) \frac{1}{β} i = 1 \prod r \frac{1}{γ _{i} υ _{i}}

p (H ∣ y_{Ω}) \approx q (H) = q_{A} (A) q_{B} (B) q_{J} (J) q_{υ} (υ) q_{β} (β) q_{γ} (γ) .

p (H ∣ y_{Ω}) \approx q (H) = q_{A} (A) q_{B} (B) q_{J} (J) q_{υ} (υ) q_{β} (β) q_{γ} (γ) .

q_{B} (B)

q_{B} (B)

q_{a_{i \cdot}}

q_{j_{i \cdot}}

q_{β} (β)

q_{γ_{i}} (γ_{i})

q_{υ_{i}} (υ_{i})

\overset{υ}{^}_{i}

\overset{υ}{^}_{i}

\overset{γ}{^}_{i}

Σ_{τ, ι}^{B}

Σ_{τ, ι}^{B}

Σ_{i}^{J}

Σ_{i}^{A}

\hat{β} = \frac{ω}{\sum _{τ = 1}^{t} \sum _{i \in Ω_{τ}} [ y _{i τ}^{2} - 2 y _{i τ} ( μ _{i}^{A} ) ^{T} μ _{τ}^{B} + tr ( Σ _{i}^{A} Σ _{τ, τ}^{B} ) ]} .

\hat{β} = \frac{ω}{\sum _{τ = 1}^{t} \sum _{i \in Ω_{τ}} [ y _{i τ}^{2} - 2 y _{i τ} ( μ _{i}^{A} ) ^{T} μ _{τ}^{B} + tr ( Σ _{i}^{A} Σ _{τ, τ}^{B} ) ]} .

μ_{i}^{J}

μ_{i}^{J}

Ξ_{i}^{J}

μ_{i}^{A}

Ξ_{i}^{A}

μ^{B}

μ^{B}

[Ξ^{B}]^{- 1}

[Ξ^{B}]^{- 1}

+ Λ_{1}^{- 1} - \hat{J} ⋮ \dots - \hat{J} I_{r} + Σ^{J} ⋮ 0 \dots - \hat{J} - \hat{J} 0 \dots ⋮ I_{r} .

Ψ_{τ}

Ψ_{τ}

v_{τ}

\overset{˘}{Ξ}_{τ, τ + 1}^{B}

\overset{˘}{Ξ}_{τ, τ + 1}^{B}

\overset{˘}{Ξ}_{τ + 1, τ + 1}^{B}

\overset{˘}{μ}_{τ + 1}^{B}

Ξ_{τ, τ + 1}^{B}

Ξ_{τ, τ + 1}^{B}

Ξ_{τ, τ}^{B}

μ_{τ}^{B}

y_{i τ} = (μ_{τ}^{B})^{T} μ_{i}^{A} .

y_{i τ} = (μ_{τ}^{B})^{T} μ_{i}^{A} .

y_{i τ} = (\hat{J}^{τ - t} μ_{t}^{B})^{T} μ_{i}^{A} .

y_{i τ} = (\hat{J}^{τ - t} μ_{t}^{B})^{T} μ_{i}^{A} .

Q (H_{p}, H_{p}^{(ι)}) := E_{q_{h} (H_{h})} [lo g p (y_{Ω}, H_{h}, H_{p}^{(ι)})]

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsBlind Source Separation Techniques · Anomaly Detection Techniques and Applications · Traffic Prediction and Management Techniques

Full text

Online Variational Bayesian Subspace Filtering with Applications

Charul 1, Student Member, IEEE, Uttkarsha Bhatt2 , Pravesh Biyani1, Member, IEEE and Ketan Rajawat3, Member, IEEE 1The authors are with the Department of Electronics and Communication Engineering, Indraprastha Institute of Information Technology, Delhi, India.3K. Rajawat is with the Department of Electrical Engineering, Indian Institute of Technology, Kanpur, UP, India.

Abstract

Matrix completion and robust principal component analysis have been widely used for the recovery of data suffering from missing entries or outliers. In many real-world applications however, the data is also time-varying, and the naive approach of per-snapshot recovery is both expensive and sub-optimal. This paper develops generative Bayesian models that fit sequential multivariate measurements arising from a low-dimensional time-varying subspace. A variational Bayesian subspace filtering approach is proposed that learns the underlying subspace and its state-transition matrix. Different from the plethora of deterministic counterparts, the proposed approach utilizes automatic relevance determination priors that obviate the need to tune key parameters such as rank and noise power. We also propose a forward-backward algorithm that allows the updates to be carried out at low complexity. Extensive tests over traffic and electricity data demonstrate the superior imputation, outlier rejection, and temporal prediction prowess of the proposed algorithm over the state-of-the-art matrix/tensor completion algorithms.

I Introduction

Sensor measurements are often incomplete, noisy, and replete with outliers arising due to malfunctions or intermittent errors. Imputation of the missing entries and removal/segregation of the outliers is a critical first step that must be carried out prior to any data analytics. Examples of applications that benefit from such a pre-processing step include estimation/prediction of city-wide road traffic, regional air quality, electricity consumption in power distribution networks and foreground-background separation in videos. For most of these applications, the measurements can be arranged in form of a matrix, some of whose entries may be missing or contaminated with outliers. Pertinent approaches model the measurements as arising from a low-dimensional subspace whose recovery allows us to reject the noise and outliers, and impute the missing entries [1, 2, 3, 4, 5, 6].

Many real-world applications, including the aforementioned ones, involve time-varying data that arrives in a sequential manner and must be processed as such. As a result, the data matrices arising in such applications comprise of low-dimensional subspaces that evolve over time. While the classical matrix completion or robust principal component analysis (RPCA) approaches are still applicable to each snapshot of the data, the performance can generally be improved by exploiting the temporal correlations present in the measurements [7, 2, 8, 9, 10]. State-of-the-art approaches for processing time-varying subspaces can mostly be classified into approaches based on tensor completion [7] and regularized matrix completion [11]. A common feature of these techniques is their static perspective and the resulting focus on batch processing. In contrast however, the data streaming from the sensors may be inherently dynamic, arising from subspaces that evolve over time. Theoretical guarantees for the dynamic setting have been studied in [12]. Different from these approaches and closer to the classical time-series modeling, an online forecasting matrix completion approach was proposed in [8] where the underlying subspace was assumed to follow a linear state-space model and must be learned in an online fashion. Approaches based on matrix completion often involve a number of tuning parameters that must be correctly set in order to avoid over-fitting. However determining these parameters via cross-validation is quite challenging with time-series data, especially in the online setting [8]. Alternatively, probabilistic learning algorithms have been proposed for the static matrix completion, and are generally free of tuning parameters. Such approaches entail constructing generative models that are not only capable of modeling the data but are also simple enough to allow low-complexity updates.

This work considers the first low-rank robust subspace filtering approach for online matrix imputation and prediction. Different from the existing matrix and tensor completion formulations, we consider low-rank matrices whose underlying subspace evolves according to a state-space model. As incomplete columns of the data matrix arrive sequentially over time, the low rank components as well as the state-space model are learned in an online fashion using the variational Bayes formalism. In particular, component distributions are chosen to allow automatic relevance determination (ARD) and unlike the matrix or tensor completion works, the algorithm parameters such as rank, noise powers, and state noise powers need not be specified or tuned. A low-complexity forward-backward algorithm is also proposed that allows the updates to be carried out efficiently. Enhancements to the proposed algorithm, capable of learning time-varying state-transition matrices, operating with a fixed lag, and robust to outliers, are also detailed. Our approach is general and we demonstrate its efficacy on various settings. In particular, we discuss the traffic estimation problem in detail and show that the variational Bayesian approach can be used to impute road traffic densities in an online fashion and from only a few observations. As the proposed models are generative, the resulting traffic density predictions can also be used to obtain accurate expected time-of-arrival (ETA) estimates. Additionally, the applicability of the proposed algorithm on the electricity load estimation and prediction problem is also shown. The superior performance of our algorithm vis-a-vis other state of the art subspace tracking and online matrix factorisation algorithms may be attributed to the proposed state space model as well as the flexibility in the data modeling provided by the variational Bayesian approach. In summary, the contributions of the present work are as follows:

We present the variational Bayesian subspace filtering (VBSF) algorithm and demonstrate its ability to perform data modeling, imputation and temporal prediction in an online setting wherein the key algorithmic parameters are automatically tuned. 2. 2.

Robust version of the VBSF algorithm is also proposed for outlier removal and data cleansing. 3. 3.

Finally, we report a comprehensive comparison of our algorithm with various relevant (offline) matrix completion as well as online subspace estimation and tracking techniques, e.g, GROUSE [2], Low Rank Tensor Completion (LRTC) [7], GRASTA [9], ROSETA [10], OP-RPCA [13] and Online Forecasting Matrix Factorisation (OFMF) [14] over real-world traffic speed data as well as the electricity load data.

I-A Related work

Variational Bayesian approaches for matrix completion and robust principal component analysis are well known [3, 15, 16, 17, 18, 19, 20, 4, 21]. One of the first works considered the measured matrix to be expressible as a product of low-rank matrices, associated with appropriate ARD priors [3] while faster algorithms for similar settings were proposed in [15, 16]. More recently, other approaches towards modeling the measured matrices have also been proposed [17], [18]. Moreover, variational Bayesian approaches have also been applied to road traffic estimation; see e.g. [19]. However, these approaches do not explicitly model the evolution of the underlying subspace. Likewise, none of the existing variational Bayesian approaches for low rank matrix completion model the evolution of the subspace [3, 21, 18]. In contrast to these, the state-space modeling in our work is inspired from [20], where the low-complexity updates were first proposed in the context of linear dynamical models. The VBSF algorithm in the current work extends and generalizes that in [20] to incorporate low-rank structure and outliers.

On a related note, temporal evolution of the additive noise is modeled in [4] using a forgetting factor. Different from [4] however, we use a state-space model to capture the evolution of the underlying subspace. An online Bayesian matrix factorization model is also proposed in [13] wherein the time-stamps are directly incorporated as features. In contrast, the present model is more specific and suited to a slowly time-varying system.

Several non-Bayesian algorithms have been proposed to address the online subspace estimation problem from incomplete observations[2, 9, 13, 10]. GROUSE [2] is one of the early approaches that uses an update on the Grassmannian manifold to estimate the subspace. The robust variant of GROUSE, namely GRASTA , handles outliers by by incorporating the $l_{1}$ norm cost function[9]. OP-RPCA[13] is a robust subspace estimation technique that uses alternating minimization to compute the outliers and the underlying subspace. A number of online subspace tracking algorithms, such as ROSETA [10], have since been proposed. The proposed approach is compared with some of these algorithms in Sec. IV.

I-B Applications:

I-B1 Traffic Estimation and Prediction

Traffic estimation and prediction are the central components of any urban traffic congestion management system [22]. With the advent of smartphones, public transportation services as well as private on-demand transportation companies are increasingly relying on the availability of real-time traffic maps for resource allocation and logistics [23]. Such providers rely on probe vehicles — GPS enabled and possibly crowd-sourced agents that upload speed measurements and corresponding location tags at sporadic times. Since traffic densities are inferred from speed measurements, they are often ridden with outliers, e.g., corresponding to random velocity changes unrelated to traffic. The traffic estimation problem entails estimating traffic densities at locations and times where no measurements are available. Finally, prediction of traffic in the near future is necessary to calculate ETA, fastest route, and other related quality of service metrics for road users. The future traffic prediction problem becomes particularly challenging in regions with diverse modes of transport, such as in India, where ETA calculations must account for the multimodal nature of traffic [24, 25]. For instance the ETA calculations for buses should not only use traffic data meant for cars.A class of pertinent approaches have sought to visualize the traffic data as an incomplete matrix or tensor, and exploited this correlation to fill-in the missing entries [26, 27, 28, 19]. Complementary to these approaches, time-series modeling focuses on learning the temporal dynamics of traffic and generate predictions in an online manner [29]. While recent variants have incorporated spatial correlations as well, these techniques are generally unable to handle missing data or outliers. Finally, [14] presents the online forecasting matrix factorisation algorithm on the time series data that also handles the missing data scenario.

I-B2 Electricity Load Estimation and Prediction

Similar to the traffic data, the electricity load data also exhibits the spatial and temporal structure that can be exploited to impute the missing data while simultaneously removing the noisy outliers. Due to the environmental disturbance, communication error or sensor fault, it is inevitable that load data may be lost during the collection process [30].

This paper is organized as follows. Sec. II presents the online variational Bayesian subspace filtering method for traffic estimation and prediction. Sec. III presents the online robust variational Bayesian subspace filtering method for traffic estimation and prediction in case of outliers. Results and findings for traffic prediction and electricity load prediction are discussed in Sec. IV followed by conclusion in Sec. V.

Notation: Scalars are denoted by letters in regular font, while vectors (matrices) are denoted by bold face (capital) letters. For a matrix $\mathbf{A}$ , its transpose and trace are denoted by $\mathbf{A}^{T}$ and $\text{tr}\left(\mathbf{A}\right)$ , respectively. The $(i,j)$ -th element of a matrix $\mathbf{A}$ is denoted by $a_{ij}$ , the $i$ -th column by $\mathbf{a}_{i}$ or $[\mathbf{A}]_{\bm{\cdot}i}$ , and the $i$ -th row by $\mathbf{a}_{i\bm{\cdot}}^{T}$ or $[\mathbf{A}]_{i\bm{\cdot}}^{T}$ . The all-one vector of size $n\times 1$ is represented by $\mathbf{1}_{n}$ , while $\mathbf{I}_{n}$ denotes identity matrix of size $n\times n$ . The Frobenius norm for a matrix $\mathbf{A}$ and the Euclidean norm for a vector $\mathbf{a}$ are denoted by $\left\|\mathbf{A}\right\|$ and $\left\|\mathbf{a}\right\|$ , respectively. The multivariate Gaussian probability density function (pdf) with mean vector $\bm{\mu}$ and covariance matrix $\bm{\Sigma}$ evaluated at $\mathbf{x}\in\mathbb{R}^{n}$ is denoted by $\mathcal{N}(\mathbf{x}\mid\bm{\mu},\bm{\Sigma})$ . Likewise, Ga $(x,a,b)$ denotes the Gamma pdf with parameters $a_{x}$ and $b_{x}$ evaluated at $x\in\mathbb{R}_{+}$ . The expectation operator is symbolized by $\mathbb{E}$ while the pdf is generically denoted by $p(\cdot)$ . Given data $\mathbf{D}$ , the posterior mean is given by $\hat{\mathbf{x}}:=\mathbb{E}[\mathbf{x}\mid\mathbf{D}]$ .

II Variational Bayesian Subspace Filtering

We consider a scenario where the data with the missing entries is arriving in a sequential manner. The data can be considered in the form of the matrix $\mathbf{Y}\in\mathbb{R}^{m\times t}$ , where $t$ denotes the number of time instances over which measurements are made and $m$ denotes the number of rows of the matrix $\mathbf{Y}$ . More generally, $\mathbf{Y}$ is an incomplete and growing matrix whose columns arrive sequentially over time. Specifically, for each column $\mathbf{y}_{\tau}$ with $1\leq\tau\leq t$ , only entries from the index set $\Omega_{\tau}\subset\{1,\ldots,m\}$ are observed. The algorithms developed here will seek to achieve the following two goals:

•

imputation which yields $\{\hat{y}_{i\tau}\}_{i\notin\Omega_{\tau}}$ for $1\leq\tau\leq t$ , and

•

prediction which yields $\{\hat{\mathbf{y}}_{t+\tau}\}_{\tau=1}^{T_{p}}$ where $T_{p}$ is the prediction horizon.

The next subsection develops a variational Bayesian algorithm for achieving the aforementioned goals.

II-A Hierarchical Bayesian Model

We begin with detailing a generative model for the matrix $\mathbf{Y}$ . The proposed model will not only capture the rank deficient nature of $\mathbf{Y}$ [31] but also the temporal correlation between successive columns of $\mathbf{Y}$ [32]. Recall that the standard low-rank parametrization of the full matrix $\mathbf{Y}$ takes the form $\mathbf{Y}=\mathbf{A}\mathbf{B}$ where $\mathbf{A}\in\mathbb{R}^{m\times r}$ and $\mathbf{B}\in\mathbb{R}^{r\times t}$ . Classical non-negative matrix completion approaches seek to obtain such a factorization. In such algorithms, the choice of $r$ is critical to avoiding underfitting or overfitting.

Within the Bayesian setting however, the measurements are modeled as arising from a distribution with unknown hyper-parameters, while various components or parameters are assigned different prior distributions. The Bayesian framework allows the use of ARD, wherein associating appropriate priors to the model parameters leads to pruning of the redundant features [31]. This work uses pdfs from the exponential family that allow for tractable forms of the posterior pdf but are also flexible enough to adequately model the data.

Specifically, the entries of $\mathbf{Y}$ are generated as

[TABLE]

for all $\tau\geq 1$ , where $\mathbf{A}\in\mathbb{R}^{m\times r}$ , $\mathbf{B}\in\mathbb{R}^{r\times t}$ , and $\beta\in\mathbb{R}_{++}$ are the (hidden) problem parameters. Unlike the deterministic setting however, the rank hyper-parameter $r$ is not critical to the imputation or prediction accuracy, but is only required to chosen according to computational considerations. The temporal evolution of the entries of $\mathbf{Y}$ is modeled by making the columns of $\mathbf{B}$ adhere to the following first order autoregressive model:

[TABLE]

for $\tau\geq 2$ , where $\mathbf{J}\in\mathbb{R}^{r\times r}$ is again a problem parameter. Here, $\mathbf{J}$ captures the temporal structure of the underlying subspace, and is learned from the data itself. The scaling ambiguity present in matrix factorization allows the transition matrix $\mathbf{J}$ to capture both slow and fast variations in $\mathbf{b}_{\tau}$ without the need to explicitly model the state noise variance. It follows from (2) that the conditional pdf of $\mathbf{b}_{\tau}$ given $\mathbf{J}$ is given by

[TABLE]

Observe that the model complexity depends on the rank $r$ , which is also the number of columns in $\mathbf{A}$ and $\mathbf{J}$ . In order to ensure the value of $r$ is learned in a data-driven fashion, the columns of $\mathbf{A}$ and $\mathbf{J}$ are assigned multivariate Gaussian priors with column-specific precisions, i.e.,

[TABLE]

where the precisions $\bm{\gamma}$ and $\bm{\upsilon}$ are problem parameters. It can be seen that if any of $\gamma_{i}$ or $\upsilon_{i}$ are large, the corresponding columns will be close to zero and consequently irrelevant. Indeed, the priors in (4)-(5) aid in automatic relevance determination since the subsequent optimization process may drive some of the precisions to infinity, yielding a low-rank factorization.

Finally, the three precision variables are selected to have have non-informative Jeffrey’s priors

[TABLE]

for $1\leq i\leq r$ . Let $\mathbf{y}_{\Omega}$ denote the collection of measurements $\{y_{i\tau}\}_{i\in\Omega_{\tau},\tau=1}^{t}$ . Collecting the hidden variables into $\mathcal{H}:=\{\mathbf{A},\mathbf{B},\mathbf{J},\beta,\bm{\gamma},\bm{\upsilon}\}$ , the joint distribution of $\{\mathbf{y}_{\Omega},\mathcal{H}\}$ can be written as

[TABLE]

The full hierarchical Bayesian model adopted here is summarized in Fig. 2(a).

II-B Variational Bayesian Inference

Having specified the generative model for the data, the goal is to determine the posterior distribution $p(\mathcal{H}|\mathbf{y}_{\Omega})$ , which would yield the corresponding point estimates and can be used for imputation and prediction tasks. However, exact full Bayesian inference is well-known to be intractable. Instead, we utilize the mean-field approximation, wherein the posterior distribution factorizes as:

[TABLE]

In other words, the posterior is now restricted to a family of distributions that adhere to (8). The factors $q_{\mathbf{A}}$ , $q_{\mathbf{B}}$ , $q_{\mathbf{J}}$ , $q_{\upsilon}$ , $q_{\beta}$ , and $q_{\gamma}$ can be determined by minimizing the Kullback–Leibler divergence of $p(\mathcal{H}|\mathbf{y}_{\Omega})$ from $q(\mathcal{H})$ , usually via an alternating minimization approach [33]. Indeed, thanks to the choice of conjugate priors for the parameters, it can be shown that the individual factors in (8) take the following forms [20]:

[TABLE]

where, $\bm{\mu}^{\mathbf{B}}\in\mathbb{R}^{rt}$ , $\bm{\Xi}^{\mathbf{B}}\in\mathbb{R}^{rt\times rt}$ , $\bm{\mu}^{\mathbf{A}}_{i}\in\mathbb{R}^{r}$ , $\bm{\Xi}^{\mathbf{A}}_{i}\in\mathbb{R}^{r\times r}$ , $\bm{\mu}^{\mathbf{J}}_{i}\in\mathbb{R}^{r}$ , $\bm{\Xi}^{\mathbf{J}}_{i}\in\mathbb{R}^{r\times r}$ , and $a^{\beta}$ , $b^{\beta}$ , $a^{\gamma}_{i}$ , $b^{\gamma}_{i}$ , $a^{\upsilon}_{i}$ , $b^{\upsilon}_{i}\in\mathbb{R}_{++}$ . Consequently, each iteration of alternating optimization simply involves updating the variables $\{\bm{\mu}^{\mathbf{B}},\bm{\Xi}^{\mathbf{B}},\{\bm{\mu}^{\mathbf{A}}_{i}\},\{\bm{\Xi}^{\mathbf{A}}_{i}\},\{\bm{\mu}^{\mathbf{J}}_{i}\},\{\bm{\Xi}^{\mathbf{J}}_{i}\}$ , $a^{\beta},b^{\beta}$ , $\{a^{\gamma}_{i}\},\{b^{\gamma}_{i}\},\{a^{\upsilon}_{i}\},\{b^{\upsilon}_{i}\}\}$ in a cyclic manner.

In the present case, not all variables need to be updated explicitly and the updates may be written in a compact form. Let us denote $\omega_{\tau}:=\lvert\Omega_{\tau}\rvert$ and let $\omega:=\sum_{\tau}\omega_{\tau}$ be the total number of observations made. Then, the updates for hyperparameters $\{\bm{\upsilon},\bm{\gamma}\}$ take the following form

[TABLE]

Subsequently, let $\hat{\bm{\upsilon}}$ and $\hat{\bm{\gamma}}$ be the vectors that collect $\{\hat{\upsilon}_{i}\}$ and $\{\hat{\gamma}_{i}\}$ , respectively. Since $\mathbf{b}_{\tau}$ denotes the $\tau$ -th column of $\mathbf{B}^{T}$ , its posterior distribution may be written as $q_{\mathbf{b}_{\tau}}(\mathbf{b}_{\tau})=\mathcal{N}(\mathbf{b}_{\tau}\mid\bm{\mu}^{\mathbf{B}}_{\tau},\bm{\Xi}^{\mathbf{B}}_{\tau})$ , where $\bm{\mu}^{\mathbf{B}}_{\tau}$ and $\bm{\Xi}^{\mathbf{B}}_{\tau}$ comprise of the corresponding elements of $\bm{\mu}^{\mathbf{B}}$ and $\bm{\Xi}^{\mathbf{B}}$ , respectively. Also define the posterior covariance matrices

[TABLE]

Therefore, the update for $\hat{\beta}$ becomes

[TABLE]

Next, the updates for the factors $\mathbf{J}$ and $\mathbf{A}$ take the following form

[TABLE]

where $\Omega^{\prime}_{i}:=\{\tau\mid i\in\Omega_{\tau}\}$ . Observe from the updates that the rows of $\mathbf{J}$ are independent identically distributed under the mean field approximation. The update for $\bm{\mu}^{\mathbf{B}}$ can be written as

[TABLE]

Finally, $[\bm{\Xi}^{\mathbf{B}}]^{-1}$ a block-tridiagonal matrix. Defining $\hat{\mathbf{J}}:=\mathbb{E}[\mathbf{J}\mid\mathbf{y}_{\Omega}]$ as the matrix whose $i$ -row is given by $(\bm{\mu}^{\mathbf{J}}_{i})^{T}$ , $\bm{\Sigma}^{\mathbf{A}}_{(\tau)}=\sum_{i\in\Omega_{\tau}^{\prime}}\bm{\Sigma}^{\mathbf{A}}_{i}$ , and $\bm{\Sigma}^{\mathbf{J}}:=\sum_{i=1}^{r}\bm{\Sigma}^{\mathbf{J}}_{i}$ , the updates take the form:

[TABLE]

It is remarked that although the $rt\times rt$ matrix $[\bm{\Xi}^{\mathbf{B}}]^{-1}$ is block-tridiagonal, the matrix $\bm{\Xi}^{\mathbf{B}}$ is dense, and direct inversion would be prohibitively costly. Moreover, the classical Rauch-Tung-Striebel (RTS) smoother cannot be directly applied since evaluating the conditional expectations under $q(\mathbf{B})$ is difficult and not amenable to the Matrix Inversion Lemma [34]. Interestingly, observe that the updates in (14) and (15) depend only on diagonal and super-diagonal blocks of $\bm{\Xi}^{\mathbf{B}}$ , namely $\bm{\Xi}^{\mathbf{B}}_{\tau,\tau}$ and $\bm{\Xi}^{\mathbf{B}}_{\tau,\tau-1}$ , respectively. The next subsection details a low-complexity algorithm for carrying out the updates for these blocks as well as for $\bm{\mu}^{\mathbf{B}}$ .

II-C Low-complexity updates via LDL-decomposition

Thanks to the block-tridiagonal structure of $[\bm{\Xi}^{\mathbf{B}}]^{-1}$ , it is possible to use the LDL decomposition to carry out the updates in an efficient manner. Decomposing $[\bm{\Xi}^{\mathbf{B}}]^{-1}=\mathbf{L}\mathbf{D}\mathbf{L}^{T}$ , the key idea is that left multiplication with $\bm{\Xi}^{\mathbf{B}}$ is equivalent to left multiplication with $\mathbf{L}^{-T}\mathbf{D}^{-1}\mathbf{L}^{-1}$ . Towards this end, we utilize the algorithm from [20], that comprises of two phases: the forward pass that carries out the multiplication with $\mathbf{D}^{-1}\mathbf{L}^{-1}$ and the backward pass that implements the multiplication with $\mathbf{L}^{-T}$ . Let us define for $2\leq\tau\leq t$ ,

[TABLE]

The forward pass outputs intermediate variables $\breve{\bm{\Xi}}^{\mathbf{B}}_{\tau,\tau}$ , $\breve{\bm{\Xi}}^{\mathbf{B}}_{\tau,\tau+1}$ , and $\breve{\bm{\mu}}_{\tau}$ , that are subsequently used in the backward pass. The updates take the following form:

Initialize $\hat{\bm{\Xi}}^{\mathbf{B}}_{1,1}=\bm{\Lambda}_{1}$ and $\hat{\bm{\mu}}^{\mathbf{B}}_{1}=\bm{\mu}_{1}+\hat{\beta}\sum_{i\in\Omega_{\tau}}y_{i\tau}\bm{\Lambda}_{1}\bm{\mu}^{\mathbf{A}}_{i}$ 2. 2.

For $\tau=1,\ldots,t-1$

[TABLE] 3. 3.

For $\tau=t-1,\ldots,1$

[TABLE] 4. 4.

Output $\{\bm{\Xi}^{\mathbf{B}}_{\tau,\tau+1},\bm{\Xi}^{\mathbf{B}}_{\tau,\tau},\bm{\mu}^{\mathbf{B}}_{\tau}\}_{\tau=2}^{t}$

Note that while $\bm{\Xi}^{\mathbf{B}}_{i,j}\neq 0$ for $|i-j|>1$ , these blocks are neither calculated in the forward and backward passes nor required in any of the variational updates.

Finally, the predictive distribution $p(y_{i\tau}\mid\mathbf{y}_{\Omega})$ for $\tau\notin\Omega_{i}$ or $\tau\geq t+1$ is still not tractable in the present case. Instead, we simply use point estimates for estimating the missing entries. Specifically, for $\tau\notin\Omega_{i}$ , the missing entries are imputed as

[TABLE]

Likewise for $\tau\geq t+1$ , the prediction becomes

[TABLE]

It can be seen that as compared to the updates in (16)-(17) that incur a complexity of $\mathcal{O}(t^{3})$ , the complexity incurred due to (20) is only $\mathcal{O}(t)$ . Overall, the different parameters are updated cyclically until convergence for each $t=1,2,\ldots$ .

II-D EM Baysian Subspace Filtering

Different from the variational Bayesian framework used here, the EM algorithm treats $\mathcal{H}_{h}:=\{\mathbf{A},\mathbf{B},\mathbf{J}\}$ as hidden variables (with posterior pdf $q_{h}(\mathcal{H}_{h}):=q_{\mathbf{B}}(\mathbf{B})q_{\mathbf{A}}(\mathbf{A})q_{\mathbf{J}}(\mathbf{J})$ ) and uses maximum a posteriori (MAP) estimates for the precision variables $\mathcal{H}_{p}:=\{\bm{\upsilon},\bm{\gamma},\beta\}$ . Consequently, the EM algorithm for Bayesian subspace tracking starts with an initial estimate $\mathcal{H}_{p}^{(0)}$ and uses the following updates at iteration $\iota\geq 1$ ,

•

E-step: evaluate

[TABLE]

•

M-step: maximize

[TABLE]

Interestingly, the updates resulting from the E-step take the same form as those in (15) and (20). On the other hand, the updates obtained from solving the M-step take the slightly different form:

[TABLE]

The slight differences arise due to the difference between the mean and mode of the Gamma distribution. Specifically, for $p(x)=$ Ga $(x|a,b)$ , it holds that $\mathbb{E}[X]=a/b$ while $\max_{x}\text{Ga}(x|a,b)=\frac{a-1}{b}$ .

II-D1 Remarks on the Convergence of VBSF

The VB framework used in the present work is a special case of a more general mean field approximation approach. The convergence of the VB algorithm is well-known; see e.g. [35], [36]. Intuitively, the variational approximation renders the evidence lower bound convex in individual factors, and thus amenable to coordinate ascent iterations. Since the lower bound is also differentiable with respect to each factor, the coordinate ascent iterations converge to a stationary point; see [37] for a more general result. However, convergence to the global optimum is not guaranteed.

II-E Fixed-lag tracking

Algorithm 1 can be viewed as an offline algorithm that must be run for every $t$ . In practical settings, it may be impractical to remember and process the entire history of measurements at each $t$ . Moreover, given data at time $t$ , estimates may only be required for entries at time $t-\Delta$ for some $\Delta<h$ . Towards this end, we consider a sliding window of measurements. Since $\mathbf{A}_{t}$ and $\mathbf{J}_{t}$ may be seen as transition matrices for the latent states and between latent state and observations, we initialize the next sliding-window with inferred approximate distributions on the transition matrices of the current window. For instance, within the context of traffic density prediction, the inferred approximate distribution for a day may be used as a prior for the coming days. That is, the distributions for $\mathbf{A},\,\mathbf{B},$ and $\mathbf{J}$ for a day and sliding window can be initialized with the approximate distributions obtained from the previous month’s data.

III Robust Variational Bayesian Subspace Filtering

In this section we consider the robust version of the variational Bayesian subspace filtering problem in Sec. II. Within this context, in addition to the missing entries in $\mathbf{Y}$ , some entries of $\mathbf{Y}$ are also contaminated with outliers. Unlike the missing entries however, the location of these outliers is not known. These entries arise due to sensor malfunctions, communication errors, and impulse noise. The robust subspace filtering problem is more difficult as the removal of such outliers entails estimating their magnitudes as well as locations.

Within the deterministic robust PCA framework, the matrix is modeled as taking the form $\mathbf{Y}=\mathbf{A}\mathbf{B}+\mathbf{E}$ where $\mathbf{A}\in\mathbb{R}^{m\times r}$ , $\mathbf{B}\in\mathbb{R}^{r\times t}$ are low-rank matrices as before. Additionally, we also need to estimate the sparse outlier matrix $\mathbf{E}\in\mathbb{R}^{m\times t}$ . As before, both $r$ and the level of sparsity in $\mathbf{E}$ are tuning parameters that must generally be carefully selected.

Here, we put forth the variational Bayesian subspace filtering algorithm that makes use of ARD priors to prune the redundant features. Consider the measurement matrix $\mathbf{Y}$ , whose entries are generated from the following pdf:

[TABLE]

for all $\tau\geq 1$ , and apart from the matrices $\mathbf{A}$ and $\mathbf{B}$ defined earlier, we also have $\{e_{i\tau}\}_{\tau=1,i\in\Omega_{\tau}}^{t}$ as the additional (hidden) problem parameter that captures the outliers. The generative models for $\mathbf{A}$ and $\mathbf{B}$ are the same as before, i.e.,

[TABLE]

for $\tau\geq 2$ , and $\bm{\gamma}$ and $\bm{\upsilon}$ are problem parameters. Additionally, we also associate an ARD prior to the outliers, i.e.,

[TABLE]

for $1\leq\tau\leq t$ , where the precision $\alpha_{i\tau}$ is a hidden variable, that would be driven to infinity whenever $e_{ij}$ is zero. It is remarked that the prior for $e_{i\tau}$ is only specified for the measurements, i.e., for $i\in\Omega_{\tau}$ and no predictions are made for the outliers. As before, we associate Jeffery’s prior to the precisions $\beta$ , $\{\gamma_{i}\}$ , $\{\upsilon_{i}\}$ , and $\{\alpha_{i\tau}\}$ .

[TABLE]

Let the vectors $\mathbf{e}\in\mathbb{R}^{\omega}$ and $\bm{\alpha}\in\mathbb{R}^{\omega}$ collect the variables $\{e_{i\tau}\}$ and $\{\alpha_{i\tau}\}$ , respectively. Likewise, defining all the hidden variables as $\mathcal{H}:=\{\mathbf{A},\mathbf{B},\mathbf{J},\mathbf{e},\beta,\bm{\gamma},\bm{\upsilon}\}$ , the joint distribution of $\{\mathbf{y}_{\Omega},\mathcal{H}\}$ can be written as

[TABLE]

The full hierarchical Bayesian model adopted here is summarized in figure 2(b).

III-A Variational Bayesian Inference

Utilizing the mean field approximation, the posterior distribution $p(\mathcal{H}\mid\mathbf{y}_{\Omega})$ factorizes as

[TABLE]

where the individual factors take the same forms as in (9), in addition to

[TABLE]

As before, the variational inference problem can be solved by updating the variables $\{\bm{\mu}^{\mathbf{B}},\bm{\Xi}^{\mathbf{B}},\{\bm{\mu}^{\mathbf{A}}_{i}\},\{\bm{\Xi}^{\mathbf{A}}_{i}\},\{\bm{\mu}^{\mathbf{J}}_{i}\},\{\bm{\Xi}^{\mathbf{J}}_{i}\}$ , $\{\mu_{e}^{i\tau}\},\{\Xi_{e}^{i\tau}\},a^{\beta},b^{\beta}$ , $\{a^{\gamma}_{i}\},\{b^{\gamma}_{i}\},\{a^{\upsilon}_{i}\},\{b^{\upsilon}_{i}\}\}$ in a cyclic manner. However, a more compact form for the updates may be derived as follows.

Specifically, the updates for $\{\hat{\upsilon}_{i},\hat{\gamma}_{i}\}$ remain the same as in (10). However, the update for $\hat{\beta}$ takes the form:

[TABLE]

Further, the parameters $\mu_{e}^{i\tau}$ and $\Xi_{e}^{i\tau}$ are updated as

[TABLE]

Proceeding similarly, the updates for $\{\bm{\mu}^{\mathbf{J}}_{i}\}$ , $\{\bm{\Xi}^{\mathbf{J}}_{i}\}$ , and $\{\bm{\Xi}^{\mathbf{A}}_{i}\}$ remain the same as in (15), while the updates for $\{\bm{\mu}^{\mathbf{A}}_{i}\}$ become:

[TABLE]

Finally, the updates for $\bm{\Xi}^{\mathbf{B}}$ remain the same but the updates of $\bm{\mu}^{\mathbf{B}}$ change. Specifically, the low complexity updates via LDL-decomposition remain mostly the same, except for the modified definition of $\mathbf{v}_{\tau}$ in (19) which now looks like

[TABLE]

The full robust subspace filtering algorithm is summarized in Algorithm 2. The predictions for $y_{i\tau}$ for $i\notin\Omega_{\tau}$ and for $\tau\geq t+1$ are obtained as in (21) and (22), respectively.

IV Results

We now detail the simulation results that evaluate the performance of the proposed VBSF method on variety of datasets to solve the:

Traffic Estimation and Prediction Problem 2. 2.

Electricity Load Estimation and Prediction Problem

IV-A Datasets

•

Traffic: for traffic estimation and prediction, we use the partial road network of the city of New Delhi with an area of 200 square kms consisting of $m=519$ edges (shown in Fig. 3). The road network can be modeled using a directed graph where each edge represents a road segment and nodes represent intersections. We collect the traffic data in the form of average speed of vehicles on a particular segment using the Google map APIs for nearly 3 months across 519 edges. Taking advantage of the slow varying nature of the speed in the network edges, we sample the traffic data at the rate of one sample every $t_{s}=15$ minutes. Note that our algorithm is agnostic of the sampling rate and would work for higher sampling rates as well. Unlike the complete data available from the API, real-world data may have missing entries. For instance, over the smaller area shown in Fig. 4, speed measurements may be available on the blue edges but not on the red ones. Finally, we evaluate our algorithm for the twin tasks of real time traffic estimation as well as future traffic prediction. We further evaluate our algorithm for robust traffic estimation , i.e., when we the traffic data is corrupted by outliers.

•

Electricity: similar to the traffic estimation and prediction task, we evaluate the VBSF algorithm on the electricity dataset [8], also used in [14] to evaluate the online matrix factorisation method. The data contains the hourly power consumption of 370 consumers, sampled every 15 min. The data is recorded from Jan. 1, 2012 to Jan. 1, 2015. Finally, we compare the VBSF method with various methods including the ones proposed and compared in [14].

In order to evaluate the VBSF algorithm, an incomplete data set is created by randomly sampling a fraction $p$ of the measurements. In our evaluations we consider three different cases with 75%, 50%, and 25% of missing data. We select previous $h$ = 30 time intervals for traffic and, the previous $h$ = 40 time intervals for electricity dataset. We compare our algorithm with other methods that potentially solve the current traffic estimation problem in the missing data scenario. The algorithms are

•

Low rank tensor completion (LRTC) [7].

•

Grassmannian Rank-One Update Subspace Estimation (GROUSE) [2].

•

Historic mean, which is simply the mean of edge speed values at a given time instance calculated using the historic data.

For the robust VBSF, we compare our algorithm with corresponding robust matrix completion frameworks.

•

Robust PCA via Outlier Pursuit (OP-RPCA) [13].

•

Robust Online Subspace Estimation and Tracking Algorithm (ROSETA) [10].

•

Grassmannian Robust Adaptive Subspace Tracking Algorithm (GRASTA) [9].

Further, for the electricity load prediction problem, we compare our algorithm with the results of [14] and the Collaborative Kalman Filter (CKF) [14].

IV-B Traffic Estimation and Prediction Problem

IV-B1 Performance Index

To measure the effectiveness of our algorithm and for the comparison with other relevant algorithms, we use mean relative error (MRE) as the performance index for the traffic data. For any time instance $\tau$ , the MRE denoted by $\text{MRE}_{\tau}$ is defined as:

[TABLE]

where $\mathbf{y}_{\tau,k}$ and $\hat{\mathbf{y}}_{\tau,k}$ are the ground truth and estimated data for $k^{th}$ day and $\tau^{th}$ time instance. Since the value for the known data (sampled entries) may be modified post estimation, we compute the MRE over the whole column for a given time instance. For calculating the overall accuracy of prediction for a day, we calculate MRE averged over $z$ days. The value of $z$ is taken as 50 for weekdays and 10 for the weekends.

IV-B2 Online Real Time Traffic Estimation

We now discuss simulation results for the current traffic estimation based on the current and past missing data using the VBSF algorithm. For a typical day, Fig. 5a shows the heatmap of the actual traffic data. The $x$ -axis of each heatmap represents time instances while the $y$ -axis represents the edges. Each pixel of a heatmap indicates the speed, where higher speed is represented by a lighter colour. Figures 5b, 5e and 5h are heatmaps with missing entries of varying degrees. The corresponding completed matrices using VBSF algorithm are shown in Figs. 5c, 5f, and 5i. Since the proposed VBSF is an online method that completes one column at a time given the incomplete data from previous columns, the corresponding heatmaps are also generated in an online fashion. In other words, in spirit of the online methodology, window of $h+1$ incomplete columns are used to complete the last column followed by moving the window by one column. Finally, all the completed columns form a matrix represented in these heatmaps. Unsurprisingly, the heatmaps show that the performance of VBSF improves as the size of missing data decreases.

The MRE values for real time traffic estimation using VBSF for weekends is shown in Fig. 6a and for weekdays in Fig. 6b. It is observed that the prediction error is higher during the peak traffic time (in the evening) vis-a-vis non-peak time intervals. This may be due to a greater variance in traffic during the peak time intervals. However, the difference between the MRE values for 50% and 25% missing data case is only about 0.15 in the worst case. Equivalently, the average error of estimation of speed is only around 2 km/hr during the peak-time when the average speed is 15 km/hr even with 75% missing data. Similarly, for non-peak hours, even though the observed speed are higher (around 30-40 km/hr), the MRE values for $p=50\%$ and $p=25\%$ is around 0.1, which in other words indicate an average error of 3-4 km/hr in the estimation of speed.

The performance of the proposed VBSF algorithm is compared with that of (LRTC) [7], (GROUSE) [2], and the historic mean. We used a grid search based approach for rank initialization in GROUSE and choose the rank that gives the least error. Table I presents the overall results. Further, Figs. 7a and 7b show the comparison of our algorithm for different percentage of missing traffic data. It is observed that for low missing rate of traffic data (25%), the LRTC (low rank tensor completion) [7] and VBSF obtain similar performance. But as the missing data increases, VBSF outperforms the LRTC method. Also, for all the cases, VBSF performs better than GROUSE. This difference in performance can be attributed to the fact that the VBSF framework captures the temporal dependencies as well as the latent factors in the traffic matrix better than other methods. In terms of running time, VBSF is faster than LRTC and is comparable to GROUSE as shown in Table II.

11footnotetext: Experiments are conducted to evaluate average running time per column on Matlab using PC: Intel i5-6200U CPU 2.4 GHz.

IV-B3 Future Traffic Prediction Problem

We also test the VBSF algorithm for speed prediction during the future time intervals assuming randomly sampled data from the current and previous time intervals. We predict traffic data up to 5 sampling intervals, that is, 15 to 75 minutes in future. We test our algorithm for 50% and 75% of the missing entries in the traffic data. The MRE plots for traffic prediction are shown in Figs. 6c, 6d, and 6e. The MRE error difference for 50% and 75% missing data is not significant. Similar to observations from the current traffic estimation simulations, it is seen that the error increases from 5:30 to 8:00 pm. As one would expect, the prediction accuracy decreases as we predict further in future. Interestingly, it is observed that the MRE for real-time traffic estimation with 75% missing entries case and for future prediction with 50% missing entries are comparable as can be seen in Fig. 6f.

The performance of the proposed VBSF algorithm is compared with that of LRTC in Table III. The VBSF performs better than the LRTC as shown in Fig. 7c. While predicting the speed for outlier edges (the edges which significantly deviate from their usual speed) VBSF performs better than LRTC as seen in Fig. 7d.

IV-B4 Robust Traffic Estimation

The GPS data that is collected using probe vehicles may be corrupted by noise and may often contain outliers which need to be removed before further processing is performed. To mitigate the performance degradation due to outliers, we employ the robust variational Bayesian subspace filtering (RVBSF) that models the presence of outliers in the data in the sparse outlier matrix ${\bf E}$ . To test the RVBSF algorithm, on a given day, we randomly sample a certain $p_{o}$ percentage of the already sampled traffic data $\mathbf{y}_{i,\tau}$ and replace these values with $o_{i,\tau}$ as follows:

[TABLE]

In other words, the outlier is created by adding a large value $c\,\mu_{t}$ to the maximum of $\mathbf{y}_{i,\tau-1}$ and $\mathbf{y}_{i,\tau+1}$ . Here, $\mu_{t}$ is the mean of observed entries at time $t$ and c is a scaling parameter. The RVBSF algorithm is then applied to solve the real time traffic estimation problem. The detected artificial outliers are those points residing in the matrix ${\bf E}$ .

The accuracy of outlier detection depends on the outlier value as shown in Fig. 8d. The value of $c$ for simulations is chosen from the set $[0.75,1,1.25,1.5,1.75]$ . We compare the robust VBSF (termed as RVBSF) with VBSF for two scenarios. First, when no outliers are added (VBSF), second, when outliers are present in the data but only VBSF was used (VBSF_with_outliers). Table IV summarises the overall performance of the RVBSF algorithm. Understandably, RVBSF improves over VBSF when outliers are present, but is still worse than the MRE of VBSF for the case when no outliers were present. For 25% missing entries, $p_{o}=5\%$ and $c=0.75$ , the plots in Fig. 8a illustrate the performance of the RVBSF algorithm. Similarly for 75% of missing entries, $p_{o}=2\%$ the results are shown in Fig. 8b. When $p_{o}=5\%$ and $c=0.75$ , we observe that RVBSF detects outliers reasonably well vis-a-vis VBSF_with_outliers. Similar observation holds when outlier values increase as shown in Fig. 8c and Fig. 8d.

The performance of the proposed RVBSF algorithm is compared with that of OP-RPCA[13] GRASTA[9] and ROSETA[10] in Table V. The RVBSF algorithm performs better than the subspace estimation and tracking algorithms. The difference in performance may be due to a better modeling of the temporal structure available in the data.

A possible limitation of the suggested robust traffic estimation framework is following. While there may be outliers present due to an erroneous speed estimation, there might be cases when the so called outlier value may actually be a real value. The current method may not be able to distinguish between such cases. Hence, a sudden drop in speed along an edge may be treated as an outlier and its possible impact on the traffic of nearby edges be be ignored by the model.

IV-C Electricity Load Prediction

We now discuss the performance of the VBSF algorithm on the electricity load data set [8]. Note that the electricity load data is also a time series data with the possibility of missing entries as well as temporal correlation between successive columns.

IV-C1 Performance Index

The performance of the VBSF method is compared with that of [14] using the metrics mean absolute error (MAE) and MRE, defined as:

[TABLE]

where $\mathbf{y}_{k}$ and $\hat{\mathbf{y}}_{k}$ are the ground truth and estimated data for $k^{th}$ column. We run the algorithm online on dates Jan. 1, 2012 to Jan. 1, 2015 resulting into 26,304 columns. In other words, the value of $z$ is 26,304 for our simulations.

IV-C2 Online Electricity Load Estimation and Prediction

We run our algorithm for electricity data estimation and prediction. The results for real-time prediction are noted in table VI. It is noted as the percentage of observed data $p$ increases, the real-time prediction accuracy improves.

Further, we predict the one-step ahead electricity load in Fig. 9. To analyze the performance of our algorithm we compare our results with OFMF and CKF [14]. The one-step ahead prediction performance of OFMF and CKF are provided in [14]. OFMF proposes a autoregressive model based optimization to predict the one-step ahead electricity load. We compare our three cases of $p$ with the results shown in OFMF. It can be seen that our algorithm performs better than the OFMF for electricity load dataset.

V Conclusion

This paper considers sequentially arriving multivariate data that resides in a time-varying low-dimensional subspace. The temporal evolution of the underlying low-rank subspace is characterized via a state-space model and low-complexity variational Bayesian subspace filtering algorithms are proposed for matrix completion and outlier removal tasks. Simulation experiments quantify that the suggested model can be deployed to estimate the missing traffic data with a reasonable accuracy even with a fraction of random traffic measurements in the network. A similar result is observed on applying the VBSF algorithm on the twin tasks of imputation and prediction on the electricity data-set. Extensive simulations on both the data sets demonstrate that the suggested model and the accompanying algorithms seem to capture the temporal evolution of the data well as compared to the current state-of-the-art matrix completion and the online subspace estimation algorithms.

Bibliography37

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Found. Comp. Math. , vol. 9, no. 6, pp. 717–772, 2009.
2[2] L. Balzano, R. Nowak, and B. Recht, “Online identification and tracking of subspaces from highly incomplete information,” in Proc. of IEEE Allerton , Sept. 2010, pp. 704–711.
3[3] S. D. Babacan, M. Luessi, R. Molina, and A. K. Katsaggelos, “Sparse Bayesian methods for low-rank matrix estimation,” IEEE Trans. Signal Process. , vol. 60, no. 8, pp. 3964–3977, 2012.
4[4] P. V. Giampouras, A. A. Rontogiannis, K. E. Themelis, and K. D. Koutroumbas, “Online sparse and low-rank subspace learning from incomplete data: A Bayesian view,” Signal Processing , vol. 137, pp. 199–212, 2017.
5[5] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM , vol. 58, no. 3, pp. 11:1–11:37, 2011.
6[6] X. Ding, L. He, and L. Carin, “Bayesian robust principal component analysis,” IEEE Trans. Image Process. , vol. 20, no. 12, pp. 3419–3430, 2011.
7[7] J. Liu, P. Musialski, P. Wonka, and J. Ye, “Tensor completion for estimating missing values in visual data,” IEEE Trans. Pattern Anal. Mach. Intell , vol. 35, no. 1, pp. 208–220, 2013.
8[8] UCI, “Electricity load diagrams 2011-2014,” 2014. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Electricity Load Diagrams 20112014/