A Robust Time Series Model with Outliers and Missing Entries

Triet M. Le

arXiv:1901.10589·stat.AP·January 31, 2019

TL;DR

This paper introduces a robust time series modeling approach that effectively handles outliers and missing data by using sparsity constraints and uncertainty modeling, validated through simulations.

Contribution

It proposes a novel robust modeling framework for univariate time series with outliers and missing entries, incorporating sparsity and uncertainty constraints.

Findings

01

Validated with simulated results showing robustness

02

Effectively handles outliers and missing data

03

Reduces active coefficients via sparsity constraints

Abstract

This paper studies the problem of robustly learning the correlation function for a univariate time series with the presence of noise, outliers and missing entries. The outliers or anomalies considered here are sparse and rare events that deviate from normality which is depicted by a correlation function and an uncertainty condition. This general formulation is applied to univariate time series of event counts (or non-negative time series) where the correlation is a log-linear function with the uncertainty condition following the Poisson distribution. Approximations to the sparsity constraint, such as $ℓ^{r}, 0 < r \leq 1$ , are used to obtain robustness in the presence of outliers. The $ℓ^{r}$ constraint is also applied to the correlation function to reduce the number of active coefficients. This task also helps bypassing the model selection procedure. Simulated results are presented to…

Equations

$$P(y_{i}\mid u_{i},\theta_{i})$$

$$u_{i}=f\left(\{u_{j}\}_{j<i},\{y_{j}\}_{j<i}\right).$$

$$\min_{a_{0},a,b,y}J(a_{0},a,b,y),\quad\text{where}\quad J(a_{0},a,b,y)=\sum_{i=1}^{N}\left[u_{i}-y_{i}\log(u_{i})+\log(\Gamma(y_{i}+1))\right]+\lambda\sum_{i\in D}|y_{i}-\tilde{y}_{i}|^{r}+\mu\left(\sum_{k=1}^{p}|a_{k}|^{s}+\sum_{k=1}^{q}|b_{k}|^{s}\right),$$

$$u_{i}=f\left(\{y_{j}\}_{j\leq i-1},\{u_{j}\}_{j\leq i-1}\right)=\max\left(\exp(\log(u_{i}+1))-1,0\right),$$

$$\log(u_{i}+1)=a_{0}+\sum_{k=1}^{p}a_{k}\log(y_{i-k}+1),$$

$$a_{0}$$

$$a=(a_{1},a_{2},a_{3},a_{4},a_{5})$$

$$y_{i}=\phi_{0}+\sum_{k=1}^{p}\phi_{k}y_{i-k}+\sum_{k=1}^{q}\psi_{k}e_{i-k}+e_{i},$$

$$u_{i}:=\phi_{0}+\sum_{k=1}^{p}\phi_{k}y_{i-k}+\sum_{k=1}^{q}\psi_{k}e_{i-k},$$

$$P(y_{i}\mid u_{i},\theta_{i}=\sigma)=P(y_{i}-u_{i}=e_{i})=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{|y_{i}-u_{i}|^{2}}{2\sigma^{2}}}.$$

$$u_{i}=f\left(\{u_{j}\}_{j<i},\{y_{j}\}_{j<i}\right)=\phi_{0}+\sum_{k=1}^{p}a_{k}y_{i-k}+\sum_{k=1}^{q}b_{k}u_{i-k},$$

$$P(y_{i}\mid u_{i})=\frac{u_{i}^{y_{i}}e^{-u_{i}}}{y_{i}!},$$

$$u_{i}=f\left(\{u_{j}\}_{j<i},\{y_{j}\}_{j<i}\right)=a_{0}+\sum_{k=1}^{p}a_{k}y_{i-k}+\sum_{k=1}^{q}b_{k}u_{i-k},$$

$$\log(u_{i}+1)=a_{0}+\sum_{k=1}^{p}a_{k}\log(y_{i-k}+1)+\sum_{k=1}^{q}b_{k}\log(u_{i-k}+1).$$

$$(f^{*},\theta^{*})=\mathrm{argmax}_{f,\theta}\left\{L(f,\theta)=P(y,f,\theta)\right\}.$$

$$P(y,f,\theta)=\prod_{i=1}^{N}P(y_{i}\mid u_{i},\theta_{i})P(f)P(\theta),$$

$$(f^{*},\theta^{*})=\mathrm{argmin}_{f,\theta}\left\{-\sum_{i=1}^{N}\log(P(y_{i}\mid u_{i},\theta_{i}))-\log(P(f)P(\theta))\right\}.$$

$$\hat{y}_{N+1}:=\mathrm{argmax}_{y_{N+1}}P(y_{N+1}\mid u_{N+1},\theta_{N+1}).$$

$$\min_{y,f,\theta}\left\{-\sum_{i=1}^{N}\log(P(y_{i}\mid u_{i},\theta_{i}))-\log(P(f))-\log(P(\theta))\right\},$$

$$\begin{split}\min_{f,y,\theta}\Big\{J(y,f,\theta)&=-\sum_{i=1}^{N}\log(P(y_{i}|u_{i},\theta_{i}))-\log(P(f))-\log(P(\theta))\\ &+\lambda\sum_{i\in D}|y_{i}-\tilde{y}_{i}|^{r}\Big\}.\end{split}$$

$$y_{k}^{*}=\mathrm{argmin}_{y_{k}}\left[-\sum_{i=1}^{N}\log(P(y_{i}\mid u_{i},\theta_{i}))\right].$$

$$y_{N}^{*}=\mathrm{argmin}_{y_{N}}\left[-\log(P(y_{N}\mid u_{N},\theta_{N}))\right]=\mathrm{argmax}_{y_{N}}P(y_{N}\mid u_{N},\theta_{N}).$$

$$y_{k}^{*}$$

$$\inf_{t\in\mathbb{R}}\left\{E_{r}(t)=\mu|t|^{r}+\frac{1}{2}|t-t^{\prime}|^{2}\right\},$$

$$\max_{y,f,\theta}P(\tilde{y}_{D},y,f,\theta),$$

$$y_{i}=\sum_{j=1}^{p}x_{ij}a_{j}+e_{i}.$$

$$\min_{a\in\mathbb{R}^{p}}\left\{\sum_{i=1}^{N}\Big|y_{i}-\sum_{j=1}^{p}x_{ij}a_{j}\Big|^{2}=\sum_{i=1}^{N}|e_{i}|^{2}\right\}.$$

$$\kappa_{a\to u}=\|X\|\frac{\|a\|}{\|Xa\|}\leq\|X\|\|X^{+}\|$$

$$\min_{a}\left\{\sum_{i=1}^{N}\rho\Big(y_{i}-\sum_{j=1}^{p}x_{ij}a_{j}\Big)=\sum_{i=1}^{N}\rho(e_{i})\right\},$$

$$\rho(x)=\begin{cases}\frac{1}{2}|x|^{2}&\text{if }|x|<c\\ c|x|-\frac{1}{2}c^{2}&\text{if }|x|\geq c\end{cases}$$

$$P(e_{i})=f_{c}(e_{i})=\left[C(c)e^{\frac{1}{2}c^{2}}\right]e^{-c|e_{i}|},$$

Taxonomy

Topics: Advanced Statistical Methods and Models · Statistical Methods and Inference · Forecasting Techniques and Applications

Full text

A Robust Time Series Model with Outliers and Missing Entries

Triet M. Le NGA Research, The National Geospatial-Intelligence Agency, Springfield, VA. Email: [email protected].

Abstract

This paper studies the problem of robustly learning the correlation function for a univariate time series in the presence of noise, outliers and missing entries. The outliers or anomalies considered here are sparse and rare events that deviate from normality, which is described by a correlation function and an uncertainty condition. This general formulation is applied to univariate time series of event counts (or non-negative time series) where the correlation is a log-linear function with the uncertainty condition following the Poisson distribution. Approximations to the sparsity constraint, such as $\ell^{r},0<r\leq 1$, are used to obtain robustness in the presence of outliers. The $\ell^{r}$ constraint is also applied to the correlation function to reduce the number of active coefficients, which also helps bypass the model selection procedure. Simulated results are presented to validate the model.

1 Introduction

Forecasting (anticipating) future events or activities is an important problem in data science. A common task for a forecaster is to predict normal future events using past and current observations and to raise an alert when the observed number of events significantly deviates from the predicted value. If events and activities are purely random, there is no hope of making any meaningful prediction. However, if events are correlated in the sense that events at time $t$ depend on events prior to $t$ and the correlation function that describes this dependency persists throughout the observed data, then this correlation function can be used to predict future events based on past and current observations. In many real datasets, the observed data is incomplete and is often contaminated with outliers and random noise. An important task is then to robustly learn the correlation function that describes the underlying normal activities and patterns from the observed data. We start with the following setup.

Let $\{t_{0},\cdots,t_{N}\}$ be a uniform discretization of some time interval of interest which we assume to be $[0,T)$ . Let $y_{i}$ be the number of observed events and $u_{i}$ be the expected number of events that occur in the time interval $[t_{i-1},t_{i})$ . Observed events are determined by the conditional probability

$$P(y_{i}\mid u_{i},\theta_{i}),\tag{1}$$

where $\theta_{i}$ is some auxiliary variable, representing for instance the variance, that $y_{i}$ may depend on. In a univariate case, the goal is to learn how $u_{i}$ depends on prior $u_{j}$ and $y_{j}$ , for $j<i$ . In other words, one is interested in finding the function $f$ such that

$$u_{i}=f\left(\{u_{j}\}_{j<i},\{y_{j}\}_{j<i}\right).\tag{2}$$

In this paper, we consider the case where only a partial series $\widetilde{y}_{D}:=\{\widetilde{y}_{i}\}_{i\in D}$ for some $D\subset\{1,\cdots,N\}$ is observed and it may contain outliers and anomalies. To tackle this problem, the following minimization problem is proposed

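$$\begin{split}&\min_{a_{0},a,b,y}J(a_{0},a,b,y),\quad\text{where}\\ J(a_{0},a,b,y)&=\sum_{i=1}^{N}\left[u_{i}-y_{i}\log(u_{i})+\log(\Gamma(y_{i}+1))\right]+\lambda\sum_{i\in D}|y_{i}-\tilde{y}_{i}|^{r}+\mu\left(\sum_{k=1}^{p}|a_{k}|^{s}+\sum_{k=1}^{q}|b_{k}|^{s}\right),\end{split}\tag{3}$$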

for some $0<r,s\leq 1$ , and $\mu,\lambda>0$ , to jointly learn $f,\theta:=\{\theta_{i}\}_{i=1}^{N}$ and the complete series $y:=\{y_{i}\}_{i=1}^{N}$ via imputation.

• Here, we consider parametrically the correlation function $f$ defined as

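$$u_{i}=f\left(\{y_{j}\}_{j\leq i-1},\{u_{j}\}_{j\leq i-1}\right)=\max\left(\exp(\log(u_{i}+1))-1,0\right),$$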

where $\log(u_{i}+1)$ is given by

[TABLE]

for some $p,q\geq 0$ .

• $\sum_{i\in D}|y_{i}-\tilde{y}_{i}|^{r}$ measures the sparsity of the sequence $\{y_{i}-\widetilde{y}_{i}\}_{i\in D}$ representing anomalies. Similarly, $\sum_{k=1}^{p}|a_{k}|^{s}$ and $\sum_{k=1}^{q}|b_{k}|^{s}$ impose sparsity on $a=\{a_{k}\}_{k=1}^{p}$ and $b=\{b_{k}\}_{k=1}^{q}$ and avoid a separate model selection step (e.g. via AIC or BIC). Both of these constraints are important for the recovery of $f$.

With the presence of missing entries and outliers, Figure 1 shows the ability of the model (3) to recover $f$ having

[TABLE]

In what follows, we provide motivation for the proposed model. Many well-known time series models can be described by (1) and (2). For instance, in an $ARMA(p,q)$ model [5, 18], each $y_{i}$ is defined as

$$y_{i}=\phi_{0}+\sum_{k=1}^{p}\phi_{k}y_{i-k}+\sum_{k=1}^{q}\psi_{k}e_{i-k}+e_{i},\tag{5}$$

where $e_{i}$ follows $N(0,\sigma^{2})$, a Gaussian distribution with mean $0$ and variance $\sigma^{2}$. Let

$$u_{i}:=\phi_{0}+\sum_{k=1}^{p}\phi_{k}y_{i-k}+\sum_{k=1}^{q}\psi_{k}e_{i-k},\tag{6}$$

then equation (5) implies $y_{i}=u_{i}+e_{i}$ . This implies

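$$P(y_{i}\mid u_{i},\theta_{i}=\sigma)=P(y_{i}-u_{i}=e_{i})=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{|y_{i}-u_{i}|^{2}}{2\sigma^{2}}}.\tag{7}$$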

Substituting $y_{i-k}-u_{i-k}$ for $e_{i-k}$ in (6), one obtains

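$$u_{i}=f\left(\{u_{j}\}_{j<i},\{y_{j}\}_{j<i}\right)=\phi_{0}+\sum_{k=1}^{p}a_{k}y_{i-k}+\sum_{k=1}^{q}b_{k}u_{i-k},\tag{8}$$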

for some real-valued $a_{k}$ and $b_{k}$ and some new positive integers $p$ and $q$ . Here we assume zero boundary conditions, that is $y_{i-k}$ and $u_{i-k}$ are identically zero whenever $i-k\leq 0$ . Clearly, other types of boundary conditions such as reflection or Neumann can be used. Thus (5) can be transformed into (7) and (8), which are (1) and (2) respectively.
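The substitution of $y_{i-k}-u_{i-k}$ for $e_{i-k}$ can be checked numerically. The sketch below is illustrative only: it assumes $p=q$, zero boundary conditions, and the identification $a_{k}=\phi_{k}+\psi_{k}$, $b_{k}=-\psi_{k}$ that follows from the substitution. The function names `simulate_arma` and `recursive_means` and all parameter values are hypothetical and not taken from the paper.

```python
import random

def simulate_arma(n, phi0, phi, psi, sigma, seed=0):
    """Simulate y_i = u_i + e_i with u_i built from past y and past noise e.

    Index 0 is a padding slot holding the zero boundary condition.
    """
    rng = random.Random(seed)
    y = [0.0] * (n + 1)
    u = [0.0] * (n + 1)
    e = [0.0] * (n + 1)
    for i in range(1, n + 1):
        u[i] = phi0
        u[i] += sum(c * y[i - k] for k, c in enumerate(phi, 1) if i - k > 0)
        u[i] += sum(c * e[i - k] for k, c in enumerate(psi, 1) if i - k > 0)
        e[i] = rng.gauss(0.0, sigma)
        y[i] = u[i] + e[i]
    return y, u

def recursive_means(y, phi0, a, b):
    """Recompute u_i from past y and past u only, without access to the noise."""
    n = len(y) - 1
    u = [0.0] * (n + 1)
    for i in range(1, n + 1):
        u[i] = phi0
        u[i] += sum(c * y[i - k] for k, c in enumerate(a, 1) if i - k > 0)
        u[i] += sum(c * u[i - k] for k, c in enumerate(b, 1) if i - k > 0)
    return u

# Hypothetical ARMA(2,2) coefficients chosen only for illustration.
phi, psi = [0.5, -0.3], [0.4, -0.2]
y, u = simulate_arma(200, 1.0, phi, psi, sigma=1.0)
a = [f + s for f, s in zip(phi, psi)]  # a_k = phi_k + psi_k
b = [-s for s in psi]                  # b_k = -psi_k
u_rec = recursive_means(y, 1.0, a, b)
max_gap = max(abs(p - q) for p, q in zip(u, u_rec))
```

Under these assumptions the two mean sequences agree up to floating-point error, which is the sense in which the noise-driven form can be rewritten in terms of past $y$ and past $u$ alone.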

Another example is the Poisson linear autoregressive model (see [20, 25], among others). Assuming $y_{i}$ is a non-negative integer, then (1) is given by

$$P(y_{i}\mid u_{i})=\frac{u_{i}^{y_{i}}e^{-u_{i}}}{y_{i}!},$$

and (2) is given by

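$$u_{i}=f\left(\{u_{j}\}_{j<i},\{y_{j}\}_{j<i}\right)=a_{0}+\sum_{k=1}^{p}a_{k}y_{i-k}+\sum_{k=1}^{q}b_{k}u_{i-k},$$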

where $a_{0},a_{k},$ and $b_{k}$ are non-negative. This shows that the model can only detect zero or positive correlations. To overcome this drawback, the log-linear model is often used [40, 20, 25, 15] where $u_{i}$ is defined such that

$$\log(u_{i}+1)=a_{0}+\sum_{k=1}^{p}a_{k}\log(y_{i-k}+1)+\sum_{k=1}^{q}b_{k}\log(u_{i-k}+1).$$

Here $a_{0}$ , $a_{k}$ and $b_{k}$ are real-valued and therefore can represent negative correlations.
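To make the log-linear recursion concrete, here is a minimal simulation sketch. It assumes zero boundary conditions, clips $u_{i}$ at zero as in the definition of $f$ given earlier, and draws counts with a simple Knuth-style Poisson sampler. The function names and coefficient values are hypothetical; the negative coefficient is included only to show that the log-linear form permits it.

```python
import math
import random

def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    if rate <= 0.0:
        return 0
    limit = math.exp(-rate)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_log_linear(n, a0, a, b, seed=0):
    """Generate counts y and means u from the log-linear recursion."""
    rng = random.Random(seed)
    y, u = [], []
    for i in range(n):
        s = a0
        # Terms with i - k < 0 vanish under zero boundary conditions (log(0 + 1) = 0).
        s += sum(ak * math.log(y[i - k] + 1.0) for k, ak in enumerate(a, 1) if i - k >= 0)
        s += sum(bk * math.log(u[i - k] + 1.0) for k, bk in enumerate(b, 1) if i - k >= 0)
        u.append(max(math.exp(s) - 1.0, 0.0))
        y.append(sample_poisson(u[-1], rng))
    return y, u

# Hypothetical coefficients; a[1] < 0 encodes a negative correlation at lag 2.
y, u = simulate_log_linear(50, 1.0, [0.3, -0.2], [0.2])
```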

In the case of complete observations where we are given the series $y=\{y_{i}\}_{i=1}^{N}$ and some prior knowledge on the conditional probability condition (1), the task is then to learn the optimal $(f^{*},\theta^{*})$ that maximizes the likelihood function,

$$(f^{*},\theta^{*})=\mathrm{argmax}_{f,\theta}\left\{L(f,\theta)=P(y,f,\theta)\right\}.\tag{10}$$

It can be shown (see for instance [27]) that the joint probability in (10) is given by

$$P(y,f,\theta)=\prod_{i=1}^{N}P(y_{i}\mid u_{i},\theta_{i})P(f)P(\theta),\tag{11}$$

assuming $f$ and $\theta$ are independent. The maximization problem in (10) is equivalent to

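$$(f^{*},\theta^{*})=\mathrm{argmin}_{f,\theta}\left\{-\sum_{i=1}^{N}\log(P(y_{i}\mid u_{i},\theta_{i}))-\log(P(f)P(\theta))\right\}.\tag{12}$$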

With some knowledge about $\theta_{N+1}$ , the conditional probability $P(y_{N+1}|u_{N+1},\theta_{N+1})$ provides the distribution of $y_{N+1}$ . For a single value prediction $\hat{y}_{N+1}$ , we can solve

$$\hat{y}_{N+1}:=\mathrm{argmax}_{y_{N+1}}P(y_{N+1}\mid u_{N+1},\theta_{N+1}).\tag{13}$$

In the case where the conditional probability follows a Poisson distribution, the distribution of $y_{N+1}$ is completely determined by $u_{N+1}$ and does not depend on the auxiliary variable $\theta_{N+1}$.

In this paper, we consider the problem of learning parametrically the underlying correlation function $f$ and $\theta$ given a partially observed series $\widetilde{y}_{D}=\{\widetilde{y}_{i}\}_{i\in D}$ for some $D\subset\{1,\cdots,N\}$ which may be contaminated by outliers and anomalies. Given some prior knowledge about the uncertainty condition (1), the problem consists of: 1) extending $\widetilde{y}_{D}$ to the whole series $y$ (including $y_{i},i\in D^{c}$ ) via imputation such that $y_{i}\approx\widetilde{y}_{i}$ for all $i\in D$ and 2) using the complete series $y$ to learn $f$ . These two steps are done iteratively as they are inter-dependent. The interpretation of $y_{i}\approx\widetilde{y}_{i}$ is as follows:

1. Suppose $\widetilde{y}_{i}$ is normal, i.e. it can be described by $f,\theta$ and the uncertainty condition (1); then we would like to enforce $y_{i}=\widetilde{y}_{i}$.

2. On the other hand, if $\widetilde{y}_{i}$ is anomalous, then we allow $y_{i}\neq\widetilde{y}_{i}$ and let the model decide a normal value for $y_{i}$.

Moreover, outliers and anomalies are interpreted as rare events that are supported sparsely on $D$ . Based on this interpretation, we would like the difference series $\{y_{i}-\widetilde{y}_{i}\}_{i\in D}$ to be sparse. Thus this can be seen as solving the optimization problem:

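$$\min_{y,f,\theta}\left\{-\sum_{i=1}^{N}\log(P(y_{i}\mid u_{i},\theta_{i}))-\log(P(f))-\log(P(\theta))\right\},\tag{14}$$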

with the constraint that the partial series $\{y_{i}-\widetilde{y}_{i}\}_{i\in D}$ is sparse. Here, $P(f)$ and $P(\theta)$ are priors on $f$ and $\theta$ respectively. Sparsity is an essential ingredient in the theory of compressed sensing [7, 8, 13]. Approximating the sparsity constraint has been a subject of great importance as the exact sparsity problem is NP-hard. Here we consider sparsity approximations as proposed in [7, 8, 13, 10] by using $\ell^{r}$ for $0<r\leq 1$. Incorporating these sparsity approximations into the minimizing energy (14), we propose the following unconstrained variational problem

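$$\begin{split}\min_{f,y,\theta}\Big\{J(y,f,\theta)&=-\sum_{i=1}^{N}\log(P(y_{i}|u_{i},\theta_{i}))-\log(P(f))-\log(P(\theta))\\ &+\lambda\sum_{i\in D}|y_{i}-\tilde{y}_{i}|^{r}\Big\}.\end{split}\tag{15}$$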

Remark 1.

Predicting $y_{N+1}$ with the mean $u_{N+1}$ is optimal. However, for $k\notin D$ and $k<N$ , imputing $y_{k}$ with the mean $u_{k}$ is not always optimal. Indeed, suppose $u_{k}$ only depends on $p$ previous $y_{i}$ ’s, i.e. $u_{k}=f\left(\{y_{i}\}_{i=k-p}^{k-1}\right)$ for some $p\geq 1$ . Suppose also that both $f$ and $\theta$ are known with no outliers, that is $y_{i}=\tilde{y}_{i}$ for all $i\in D$ and $y_{k}$ is the only non-observed entry. The task is to compute the optimal $y_{k}^{*}$ by using (15). This amounts to solving

$$y_{k}^{*}=\mathrm{argmin}_{y_{k}}\left[-\sum_{i=1}^{N}\log(P(y_{i}\mid u_{i},\theta_{i}))\right].$$

If $k=N$ , then it is clear that only the last term in the above sum contains $y_{N}$ . This implies

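$$y_{N}^{*}=\mathrm{argmin}_{y_{N}}\left[-\log(P(y_{N}\mid u_{N},\theta_{N}))\right]=\mathrm{argmax}_{y_{N}}P(y_{N}\mid u_{N},\theta_{N}).$$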

On the other hand, if $1\leq k<N$ , then

[TABLE]

The latter case shows that $\mathrm{argmax}_{y_{k}}P(y_{k}|u_{k},\theta_{k})$ is not always an optimal value for $y_{k}^{*}$ .

The paper is organized as follows. Section 2 recalls the most relevant prior work. Section 3 describes the proposed Poisson log-linear model, in which the uncertainty condition (1) follows the Poisson distribution and the correlation function (2) is log-linear. One subproblem for solving (15) is to compute

$$\inf_{t\in\mathbb{R}}\left\{E_{r}(t)=\mu|t|^{r}+\frac{1}{2}|t-t^{\prime}|^{2}\right\},$$

for some fixed $t^{\prime}\in\mathbb{R}$, $\mu>0$ and $0<r\leq 1$. In [32], Nie et al. proposed a method for solving this problem by finding a zero of a strictly convex function. For completeness, we go over in Section 4 a similar method for computing the proximal operator for $E_{r}(t)$. Section 5 goes over an algorithm to compute a minimizer for (15). Section 6 shows numerical results on simulated data to validate the proposed model. In Appendix A we show that the above minimization problem (15) is related to

$$\max_{y,f,\theta}P(\tilde{y}_{D},y,f,\theta),$$

for some fixed $\lambda>0$ and $r\in(0,1]$ .

2 Prior Work

In [23], Huber considered the classical least squares problem of learning $p$ parameters $a_{1},\cdots,a_{p}$ from $N$ observations $y_{1},\cdots,y_{N}$ obeying the relation

$$y_{i}=\sum_{j=1}^{p}x_{ij}a_{j}+e_{i}.\tag{16}$$

Here $x_{ij}$ are known coefficients and $e_{i}\sim N(0,\sigma^{2})$ are i.i.d. Gaussian noise terms. For an autoregressive model, $x_{ij}=y_{i-j}$. Estimating the parameter $a=(a_{j})_{j=1}^{p}\in\mathbb{R}^{p}$ amounts to minimizing the sum of squares

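$$\min_{a\in\mathbb{R}^{p}}\left\{\sum_{i=1}^{N}\Big|y_{i}-\sum_{j=1}^{p}x_{ij}a_{j}\Big|^{2}=\sum_{i=1}^{N}|e_{i}|^{2}\right\}.\tag{17}$$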

Let $X=(x_{ij})_{N\times p}$ , $a=(a_{j})_{p\times 1}$ and $u=Xa$ , then the relative condition number measuring the sensitivity of $u$ with respect to perturbation of $a$ is [38]

$$\kappa_{a\to u}=\|X\|\frac{\|a\|}{\|Xa\|}\leq\|X\|\|X^{+}\|$$

where $\|\cdot\|$ is an arbitrary matrix norm and $X^{+}$ is the pseudoinverse of $X$, if it exists. Let $\sigma_{1}$ and $\sigma_{p}$ be the largest and smallest singular values of $X$ respectively; then with $\|\cdot\|=\|\cdot\|_{2}$ the upper bound is $\|X\|_{2}\|X^{+}\|_{2}=\frac{\sigma_{1}}{\sigma_{p}}$. If $\kappa_{a\rightarrow u}$ is large, a small deviation in $a$ can create a large deviation in the solution $u$, and hence it is important to obtain an accurate estimate for $a$. As noted in [23], outliers affect the accuracy of the estimates. Following Lecture 18 in [38], the condition number for measuring the sensitivity of $a$ with respect to perturbation of $y$ is $\kappa_{y\rightarrow a}=\kappa(X)/(\eta\cos(\theta))$, where $\eta=\|X\|\|a\|/\|Xa\|$ and $\theta=\cos^{-1}(\|u\|/\|y\|)=\cos^{-1}(\|u\|/\|u+e\|)$. The more noise and outliers in the system, the closer $\theta$ is to $\pi/2$. This leads to a large value of $\kappa_{y\rightarrow a}$. Hence it is important to have good estimates of both $a$ and $y$ in the presence of missing data and outliers in the observation.

In [22, 23], Huber proposed a robust alternative to (17) by considering

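$$\min_{a}\left\{\sum_{i=1}^{N}\rho\Big(y_{i}-\sum_{j=1}^{p}x_{ij}a_{j}\Big)=\sum_{i=1}^{N}\rho(e_{i})\right\},\tag{18}$$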

where $\rho(x)$ is chosen so that it is less sensitive to large $|x|$ . In particular, the proposed $\rho$ has the form

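$$\rho(x)=\begin{cases}\frac{1}{2}|x|^{2}&\text{if }|x|<c\\ c|x|-\frac{1}{2}c^{2}&\text{if }|x|\geq c\end{cases}$$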

where $c$ is a chosen, data-dependent constant. From the definition of $\rho$ we see that if $|e_{i}|<c$, least squares is performed in (18) and hence the model respects the additive Gaussian noise assumption in (16). When $|e_{i}|\geq c$, $e_{i}$ is no longer considered Gaussian but is assumed to follow a Laplacian distribution of the form

$$P(e_{i})=f_{c}(e_{i})=\left[C(c)e^{\frac{1}{2}c^{2}}\right]e^{-c|e_{i}|},$$

where $C(c)$ is a constant such that $\int_{-\infty}^{\infty}f_{c}(x)\ dx=1$. Note that the Laplacian distribution has heavier tails than a Gaussian distribution and hence accommodates large $|e_{i}|$ better than the Gaussian distribution. Another popular robust choice for $\rho$ is the least absolute deviation [2], which amounts to having $\rho(e_{i})=|e_{i}|$ and also corresponds to a Laplacian distribution.
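The piecewise definition of Huber's $\rho$ is short enough to write out directly. The sketch below is a plain illustration: it is quadratic for $|x|<c$ and linear for $|x|\geq c$, and the helper `huber_loss`, the toy data and the threshold value are hypothetical.

```python
def huber_rho(x, c):
    """Huber's rho: quadratic for small residuals, linear beyond the threshold c."""
    ax = abs(x)
    if ax < c:
        return 0.5 * ax * ax
    return c * ax - 0.5 * c * c

def huber_loss(y, X, a, c):
    """Sum of rho over the residuals y_i - sum_j x_ij * a_j."""
    total = 0.0
    for yi, row in zip(y, X):
        residual = yi - sum(xij * aj for xij, aj in zip(row, a))
        total += huber_rho(residual, c)
    return total

# Toy example: the third observation is an outlier with residual 7.
loss = huber_loss([1.0, 2.0, 10.0], [[1.0], [2.0], [3.0]], [1.0], c=1.0)
```

With a squared loss the outlier would contribute $\frac{1}{2}\cdot 7^{2}=24.5$; here it contributes $7-\frac{1}{2}=6.5$, which illustrates why large residuals have less influence on the estimate.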

The standard LASSO [37] amounts to learning a sparse parameter vector $a$ via minimizing

[TABLE]

This model is still sensitive to outliers. A modification to this model introduces an extra variable $z$ representing outliers (see [31] and references therein):

[TABLE]

Suppose $u_{i}=f(y_{i-p},\cdots,y_{i-1})$ . In [29], a robust nonparametric model is proposed:

[TABLE]

where $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) which includes Sobolev spaces. A slightly different approach is proposed in [17], where each $u_{i}$ is given by $f((Xa)_{i})$ , and both $a$ and $f$ are learned via solving

[TABLE]

In connection with the proposed model (15), take $r=1$ and let $u_{i}=f(\{y_{j}\}_{j\leq i-1})=\sum_{j=1}^{p}x_{ij}a_{j}$ with $x_{ij}=y_{i-j}$ and

[TABLE]

for some fixed $\alpha>0$ . Note that $f$ is completely determined by $a\in\mathbb{R}^{p}$ and estimating the parameter vector $a$ with the LASSO prior amounts to minimizing

[TABLE]

Here, the sparse vector representing outliers is $\{y_{i}-\tilde{y}_{i}\}_{i\in D}$. Note that the proposed method (15) is general enough to accommodate other types of noise in the data that are not additive (multiplicative Gaussian, Poisson, negative binomial, etc.).

There are quite a few existing methods on estimating the parameters in the presence of missing data, and from a high level perspective, they align with the following two approaches.

The first approach iteratively imputes missing and unobserved data in some manner and then uses the imputed and observed data to estimate the parameters. These methods include mean imputation, expectation maximization (EM) [12], and multiple imputation [35], among others. See [30, 33, 21] for a survey of some of these methods. Matrix completion [6, 9] is a form of imputation where missing entries in the data matrix are imputed under the assumption that the data matrix has low rank. The imputation performed in Algorithm 4 for solving (15) follows the iterative approach of the EM method [12], but instead of maximizing the expectation we maximize the likelihood.

The second approach either doesn’t impute or only imputes the necessary missing entries that the observed entries depend on. For instance, in the full information maximum likelihood (FIML) method [14], only complete data points are used as inputs to estimate the parameters. Suppose we are only interested in estimating the constant $a_{0}$ with $p=q=0$; then all observed data points are complete. However, if $p>0$ and one in every $p$ consecutive points is missing, then the set of complete data points is empty and hence the FIML method is not applicable. The non-negative definite covariance method [26, 11] only considers observed data points as inputs, as opposed to complete data points. Here the observed data points may depend on the missing data; however, this method sets this dependency to zero, that is, it imposes zero boundary conditions on the unobserved entries on which some of the observed entries may depend (see section 3.2.1 in [11]). A further modification was introduced to ensure that the covariance matrix is non-negative definite. This second approach can also be applied to our problem and it is more appropriate when the missing entries are systematic as opposed to random. To motivate the problem, consider the case where the correlation function is a constant, that is $u_{i}=c$ for all $i$. Assuming that the observed series $\widetilde{y}_{D}$ has no outliers, $c$ can be approximated by the mean $\frac{1}{|D|}\sum_{i\in D}\tilde{y}_{i}$ and there is no need to impute $y_{i}$ for any $i\notin D$. Similarly, suppose $D=\{k,\cdots,N\}$ and $u_{i}=f(y_{i-1})$; then it is only necessary to impute $y_{k-1}$ as opposed to $y_{j}$ for all $j<k$. One can view $y_{k-1}$ as an unknown boundary condition. Thus in this second approach, the minimization problem becomes

[TABLE]

where $\bar{D}$ consists of $D$ and all of the indices $j$ such that $u_{i}$ depends on $y_{j}$ for some $i\in D$, and $D^{\prime}$ consists of all the indices $k$ such that $u_{i}$ depends on $u_{k}$ for some $i\in D$. Note that the difference between (15) and (20) is that in (20) the sum is only over observed indices as opposed to all indices (observed and unobserved). It would be interesting to compare this approach with the proposed method (15); we leave this for future work.

3 Poisson Log-Linear Model

In this section, we use the Poisson distribution for the uncertainty condition and a log-linear correlation function to model $f$ in (15). In particular, we suppose that $y_{i}$ is only conditioned on $u_{i}$ and that it follows the Poisson distribution,

$$P(y_{i}\mid u_{i})=\frac{u_{i}^{y_{i}}e^{-u_{i}}}{\Gamma(y_{i}+1)},\tag{21}$$

where $\Gamma(y_{i}+1)=y_{i}!$ whenever $y_{i}$ is a nonnegative integer. Note that (21) is defined for all $y\geq 0$ . We consider $f$ to be a log-linear correlation function satisfying

$$\log(u_{i}+1)=a_{0}+\sum_{k=1}^{p}a_{k}\log(y_{i-k}+1)+\sum_{k=1}^{q}b_{k}\log(u_{i-k}+1),\tag{22}$$

for some $p,q\geq 0$ with the constraint $u_{i}\geq 0$ . In other words,

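$$u_{i}=f\left(\{y_{j}\}_{j\leq i-1},\{u_{j}\}_{j\leq i-1}\right)=\max\left(\exp(\log(u_{i}+1))-1,0\right),$$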

where $\log(u_{i}+1)$ is defined as in (22). Since $f$ is completely determined by $a_{0}$ , $a$ and $b$ , we see that the series $u$ is completely determined by $a_{0},a,b$ and the series $y$ .

The prior on $f$ is now given by

[TABLE]

Here we assume all the parameters are independent from each other, and follow a family of exponential probability distributions

[TABLE]

where $C(\mu,s)$ is chosen such that $\int_{-\infty}^{\infty}f_{\mu,s}(x)\ dx=1$ . $s=1$ corresponds to the LASSO constraint [37] and $s\in(0,1)$ corresponds to the Bayesian bridge constraint proposed in [16, 34].

Combining (21)-(23) into (15), the proposed variational problem becomes

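$$\begin{split}&\min_{a_{0},a,b,y}J(a_{0},a,b,y),\quad\text{where}\\ J(a_{0},a,b,y)&=\sum_{i=1}^{N}\left[u_{i}-y_{i}\log(u_{i})+\log(\Gamma(y_{i}+1))\right]+\lambda\sum_{i\in D}|y_{i}-\tilde{y}_{i}|^{r}+\mu\left(\sum_{k=1}^{p}|a_{k}|^{s}+\sum_{k=1}^{q}|b_{k}|^{s}\right),\end{split}\tag{24}$$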

for some $0<r,s\leq 1$ , and $\mu,\lambda>0$ .
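As an illustration of how the three parts of the energy fit together, the sketch below evaluates the Poisson fidelity term, the $\ell^{r}$ outlier penalty on the observed indices, and the $\ell^{s}$ penalty on the coefficients. It is a sketch under stated assumptions, not the paper's implementation: the function names, the dictionary `y_obs` mapping observed indices in $D$ to $\tilde{y}_{i}$, and the small `eps` guard inside the logarithm are all hypothetical.

```python
import math

def log_linear_means(y, a0, a, b):
    """Means u from the log-linear recursion with zero boundary conditions."""
    u = []
    for i in range(len(y)):
        s = a0
        s += sum(ak * math.log(y[i - k] + 1.0) for k, ak in enumerate(a, 1) if i - k >= 0)
        s += sum(bk * math.log(u[i - k] + 1.0) for k, bk in enumerate(b, 1) if i - k >= 0)
        u.append(max(math.exp(s) - 1.0, 0.0))
    return u

def poisson_energy(a0, a, b, y, y_obs, lam, mu, r, s, eps=1e-12):
    """Fidelity + lam * outlier penalty + mu * coefficient penalty."""
    u = log_linear_means(y, a0, a, b)
    # eps avoids log(0) when u_i is clipped to zero (an assumption of this sketch).
    fit = sum(ui - yi * math.log(ui + eps) + math.lgamma(yi + 1.0)
              for ui, yi in zip(u, y))
    outlier = lam * sum(abs(y[i] - y_tilde) ** r for i, y_tilde in y_obs.items())
    sparsity = mu * (sum(abs(ak) ** s for ak in a) + sum(abs(bk) ** s for bk in b))
    return fit + outlier + sparsity

# Constant model (p = q = 0) with u_i = 2 and two observed entries.
a0 = math.log(3.0)
y = [2.0, 2.0]
base = poisson_energy(a0, [], [], y, {0: 2.0, 1: 2.0}, lam=1.0, mu=1.0, r=0.5, s=0.5)
# Same imputed series, but the first observation now disagrees with it by 4.
shifted = poisson_energy(a0, [], [], y, {0: 6.0, 1: 2.0}, lam=1.0, mu=1.0, r=0.5, s=0.5)
```

In the second call the imputed value $y_{1}=2$ is kept while the observation is $6$, so the energy increases by $\lambda|2-6|^{1/2}=2$: the model pays a fixed-size price for declaring that entry an outlier.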

Remark 2.

It is possible to minimize the energy in (24) over the set of nonnegative integer-valued series $y$. However, this set is not convex. To overcome this non-convexity we extend (21) to all nonnegative real-valued series $y$ and therefore use $\Gamma(y_{i}+1)$ as opposed to $y_{i}!$. We remark that this extension is not the continuous version of the Poisson distribution proposed in [28, 24]. Given $\lambda\geq 0$, the cumulative probability distribution for the continuous Poisson is defined as $F_{\lambda}(t)=0$ if $t=0$ and

[TABLE]

Here

[TABLE]

is the incomplete Gamma function. Let the probability distribution function $f_{\lambda}$ be defined such that

[TABLE]

It can be shown that for all $t\geq 0$ ,

[TABLE]

Thus instead of (21), one can use $P(y_{i}|u_{i})=f_{u_{i}}(y_{i})$ or

[TABLE]

where $\lfloor t\rfloor$ is the largest integer that is less than or equal to $t$ .

4 Proximal Mapping for $\ell^{r}$ , $0<r<1$

One of the ingredients for computing (24) is to solve a subproblem

[TABLE]

for some $0<r\leq 1$ , $\mu>0$ and $t^{\prime}\in\mathbb{R}$ . For $r=1$ , it was shown in [37] that $J_{\partial E_{1}}(t^{\prime})$ is given by

[TABLE]

where $x_{+}=\max(x,0)$ . For $0<r<1$ , a common approach for solving (25) is to consider a regularized version via

[TABLE]

for some $\epsilon>0$ . The minimizer for $H_{r}(t)$ is approximated by $t_{\infty}=\lim_{n\rightarrow\infty}t_{n}$ where

[TABLE]

with some initial guess $t_{0}$ . Since $H_{r}(t)$ is nonconvex, $t_{\infty}$ may not be a global minimizer. In remark 3 below we show that for a range of $\mu$ , with the initial guess $t_{0}=t^{\prime}$ , the above iteration will converge to $t_{\infty}$ that is not a global minimizer.

There are explicit formulas for $\mathrm{shrink}_{r}$ when $r=1/2$ or $r=2/3$ [41], but for general $r\in(0,1)$, no closed-form expression for $\mathrm{shrink}_{r}$ exists. In [32], Nie et al. proposed a method for computing $\mathrm{shrink}_{r}$ by finding a zero of a strictly convex function using Newton's method. For completeness, we present here a method similar to the one proposed in [32].

Recall (25) and for simplicity assume $t^{\prime}\geq 0$ and denote by $t^{*}$ the global minimizer for $E_{r}(t)$ . First, note that $t^{*}\geq 0$ . Now, for $t>0$ we get

[TABLE]

Define

[TABLE]

Note also that for $t>0$ , $g_{r}(t)=t^{1-r}\frac{dE_{r}}{dt}(t)$ which shows that $g_{r}(t)=0$ if and only if $\frac{d}{dt}E_{r}(t)=0$ . Suppose $t^{*}>0$ . This implies that $t^{*}$ must satisfy

[TABLE]

On $(0,\infty)$ , we get

[TABLE]

and

[TABLE]

which is strictly greater than $0$. This implies $g_{r}$ is strictly convex on $(0,\infty)$ and achieves its minimal value at $t_{0}=\frac{1-r}{2-r}t^{\prime}$. Moreover, we have

[TABLE]

This implies the following:

1. If $g_{r}(t_{0})>0$, that is

[TABLE]

then $g_{r}$ has no zeros on $(0,\infty)$. This shows that a global minimizer $t^{*}>0$ does not exist.

2. If $g_{r}(t_{0})<0$, that is

[TABLE]

then $g_{r}$ has exactly two zeros $t_{1}\in(0,t_{0})$ and $t_{2}\in(t_{0},t^{\prime})$. Since $0<r<1$ and $\mu>0$, the function $E_{r}(t)$ is strictly increasing near zero. This implies that the zero $t_{1}$ is not a local minimizer for $E_{r}$, and hence $t^{*}=t_{2}$.

3. If $g_{r}(t_{0})=0$, that is

[TABLE]

then $g_{r}$ has exactly one zero at $t_{0}$. Since $t_{0}$ is the first zero of $g_{r}$ and $E_{r}$ is strictly increasing near $0$, we get that a global minimizer $t^{*}>0$ does not exist.
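The quantities used in the three cases can be re-derived from the definition $E_{r}(t)=\mu|t|^{r}+\frac{1}{2}|t-t^{\prime}|^{2}$ and the stated relation $g_{r}(t)=t^{1-r}\frac{dE_{r}}{dt}(t)$. The following is a reconstruction under that assumption, for $t>0$, $t^{\prime}\geq 0$ and $0<r<1$, and it may differ in form from the paper's own displays:

```latex
\begin{aligned}
\frac{dE_{r}}{dt}(t) &= \mu r\, t^{r-1} + (t - t^{\prime}),\\
g_{r}(t) &= \mu r + t^{2-r} - t^{\prime}\, t^{1-r},\\
g_{r}^{\prime}(t) &= (2-r)\, t^{1-r} - (1-r)\, t^{\prime}\, t^{-r},\\
g_{r}^{\prime\prime}(t) &= (2-r)(1-r)\, t^{-r} + r(1-r)\, t^{\prime}\, t^{-r-1} > 0,\\
g_{r}^{\prime}(t_{0}) &= 0 \iff t_{0} = \tfrac{1-r}{2-r}\, t^{\prime},\\
g_{r}(t_{0}) &= \mu r - (t^{\prime} - t_{0})\, t_{0}^{1-r}.
\end{aligned}
```

Under this reconstruction, the sign of $g_{r}(t_{0})$ is the sign of $\mu-\frac{1}{r}(t^{\prime}-t_{0})t_{0}^{1-r}$, so the three cases correspond to $\mu$ being larger than, smaller than, or equal to that threshold.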

Remark 3.

Cases (28) and (30) correspond to having $E_{r}(t)$ strictly increasing on $[0,\infty)$ , and hence $t^{*}=0$ is the global minimizer. Note in the case (30), $t_{0}$ is a saddle point of $E_{r}$ . As for the case (29), $E_{r}(t)$ has one positive local minimizer at $t_{2}$ and therefore the global minimizer $t^{*}=\mathrm{argmin}_{t\in\{0,t_{2}\}}E_{r}(t)$ . It is possible that $E_{r}(0)=E_{r}(t_{2})$ for some $\mu>0$ . In this case there is no uniqueness to the global minimizer of $E_{r}$ .

Figure 2 shows the plots of $E_{r}(t)$ and the corresponding $g_{r}(t)$ with $r=\frac{1}{2}$ and $t^{\prime}=5$ for various choices of $\mu$. The green lines correspond to the value $E_{r}(0)$ in $(i)$ and $0$ in $(ii)$. Let $\mu_{0}=\frac{1}{r}\left(t^{\prime}-t_{0}\right)t_{0}^{1-r}$. In the cases where $\mu=2\mu_{0},\mu_{0},\frac{3}{4}\mu_{0}$, the global minimizer for $E_{r}$ is $0$. However, for $\mu=\frac{1}{4}\mu_{0}$, the second zero of the corresponding $g_{r}$ is the global minimizer for $E_{r}$.

From the above remark, the shrinkage operator in (25) for some $t^{\prime}\geq 0$ is given by

[TABLE]

Computing the second zero $t_{2}\in(t_{0},t^{\prime})$ of $g_{r}$ (assuming $g_{r}(t_{0})<0$) is fast and straightforward since $g_{r}$ is strictly convex. For instance, one can use Newton's method as follows:

Algorithm 1.

Newton method for computing the second zero of $g_{r}$ (assuming $t^{\prime}\geq 0$ and $g_{r}(t_{0})<0$ ):

1. Set $t_{0}=t^{\prime}$, $t_{1}=t_{0}-g_{r}(t_{0})/g_{r}^{\prime}(t_{0})$ and $\epsilon=$ small.

2. while $|t_{m}-t_{m-1}|>\epsilon$

   • $t_{m+1}=t_{m}-g_{r}(t_{m})/g_{r}^{\prime}(t_{m})$.

3. end while.

In our numerical simulation, the above algorithm converges in $\leq 5$ Newton iterations with $\epsilon=10^{-6}$.

For general $t^{\prime}\in\mathbb{R}$ , we have

[TABLE]

where $\mathrm{shrink}_{r}(|t^{\prime}|,\mu)$ is given in (31).
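The procedure above can be sketched in a few lines. This is an illustrative implementation under explicit assumptions rather than the paper's code: $g_{r}(t)=\mu r+t^{2-r}-t^{\prime}t^{1-r}$ is reconstructed from the relation $g_{r}(t)=t^{1-r}\frac{dE_{r}}{dt}(t)$, Newton's iteration starts from $|t^{\prime}|$, the candidate $t_{2}$ is compared against $0$ through $E_{r}$, and ties are resolved in favour of $0$ (a choice made here, not taken from the source). All function names are hypothetical.

```python
import math

def g_r(t, t_prime, mu, r):
    """Reconstructed g_r(t) = mu*r + t^(2-r) - t'*t^(1-r) for t >= 0, t' >= 0."""
    return mu * r + t ** (2.0 - r) - t_prime * t ** (1.0 - r)

def dg_r(t, t_prime, r):
    """Derivative of g_r with respect to t, for t > 0."""
    return (2.0 - r) * t ** (1.0 - r) - (1.0 - r) * t_prime * t ** (-r)

def E_r(t, t_prime, mu, r):
    return mu * abs(t) ** r + 0.5 * (t - t_prime) ** 2

def shrink_r(t_prime, mu, r, eps=1e-6, max_iter=100):
    """Global minimizer of E_r for 0 < r <= 1, extended to negative t' by symmetry."""
    s = abs(t_prime)
    if s == 0.0:
        return 0.0
    t0 = (1.0 - r) / (2.0 - r) * s
    if g_r(t0, s, mu, r) >= 0.0:
        return 0.0  # g_r has no sign change, so E_r is nondecreasing on [0, inf)
    t = s  # Newton start to the right of the second zero
    for _ in range(max_iter):
        t_new = t - g_r(t, s, mu, r) / dg_r(t, s, r)
        converged = abs(t_new - t) <= eps
        t = t_new
        if converged:
            break
    if E_r(t, s, mu, r) >= E_r(0.0, s, mu, r):
        return 0.0  # tie or worse: keep 0 (assumption of this sketch)
    return math.copysign(t, t_prime)

# Setting of Figure 2: r = 1/2, t' = 5, mu expressed as multiples of mu_0.
t_prime, r = 5.0, 0.5
t0 = (1.0 - r) / (2.0 - r) * t_prime
mu0 = (t_prime - t0) * t0 ** (1.0 - r) / r
results = {c: shrink_r(t_prime, c * mu0, r) for c in (2.0, 0.75, 0.25)}
```

For $r=1$ the same routine reduces to soft thresholding, since $g_{1}(t)=\mu+t-t^{\prime}$ is linear and a single Newton step lands on $t^{\prime}-\mu$.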

Remark 4.

$E_{r}(t)$ can also be defined for $r=0$ . In this case, we get $E_{0}(0)=\frac{1}{2}(t^{\prime})^{2}$ and $E_{0}(t)=\mu+\frac{1}{2}(t-t^{\prime})^{2}$ for $t\neq 0$ . This implies for $t^{\prime}\neq 0$ , $\frac{dE_{0}}{dt}(t)=0$ whenever $t=t^{\prime}$ . Thus we obtain the following shrinkage operator (hard thresholding)

[TABLE]

Note that if $E_{0}(0)=\frac{1}{2}(t^{\prime})^{2}=\mu=E_{0}(t^{\prime})$ then there is no uniqueness of minimizer. Here we choose $t^{\prime}$ to be the minimizer but choosing $0$ is also appropriate.
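The $r=0$ case amounts to comparing the two candidate energies $E_{0}(0)=\frac{1}{2}(t^{\prime})^{2}$ and $E_{0}(t^{\prime})=\mu$. A minimal sketch, with the hypothetical name `hard_threshold` and the tie resolved in favour of $t^{\prime}$ as stated in the remark:

```python
def hard_threshold(t_prime, mu):
    """Keep t' when E_0(t') = mu is no larger than E_0(0) = t'^2 / 2, else return 0."""
    if 0.5 * t_prime * t_prime >= mu:
        return t_prime
    return 0.0

kept = hard_threshold(2.0, 1.0)     # E_0(0) = 2 > mu = 1, so t' is kept
dropped = hard_threshold(1.0, 1.0)  # E_0(0) = 0.5 < mu = 1, so the result is 0
tie = hard_threshold(2.0, 2.0)      # equal energies, resolved in favour of t'
```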

5 Numerical Implementation

There are numerous numerical methods that can be used to compute a minimizer for (24) (see [19] and references therein). We mention in particular the FISTA algorithm [3], which provides a global rate of convergence when the minimizing energy is convex. The functional in (24) is related to blind deconvolution, which is jointly nonconvex even in the case when $r=s=1$. When both $r$ and $s$ are rationals in $(0,1)$, numerical schemes such as PALM [4] or the Block Prox-Linear Method [39] provide global convergence. The numerical schemes FISTA and PALM are described in Algorithms 2-3. Algorithm 4 is a combination of the two, where we apply the time-step updating criteria from FISTA to the proximal alternating scheme in PALM. We will make comparisons between PALM and Algorithm 4 via numerical simulations.

We rewrite the energy from (24) as

[TABLE]

where

[TABLE]

Let

[TABLE]

where

[TABLE]

The proximal mappings for $G_{i}$ ’s are defined as:

[TABLE]

where $x$ is a vector in $\mathbb{R}^{N}$ such that $x_{D}=y_{D}-\widetilde{y}_{D}$ . In the following, denote by $\mathcal{P}_{D}(y)$ and $\mathcal{P}_{D^{c}}(y)$ the projections of $y\in\mathbb{R}^{N}$ onto $D$ and its complement $D^{c}$ , respectively.
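The role of the shift $x_{D}=y_{D}-\widetilde{y}_{D}$ and of the two projections can be illustrated with a small Python sketch of a single $y$-update. It is a simplification under stated assumptions. It takes $r=1$, so that the proximal map of $\tau G_{3}$ reduces to soft thresholding with threshold $\tau\lambda$, and it assumes $G_{3}(x)=\lambda\sum_{i\in D}|x_{i}|^{r}$ as suggested by the energy (24). The input `v` stands for the result of the gradient step on $y$, and the names `update_y` and `observed` are hypothetical.

```python
def soft_threshold(value, thresh):
    """Proximal map of thresh * |.| for a scalar (the r = 1 case)."""
    if value > thresh:
        return value - thresh
    if value < -thresh:
        return value + thresh
    return 0.0


def update_y(v, y_tilde, observed, tau, lam):
    """One y-update given v = (current point) - tau * (gradient of H in y).

    On D (observed[i] is True): shift by y_tilde, apply the prox, shift back.
    On the complement of D: keep the plain gradient step.
    """
    y_new = []
    for v_i, yt_i, obs_i in zip(v, y_tilde, observed):
        if obs_i:
            y_new.append(soft_threshold(v_i - yt_i, tau * lam) + yt_i)
        else:
            y_new.append(v_i)
    return y_new
```

Entries whose gradient step lands within $\tau\lambda$ of the observation are snapped back to $\widetilde{y}_{i}$. Entries that land further away are pulled toward the observation by $\tau\lambda$ but not onto it, which is the sense in which this update lets $y_{i}$ differ from $\widetilde{y}_{i}$.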

Algorithm 2.

FISTA [3] applied to (24).

• Initialize: $c_{0,2}=a_{0,1}$ , $c_{2}=a_{1}$ , $d_{2}=b_{1}$ , $z_{2}=y_{1}$ , $\alpha_{2}=1$ , $\tau=$ small enough and $\epsilon>0$ (tolerance).

• Do

1. $\alpha_{m+1}=(1+\sqrt{1+4\alpha_{m}^{2}})/2$ .

2. $a_{0,m}=c_{0,m}-\tau\nabla_{a_{0}}H(c_{0,m},c_{m},d_{m},z_{m})$ .

3. $c_{0,m+1}=a_{0,m}+\frac{\alpha_{m}-1}{\alpha_{m+1}}(a_{0,m}-a_{0,m-1})$ .

4. $a_{m}=J_{\tau\partial G_{1}}\left[c_{m}-\tau\nabla_{a}H(c_{0,m},c_{m},d_{m},z_{m})\right]$ .

5. $c_{m+1}=a_{m}+\frac{\alpha_{m}-1}{\alpha_{m+1}}(a_{m}-a_{m-1})$ .

6. $b_{m}=J_{\tau\partial G_{2}}\left[d_{m}-\tau\nabla_{b}H(c_{0,m},c_{m},d_{m},z_{m})\right]$ .

7. $d_{m+1}=b_{m}+\frac{\alpha_{m}-1}{\alpha_{m+1}}(b_{m}-b_{m-1})$ .

8. $\mathcal{P}_{D^{c}}(y_{m})=\mathcal{P}_{D^{c}}\left(z_{m}-\tau\nabla_{y}H(c_{0,m},c_{m},d_{m},z_{m})\right)$ .

9. $\mathcal{P}_{D}(y_{m})=J_{\tau\partial G_{3}}\left[\mathcal{P}_{D}\left(z_{m}-\tau\nabla_{y}H(c_{0,m},c_{m},d_{m},z_{m})\right)-\widetilde{y}_{D}\right]+\widetilde{y}_{D}$ .

10. $z_{m+1}=y_{m}+\frac{\alpha_{m}-1}{\alpha_{m+1}}(y_{m}-y_{m-1})$ .

• while $|J_{m}-J_{m-1}|>\epsilon$ .

• Set $(a_{0}^{*},a^{*},b^{*},y^{*})=(a_{0,m},a_{m},b_{m},y_{m})$ .
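To see how the extrapolation in the odd-numbered updates behaves, the following Python sketch tabulates the weight $(\alpha_{m}-1)/\alpha_{m+1}$ produced by the recursion in step 1, starting from $\alpha=1$ as in the initialization. The function name `momentum_weights` is introduced here for illustration only.

```python
import math


def momentum_weights(n_steps, alpha_start=1.0):
    """Return the first n_steps extrapolation weights (alpha_m - 1) / alpha_{m+1}."""
    alpha = alpha_start
    weights = []
    for _ in range(n_steps):
        alpha_next = (1.0 + math.sqrt(1.0 + 4.0 * alpha * alpha)) / 2.0
        weights.append((alpha - 1.0) / alpha_next)
        alpha = alpha_next
    return weights
```

The first weight is zero, so the first pass is a plain proximal gradient step. The later weights increase toward (but never reach) one, which means the extrapolation becomes more aggressive as the iteration proceeds.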

Algorithm 3.

PALM [4] applied to (24).

• Initialize: $a_{0,1}$ , $a_{1}$ , $b_{1}$ , $y_{1}$ , $\tau=$ small enough and $\epsilon>0$ (tolerance).

• Do

1. $a_{0,m+1}=a_{0,m}-\tau\nabla_{a_{0}}H(a_{0,m},a_{m},b_{m},y_{m})$ .

2. $a_{m+1}=J_{\tau\partial G_{1}}\left[a_{m}-\tau\nabla_{a}H(a_{0,m+1},a_{m},b_{m},y_{m})\right]$ .

3. $b_{m+1}=J_{\tau\partial G_{2}}\left[b_{m}-\tau\nabla_{b}H(a_{0,m+1},a_{m+1},b_{m},y_{m})\right]$ .

4. $\mathcal{P}_{D^{c}}(y_{m+1})=\mathcal{P}_{D^{c}}\left(y_{m}-\tau\nabla_{y}H(a_{0,m+1},a_{m+1},b_{m+1},y_{m})\right)$ .

5. $\mathcal{P}_{D}(y_{m+1})=J_{\tau\partial G_{3}}\left[\mathcal{P}_{D}\left(y_{m}-\tau\nabla_{y}H(a_{0,m+1},a_{m+1},b_{m+1},y_{m})\right)-\widetilde{y}_{D}\right]+\widetilde{y}_{D}$ .

• while $|J_{m+1}-J_{m}|>\epsilon$ .

• Set $(a_{0}^{*},a^{*},b^{*},y^{*})=(a_{0,m},a_{m},b_{m},y_{m})$ .
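The alternating structure of Algorithm 3, in which each block update uses the most recent values of the blocks before it, can be illustrated on a toy problem. The Python sketch below does not implement the energy (24). It swaps in a small stand-in energy with a smooth coupling $H(a,b)=\frac{1}{2}\sum_{i}(a_{i}b_{i}-c_{i})^{2}$, which is nonconvex jointly in $(a,b)$, and $\ell^{1}$ penalties (the $s=1$ case), so that the proximal maps are soft thresholding. All names and the toy data are assumptions made for illustration.

```python
def soft_threshold(value, thresh):
    """Proximal map of thresh * |.| for a scalar."""
    if value > thresh:
        return value - thresh
    if value < -thresh:
        return value + thresh
    return 0.0


def energy(a, b, c, mu):
    """Toy energy: 0.5 * sum (a_i*b_i - c_i)^2 + mu * (|a|_1 + |b|_1)."""
    fit = sum(0.5 * (ai * bi - ci) ** 2 for ai, bi, ci in zip(a, b, c))
    return fit + mu * (sum(abs(x) for x in a) + sum(abs(x) for x in b))


def palm_toy(c, mu, tau, eps=1e-10, max_iter=5000):
    """PALM-style alternating proximal gradient steps on the toy energy."""
    a = [1.0] * len(c)
    b = [1.0] * len(c)
    history = [energy(a, b, c, mu)]
    for _ in range(max_iter):
        # Block a: gradient of H in a at (a_m, b_m), followed by the prox.
        a = [soft_threshold(ai - tau * (ai * bi - ci) * bi, tau * mu)
             for ai, bi, ci in zip(a, b, c)]
        # Block b: gradient of H in b at (a_{m+1}, b_m), i.e. with the new a.
        b = [soft_threshold(bi - tau * (ai * bi - ci) * ai, tau * mu)
             for ai, bi, ci in zip(a, b, c)]
        history.append(energy(a, b, c, mu))
        if abs(history[-1] - history[-2]) <= eps:
            break
    return a, b, history


a_hat, b_hat, J = palm_toy(c=[2.0, -1.0, 0.05], mu=0.1, tau=0.1)
```

With a step size below the reciprocal of the blockwise Lipschitz constants, as is the case for the values chosen here, the recorded energies in `J` should be non-increasing, which is the descent property PALM-type schemes rely on.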

Algorithm 4.

Computing an optimal $(a_{0}^{*},a^{*},b^{*},y^{*})$ for (24).

• Initialize: $c_{0,2}=a_{0,1}$ , $c_{2}=a_{1}$ , $d_{2}=b_{1}$ , $z_{2}=y_{1}$ , $\alpha_{2}=1$ , $\tau=$ small enough and $\epsilon>0$ (tolerance).

• Do

1. $\alpha_{m+1}=(1+\sqrt{1+4\alpha_{m}^{2}})/2$ .

2. $a_{0,m}=c_{0,m}-\tau\nabla_{a_{0}}H(c_{0,m},c_{m},d_{m},z_{m})$ .

3. $c_{0,m+1}=a_{0,m}+\frac{\alpha_{m}-1}{\alpha_{m+1}}(a_{0,m}-a_{0,m-1})$ .

4. $a_{m}=J_{\tau\partial G_{1}}\left[c_{m}-\tau\nabla_{a}H(c_{0,m+1},c_{m},d_{m},z_{m})\right]$ .

5. $c_{m+1}=a_{m}+\frac{\alpha_{m}-1}{\alpha_{m+1}}(a_{m}-a_{m-1})$ .

6. $b_{m}=J_{\tau\partial G_{2}}\left[d_{m}-\tau\nabla_{b}H(c_{0,m+1},c_{m+1},d_{m},z_{m})\right]$ .

7. $d_{m+1}=b_{m}+\frac{\alpha_{m}-1}{\alpha_{m+1}}(b_{m}-b_{m-1})$ .

8. $\mathcal{P}_{D^{c}}(y_{m})=\mathcal{P}_{D^{c}}\left(z_{m}-\tau\nabla_{y}H(c_{0,m+1},c_{m+1},d_{m+1},z_{m})\right)$ .

9. $\mathcal{P}_{D}(y_{m})=J_{\tau\partial G_{3}}\left[\mathcal{P}_{D}\left(z_{m}-\tau\nabla_{y}H(c_{0,m+1},c_{m+1},d_{m+1},z_{m})\right)-\widetilde{y}_{D}\right]+\widetilde{y}_{D}$ .

10. $z_{m+1}=y_{m}+\frac{\alpha_{m}-1}{\alpha_{m+1}}(y_{m}-y_{m-1})$ .

• while $|J_{m}-J_{m-1}|>\epsilon$ .

• Set $(a_{0}^{*},a^{*},b^{*},y^{*})=(a_{0,m},a_{m},b_{m},y_{m})$ .
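The difference between Algorithm 4 and Algorithm 2 is that each gradient is evaluated at the most recently extrapolated blocks. A toy Python sketch of that ordering is given below. It does not use the energy (24). As a stand-in it uses the smooth coupling $H(a,b)=\frac{1}{2}\sum_{i}(a_{i}b_{i}-c_{i})^{2}$ with $\ell^{1}$ penalties (the $s=1$ case), and every name and data value in it is an assumption made for illustration.

```python
import math


def soft_threshold(value, thresh):
    """Proximal map of thresh * |.| for a scalar."""
    if value > thresh:
        return value - thresh
    if value < -thresh:
        return value + thresh
    return 0.0


def energy(a, b, c, mu):
    """Toy energy: 0.5 * sum (a_i*b_i - c_i)^2 + mu * (|a|_1 + |b|_1)."""
    fit = sum(0.5 * (ai * bi - ci) ** 2 for ai, bi, ci in zip(a, b, c))
    return fit + mu * (sum(abs(x) for x in a) + sum(abs(x) for x in b))


def accelerated_alternating(c, mu, tau, eps=1e-10, max_iter=2000):
    """Alternating proximal steps with FISTA-style extrapolation (toy version)."""
    a_prev = [1.0] * len(c)
    b_prev = [1.0] * len(c)
    ext_a, ext_b = list(a_prev), list(b_prev)  # play the role of c_m and d_m
    alpha = 1.0
    history = [energy(a_prev, b_prev, c, mu)]
    for _ in range(max_iter):
        alpha_next = (1.0 + math.sqrt(1.0 + 4.0 * alpha * alpha)) / 2.0
        w = (alpha - 1.0) / alpha_next
        # Block a: prox-gradient step taken from the extrapolated point.
        a = [soft_threshold(x - tau * (x * y - ci) * y, tau * mu)
             for x, y, ci in zip(ext_a, ext_b, c)]
        ext_a = [ai + w * (ai - ap) for ai, ap in zip(a, a_prev)]
        # Block b: the gradient already sees the new extrapolated a-block.
        b = [soft_threshold(y - tau * (x * y - ci) * x, tau * mu)
             for x, y, ci in zip(ext_a, ext_b, c)]
        ext_b = [bi + w * (bi - bp) for bi, bp in zip(b, b_prev)]
        a_prev, b_prev, alpha = a, b, alpha_next
        history.append(energy(a, b, c, mu))
        if abs(history[-1] - history[-2]) <= eps:
            break
    return a_prev, b_prev, history


a_acc, b_acc, J_acc = accelerated_alternating(c=[2.0, -1.0, 0.05], mu=0.1, tau=0.1)
```

Unlike a plain alternating scheme, the extrapolated iteration is not guaranteed to decrease the energy at every step, so the energy history is worth inspecting rather than assuming monotone descent.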

Recall,

[TABLE]

The differentials of $H$ with respect to its variables are as follows.

[TABLE]

where

[TABLE]

Therefore,

[TABLE]

We have

[TABLE]

where

[TABLE]

Similarly,

[TABLE]

where

[TABLE]

Lastly,

[TABLE]

where

[TABLE]

6 Numerical results

In this section, we validate the model (24) with simulated data. We argue that the regularization on the parameters with $\ell^{s}$ , $0<s\leq 1$ , and the sparsity constraint using $\ell^{r}$ , $0<r\leq 1$ , are both necessary for estimating the parameters accurately. Throughout this section, we simulate data according to conditions (21) and (22) iteratively using $p=6$ , $q=0$ and the true parameters

[TABLE]

with the size of the series $N=1000$ . In all the figures below, $\tilde{y}$ and $\tilde{u}$ are the simulated observed series and the true mean series obtained with these parameters. We then feed partially observed versions of $\tilde{y}$ to the model described in (24) to reconstruct the extended series $y$ , its mean $u$ and the parameters $a_{0}$ , $a$ . Also plotted in these figures is the vector $D$ with the following interpretation: $D_{i}=1$ implies $\tilde{y}_{i}$ is observed and $D_{i}=0$ implies $\tilde{y}_{i}$ is unobserved.

Example 1.

Figure 3 shows a comparison of performance between Algorithms 3 and 4 for 100 simulations. For each simulation, only $75\%$ of entries are observed and among these entries $2.5\%$ are contaminated. In both algorithms, the parameters used are $\lambda=5,r=0.5,\mu=30$ and $s=1$ . For Algorithm 3 we use $\tau=10^{-4}$ , and for Algorithm 4 we use $\tau=10^{-5}$ . Even with a smaller $\tau$ , Algorithm 4 converges in 800 iterations on average, whereas it takes on average 10,000 iterations for Algorithm 3 to converge. From the plots we observe that both algorithms provide similar statistics on the estimated parameters.

Example 2.

In this example, using Algorithm 4 we compare the results of estimating the parameters $a_{0}$ and $a$ with and without using regularization on $a$ . Figures 4-6 show the plots of the estimated $a_{0}$ and $a$ for three different amounts of observations (100%, 75% and 50%). In all these cases we use $r=1/2,\lambda=5$ and $s=1$ . The value of $\mu$ changes depending on the amount of missing data. As the amount of missing data increases, the parameter estimation performance deteriorates when no regularization on $a$ is used (i.e. $\mu=0$ ). However, with $\ell^{s}$ regularization on $a$ , the estimated parameters are much closer to the true values.

Example 3.

In this example, using Algorithm 4 we show the significance of having the sparsity constraint $\sum_{i\in D}|y_{i}-\widetilde{y}_{i}|^{r}$ in the model (24) to desensitize the model to anomalies and outliers and thereby obtain a more accurate parameter estimation. Here we assume all data are observed, that is $D=\{1,\cdots,N\}$ . For each simulation, we randomly select a certain percentage (1%, 5% or 10%) of the data and replace them with some anomalous value (here we pick $20$ as an example). See Figure 7 for an example showing the original series and a contaminated version. To test the significance of the sparsity constraint term, we consider two scenarios:

1. Assuming the observed data has no outliers and hence enforcing $y_{i}=\widetilde{y}_{i}$ , $i\in D$ . This amounts to picking $\lambda$ to be large, say $\lambda=50$ .

2. Assuming the observed data has outliers and hence allowing for $y_{i}\neq\widetilde{y}_{i}$ , $i\in D$ , whenever $\widetilde{y}_{i}$ is an anomaly. This amounts to picking $\lambda$ to be small, say $\lambda=2$ .

In all cases the remaining parameters are: $r=1/2,s=1$ and $\mu=10$ . Figures 8-10 show the plots of the estimated parameters $a_{0}$ and $a$ for $100$ simulations with contamination levels of $1\%$ , $5\%$ and $10\%$ . We observe that enforcing $y_{i}=\widetilde{y}_{i}$ (that is $\lambda=50$ ) greatly alters the reconstruction of $a_{0}$ and $a$ , and this error increases as the amount of contamination increases. This enforcement causes an increase in the mean value (see Figures 11-13) and a decrease in the absolute values of the correlation coefficients in the reconstructed time series (see Figures 8-10). Letting $\lambda=2$ , which is small, prevents the model from fitting $y_{i}$ to the anomalous $\widetilde{y}_{i}$ . As a result, the estimated parameters $a_{0}$ and $a$ are much closer to the ground truth. The inaccuracy in parameter estimation affects the reconstruction of the mean series, and hence the prediction. The results are visible in Figures 11-13.

Example 4.

Figures 14-15 show numerical results, obtained with Algorithm 4, for 100 simulated series that have both missing entries and contamination among the observed ones. In Figure 14, $75\%$ of the entries are observed and $2.5\%$ of these entries are contaminated with outliers. The parameters used are $\lambda=5,r=0.5,\mu=30$ and $s=1$ . In Figure 15, $50\%$ of the entries are observed and $2.5\%$ of these entries are contaminated with outliers. The parameters used are $\lambda=5,r=0.5,\mu=60$ and $s=1$ .
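A masking and contamination setup of this kind could be mimicked with the Python sketch below. It makes several assumptions that are not taken from the paper: observed positions and contaminated positions are drawn uniformly at random, the anomalous value $20$ is borrowed from Example 3, and a constant series is used as a stand-in for data simulated from conditions (21) and (22). The function name `mask_and_contaminate` is hypothetical.

```python
import random


def mask_and_contaminate(series, observed_frac=0.75, contam_frac=0.025,
                         anomaly=20, seed=0):
    """Return (D, y_obs, contaminated_idx) for a given series.

    D[i] = 1 marks an observed entry and D[i] = 0 an unobserved one.
    A fraction contam_frac of the observed entries is replaced by `anomaly`.
    """
    rng = random.Random(seed)
    n = len(series)
    observed_idx = rng.sample(range(n), int(round(observed_frac * n)))
    D = [0] * n
    for i in observed_idx:
        D[i] = 1
    n_contam = int(round(contam_frac * len(observed_idx)))
    contaminated_idx = sorted(rng.sample(observed_idx, n_contam))
    y_obs = list(series)
    for i in contaminated_idx:
        y_obs[i] = anomaly
    return D, y_obs, contaminated_idx


# Stand-in series of length N = 1000 (not simulated from the model).
D, y_obs, contaminated = mask_and_contaminate([3] * 1000)
```

Only the entries with `D[i] == 1` would be passed to the model as $\widetilde{y}_{D}$. The values stored at unobserved positions are placeholders and play no role.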

Example 5.

In the proposed model (24), $p$ and $q$ are given a priori, and in all previous examples we assume $p=6$ and $q=0$ are known. However, in real applications these values need to be estimated. Model selection, that is picking the right choice for $p$ and $q$ , is a difficult problem to tackle. Current approaches involve running the algorithm on various choices of $p$ and $q$ and then using criteria such as AIC [1], BIC [36], etc. to pick the optimal values. We argue that the $\ell^{s},0<s\leq 1,$ constraint on the parameters $a$ and $b$ removes the model selection task from the problem and lets the model discover an optimal sparse solution for $a$ and $b$ . In this example, we choose $p=15$ and $q=0$ . Figure 16 shows the box plots of the estimated parameters $a_{0}$ and $a$ for 100 simulations. For each simulation, only $75\%$ of entries are observed and $2.5\%$ of the observed entries are contaminated. The parameters used are: $\lambda=5,r=0.5,\mu=10,$ and $s=0.75$ .

Discussion: In this paper we present an autoregressive time series model to robustly learn the parameters (the mean and correlation coefficients) in the presence of noise, outliers and missing entries. In the presence of outliers or anomalies, we show that the nonconvex sparsity constraint desensitizes the model to outliers and, as a result, the model provides a more robust estimation of the parameters. In the presence of missing entries, we show that the constraint $\ell^{s}$ , $0<s\leq 1$ , on the parameters significantly improves the accuracy of the estimated parameters. Model selection, that is picking the right choice for $p$ and $q$ , is a difficult problem to tackle in time series analysis. Current approaches involve estimating the parameters for various choices of $p$ and $q$ and then using criteria such as AIC, BIC, etc. to pick the optimal $p$ and $q$ . The $\ell^{s},0<s\leq 1,$ constraint on the parameters $a$ and $b$ removes the model selection task from the problem and lets the model select an optimal sparse solution for $a$ and $b$ . However, in return one needs to provide the parameters $\lambda$ and $\mu$ as inputs. We also mention that the proposed model (15) can be applied to other types of noise besides additive Gaussian or Poisson. Moreover, this model can also be extended to a multivariate case.

Appendix A Appendix

We remark that this is a standard technique for deriving the likelihood function, see for instance [27]. For completeness, we show the likelihood function for our problem here in the following proposition.

Proposition 1.

Given the observed series $\tilde{y}_{D}=\{\tilde{y}_{i}\}_{i\in D}$ , the minimization problem (15) is equivalent to

[TABLE]

for some fixed $\lambda>0$ and $0<r\leq 1$ .

Proof.

Assume for all $i\in D$ , $\tilde{y}_{i}=y_{i}+x_{i}$ , where $x_{i}$ is an additive random noise with

[TABLE]

This implies

[TABLE]

Using Bayes’ law, the joint probability is given by

[TABLE]

Since $\tilde{y}_{i}$ is completely determined by $y_{i}$ and $\lambda$ (given), we have

[TABLE]

This implies

[TABLE]

Recursively applying the same technique to $P(\{\tilde{y}_{j}\}_{j\in D\setminus\{i\}},y,f,\theta)$ , we get

[TABLE]

As for $P(y,f,\theta)$ , we first note that $u_{i}$ is completely determined by $\{y_{j}\}_{j<i}$ , $f$ and $\theta$ . This implies

[TABLE]

Since $y_{N}$ is completely determined by $u_{N}$ and $\theta_{N}$ , we get

[TABLE]

This implies

[TABLE]

Recursively applying the same technique to $P(y_{N-1},\cdots,y_{1},f,\theta)$ , we get

[TABLE]

Combining (39) and (40), we have

[TABLE]

where $P(f,\theta)$ is the joint prior on $f$ and $\theta$ .

The maximization problem (38) is equivalent to

[TABLE]

where

[TABLE]

Treating $r$ and $\lambda$ as fixed constants, we see that (42) is equivalent to

[TABLE]

which is (15) if we assume $f$ and $\theta$ are independent. ∎

Remark 5.

Using the same techniques as above, we see that the negative log-likelihood function corresponding to the energy (24) is given by

[TABLE]
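As a worked step, the two data terms of the energy (24) can be traced back to their densities. The lines below are a reconstruction under stated assumptions rather than a quotation from the paper: the uncertainty condition is taken to be $y_{i}\mid u_{i}\sim\mathrm{Poisson}(u_{i})$ , the additive noise on observed entries is assumed to have density proportional to $\exp(-\lambda|x_{i}|^{r})$ , and the normalizing constant $C_{\lambda,r}$ is a label introduced here.

```latex
% Poisson uncertainty condition, using y_i! = \Gamma(y_i + 1):
-\log P(y_i \mid u_i)
  = -\log\!\left( \frac{u_i^{\,y_i}\, e^{-u_i}}{y_i!} \right)
  = u_i - y_i \log(u_i) + \log\Gamma(y_i + 1).

% Assumed noise model on observed entries, with x_i = \tilde{y}_i - y_i
% and P(x_i) = C_{\lambda,r} \exp(-\lambda |x_i|^{r}):
-\log P(\tilde{y}_i \mid y_i)
  = \lambda\, |y_i - \tilde{y}_i|^{r} - \log C_{\lambda,r},
  \qquad i \in D.
```

Summing the first line over $i=1,\dots,N$ and the second over $i\in D$ , and dropping the constant $\log C_{\lambda,r}$ (which does not depend on the unknowns when $r$ and $\lambda$ are fixed), gives the Poisson term and the $\lambda\sum_{i\in D}|y_{i}-\widetilde{y}_{i}|^{r}$ term of the energy.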

Bibliography

[1] Hirotugu Akaike. A new look at the statistical model identification. Automatic Control, IEEE Transactions on, 19(6):716–723, 1974.
[2] Gilbert Bassett Jr and Roger Koenker. Asymptotic theory of least absolute error regression. Journal of the American Statistical Association, 73(363):618–622, 1978.
[3] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[4] Jerome Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494, 2014.
[5] George EP Box and Gwilym Jenkins. Time series analysis: Forecasting and control. Holden-Day, San Francisco, 1970.
[6] Emmanuel J Candes and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[7] Emmanuel J Candes, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. Information Theory, IEEE Transactions on, 52(2):489–509, 2006.
[8] Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.
