Adaptation and learning over networks for nonlinear system modeling

Simone Scardapane; Jie Chen; Cédric Richard

arXiv:1704.08913·stat.ML·May 1, 2017

TL;DR

This paper explores distributed nonlinear system modeling, emphasizing the distinction between single-task and multitask problems, and introduces a kernel-based algorithm for multitask scenarios evaluated on a benchmark.

Contribution

It introduces a simple kernel-based algorithm tailored for multitask nonlinear system modeling in distributed environments, addressing a gap in existing literature.

Findings

1. The proposed algorithm performs well on simulated benchmark tasks.

2. Multitask modeling offers advantages over single-task approaches in distributed settings.

3. Open problems and future research directions are discussed.

Equations

J_{k}(\mathbf{w}_{k})=\mathbb{E}\Bigl\{L\bigl(d_{k},f_{k}(\mathbf{u}_{k})\bigr)\Bigr\}\,.

\underset{\mathbf{w}_{1},\ldots,\mathbf{w}_{N}\in\R^{q}}{\min}\Bigl\{J_{\text{glob}}(\mathbf{w}_{1},\ldots,\mathbf{w}_{N})\Bigr\}=\sum_{k=1}^{N}J_{k}(\mathbf{w}_{k})\,,

\sum_{k=1}^{N}A_{kl}=1,\quad A_{kl}\geq 0\quad\text{for any }k,l=1,\ldots,N\,.

\bm{\phi}_{k,n}

\mathbf{w}_{k,n}

\sigma(s)=\frac{1}{1+\exp\{-s\}}\,,

J_{k}(\mathbf{w}_{k})=\mathbb{E}\Bigl\{-d_{k}\cdot\log\bigl(f_{k}(\mathbf{u}_{k})\bigr)-(1-d_{k})\cdot\log\bigl(1-f_{k}(\mathbf{u}_{k})\bigr)\Bigr\}+\frac{\lambda}{2N}\left\lVert\mathbf{w}_{k}\right\rVert_{2}^{2}\,,

\bm{\phi}_{k,n}=\mathbf{w}_{k,n-1}+\mu_{k}\bigl(d_{k}(n)-f_{k,n-1}(\mathbf{u}_{k,n})\bigr)\mathbf{u}_{k,n}-\frac{1}{N}\mu_{k}\lambda\mathbf{w}_{k,n-1}\,.

J^{\text{glob}}(\mathbf{w}_{1},\ldots,\mathbf{w}_{N})=\sum_{k=1}^{N}J_{k}(\mathbf{w}_{k})+\eta\sum_{k=1}^{N}\sum_{l\neq k,\,l\in\mathcal{N}_{k}}\rho_{kl}\left\lVert\mathbf{w}_{k}-\mathbf{w}_{l}\right\rVert_{2}^{2}\,,

\sum_{l=1}^{N}\rho_{kl}=1,\quad\text{and}\quad\rho_{kl}=0\ \text{if}\ l\notin\mathcal{N}_{k},\quad\forall k\in\{1,\ldots,N\}\,.

\mathbf{w}_{k,n}=\mathbf{w}_{k,n-1}-\mu_{k}\nabla J_{k}(\mathbf{w}_{k,n-1})-\mu_{k}\eta\sum_{l\neq k,\,l\in\mathcal{N}_{k}}\frac{\rho_{kl}+\rho_{lk}}{2}\bigl(\mathbf{w}_{k,n-1}-\mathbf{w}_{l,n-1}\bigr)\,,

\mathbf{w}_{k,n}=

\mu_{k}\eta\sum_{l\neq k,\,l\in\mathcal{N}_{k}}\frac{\rho_{kl}+\rho_{lk}}{2}\bigl(\mathbf{w}_{k,n-1}-\mathbf{w}_{l,n-1}\bigr)\,.

h_{i}(\mathbf{u})=\frac{1}{1+\exp\{-\mathbf{a}_{i}^{T}\mathbf{u}-b_{i}\}}\,,

\mathcal{K}(\mathbf{u}_{1},\mathbf{u}_{2})\approx\langle\mathbf{h}(\mathbf{u}_{1}),\mathbf{h}(\mathbf{u}_{2})\rangle\,.

d_{k}(n)=\psi_{k}^{o}(\mathbf{u}_{k,n})+\nu_{k}(n)\,,

\psi_{k}^{o}=\psi^{o}\quad\forall k\in\{1,\ldots,N\}\,,

\nabla J_{k}(\psi_{k})=-2\,\mathbb{E}\Bigl\{\bigl(d_{k}-\psi_{k}(\mathbf{u}_{k})\bigr)\kappa(\cdot,\mathbf{u}_{k})\Bigr\}\,,

\delta_{k,n}

\psi_{k,n}

\psi_{k,n}=\bm{\beta}_{k,n}^{T}\mathbf{k}_{k,n}\,,

\bm{\delta}_{k,n}

\bm{\beta}_{k,n}

d_{k}(n)=f_{k}^{o}(\mathbf{w}_{k}^{T}\mathbf{u}_{k,n})+\nu_{k}(n)\,.

\psi_{k,n-1}

\bm{\xi}_{k,n-1}=\sum_{l\in\mathcal{N}_{k}}A_{lk}\,\mathbf{q}_{i,l,n-1}\,.

y_{k}(n)=\mathbf{u}^{T}\mathbf{B}\,\bm{\xi}_{k,n-1}\,.

\mathbf{B}=\frac{1}{2}\begin{bmatrix}-1&3&-3&1\\2&-5&4&-1\\-1&0&1&0\\0&2&0&0\end{bmatrix}\,.

\mathbf{w}_{k,n}

\mathbf{q}_{i,k,n}

\psi_{k}^{o}\sim\psi_{l}^{o}\quad\text{if}\ l\in\mathcal{N}_{k}\,,

J^{\text{glob}}(\psi_{1},\ldots,\psi_{N})=\sum_{k=1}^{N}\mathbb{E}\Bigl\{\bigl\lvert d_{k}(n)-\psi_{k}(\mathbf{u}_{k,n})\bigr\rvert^{2}\Bigr\}+\eta\sum_{k=1}^{N}\sum_{l\neq k,\,l\in\mathcal{N}_{k}}\rho_{kl}\left\lVert\psi_{k}-\psi_{l}\right\rVert_{\mathcal{H}}^{2}\,,

Taxonomy

Topics: Energy Efficient Wireless Sensor Networks · Distributed Sensor Networks and Detection Algorithms · Distributed Control Multi-Agent Systems

Full text

Adaptation and learning over networks for nonlinear system modeling

Simone Scardapane

Jie Chen

Cédric Richard

Department of Information Engineering, Electronics and Telecommunications, Sapienza University of Rome, Via Eudossiana 18, 00184 Rome, Italy

Northwestern Polytechnical University, Xi’an, School of Marine Science and Technology, 127 West Youyi Road, 710072, Xi’an (China)

Université Côte d’Azur, Laboratoire Lagrange (UMR CNRS 7293), Parc Valrose, 06108, Nice Cedex 2 (France)

Abstract

In this chapter, we analyze nonlinear filtering problems in distributed environments, e.g., sensor networks or peer-to-peer protocols. In these scenarios, the agents in the environment receive measurements in a streaming fashion, and they are required to estimate a common (nonlinear) model by alternating local computations and communications with their neighbors. We focus on the important distinction between single-task problems, where the underlying model is common to all agents, and multitask problems, where each agent might converge to a different model due to, e.g., spatial dependencies or other factors. Currently, most of the literature on distributed learning in the nonlinear case has focused on the single-task case, which may be a strong limitation in real-world scenarios. After introducing the problem and reviewing the existing approaches, we describe a simple kernel-based algorithm tailored for the multitask case. We evaluate the proposal on a simulated benchmark task, and we conclude by detailing currently open problems and lines of research.

Keywords: Nonlinear system identification, distributed systems, adaptive methods, reproducing kernel Hilbert spaces, diffusion algorithms

journal: Neural Networks

To be published as a chapter in ‘Adaptive Learning Methods for Nonlinear System Modeling’, Elsevier Publishing, Eds. D. Comminiello and J.C. Principe (2018)

1 Introduction

Adaptive filters have been at the heart of digital signal processing over the last century, thanks to their capability of rapidly adapting to streams of incoming data. At the same time, classical filtering approaches have not been satisfactory in handling the challenges posed by the large-scale, unstructured big data scenarios that are common today. This has fostered the recent development of techniques to deal with such problems, allowing signal processing to scale to truly massive datasets [1], and to work with less structured data types, such as graphs [2, 3]. In this chapter, we look at one peculiar aspect unifying several big data problems, namely, their distributed nature, where the data are naturally generated and aggregated at different locations with possibly poor or expensive network connectivity. Examples of problems in this category abound, including (but not limited to) wireless sensor networks (WSNs) [4], distributed databases, robotic swarms, and fog computing platforms. In all these scenarios, the agents in the network can be severely limited in their capabilities, in terms of either energy constraints (e.g., low-power devices in WSNs), connectivity, privacy, or other aspects. As a consequence, any solution devised for learning and inference over networks needs to be aware of these constraints, making this a challenging problem with wide applications.

Distributed learning can be cast as a decentralized optimization problem, which has a long history in the optimization field [5] and in artificial intelligence. In recent years, this problem has gained renewed interest from the machine learning community, with the development of a number of learning protocols for a variety of models, including boosting [6], support vector machines [7, 8], kernel regression [4], and sparse linear models [9, 10], to cite a few. Several of these were also applied in signal processing problems, most notably in order to provide distributed inference capabilities in WSNs [4]. A large majority of them, however, are only applicable in batch situations, where each agent is allowed to manipulate its entire (local) dataset at each iteration. Distributed filtering algorithms, on the contrary, require the development of online solutions, where the data are received and processed sequentially by the agents.

In the filtering literature, a recent series of works was initiated by the development of the so-called ‘diffusion filtering’ (DF) algorithms, starting from the diffusion LMS [11] and the diffusion RLS [12], up to their more general formulation in terms of generic convex cost functions [13, 14, 15], which are considered here. DF algorithms are characterized by interleaving local updates in parallel (mimicking classical filters), with communication steps, during which the agents exchange information on their current estimate with their neighbors. Following the development of the main theory, in the subsequent years, several authors have extended classical linear filters to the distributed case (e.g., sparse LMS [10] and group LMS [16]), while others have focused on nonlinear filters, as we review further on in the chapter. For the interested reader, we refer to the guest editorial in [17] and references therein for a recent overview of the literature.

All the works mentioned up to now assume that the agents are identifying a common model. This is a reasonable assumption in several contexts, particularly when the statistics of the data do not depend on the spatial locations of the agents, making it very common in distributed machine learning problems [7, 15]. In general, however, the agents could be interested in different identification problems, which are similar in some quantifiable sense. As an example, consider the setup illustrated in Fig. 1. If the agents are sensors deployed over an environment, trying to predict some quantity of interest (e.g., some pollutant concentration), the model might be different among groups (clusters) of agents, possibly due to the spatial conformation of the ground (e.g., sensors deployed over a mountain vs. sensors deployed in a valley). Nonetheless, since they are all trying to predict the same quantity of interest, communication between agents belonging to different clusters can be beneficial.

The first DF solution to this problem, which is termed a multitask network, was analyzed in [18]. Following this, a range of algorithms were proposed in the linear case. Most notably, Chen et al. [19] and Zhao and Sayed [20] investigated the possibility of unsupervised learning of the cluster structure when it is not available a priori. Additional developments include the extension to asynchronous networks where, e.g., links may fail or agents might disconnect [21]; proximal updates for nondifferentiable regularization terms [22]; total least squares approaches [23]; and, finally, multitask learning over (linear) latent subspaces [24, 25]. Almost no work, however, has addressed the problem of learning in a multitask network with nonlinear models.

Based on the previous discussion, this chapter has three separate aims. First, we introduce the fundamental concept of DF in Section 2, which serves as a very general introduction to the topic. Next, we summarize recent works on nonlinear DF algorithms in Section 3, with an emphasis on three classes of solutions. In order to motivate research on multitask learning with nonlinear models, in Section 4 we propose a multitask kernel algorithm based on a functional formulation of DF. Finally, we validate the algorithm on an experimental benchmark in Section 5, before making some final remarks in Section 6.

2 Mathematical formulation of the problem

This section is intended to familiarize the reader with some basic theoretical elements underlying most distributed learning scenarios. We start by providing a setup for the problem in Section 2.1. Next, we describe a general class of algorithms based on diffusion protocols in Section 2.2. In Section 2.3, we show how these algorithms can be customized to address multitask scenarios. For conciseness, we only focus on a selection of key items, without providing a comprehensive treatment of these topics. We refer the interested reader to [15, 14] for introductory references.

2.1 Problem setup

Let us consider a generic network of $N$ agents (e.g., sensors in a WSN) such as the one depicted in Fig. 1. We assume that time is slotted and, at every time instant $n$, each agent receives a new observation $\left(\mathbf{u}_{k,n},d_{k}(n)\right)$, where $\mathbf{u}_{k,n}\in\R^{M}$ is the model input vector at agent $k$ (e.g., a buffer of the last $M$ samples), and $d_{k}(n)$ is the corresponding desired response. For simplicity, we assume that $d_{k}(n)$ is a scalar. For the rest of the chapter, we shall use the subscript $k$ to denote a quantity specific to one of the agents.

Following the standard supervised learning approach, the desired input/output relation can be modeled by choosing a function $f$ in some hypothesis space $\mathcal{H}$. For simplicity, in this section we suppose that each function is parameterized by a vector of tunable parameters $\mathbf{w}\in\R^{q}$, e.g., a linear predictor. (Footnote 1: Distributed kernel filters (Section 3.2) are an example of a non-parametric formulation, which is recast as a parametric problem thanks to the representer theorem.) With this setting, each agent is interested in finding a set of parameters $\mathbf{w}^{*}_{k}$ which minimizes some local cost function $J_{k}(\cdot)$ defined over the hypothesis space from streaming data. Specifically, for most estimation problems in practice, these local cost functions are defined as the expectation of some error function $L(\cdot,\cdot)$ with respect to the statistics of the local stream of data:

$$J_{k}(\mathbf{w}_{k})=\mathbb{E}\Bigl\{L\bigl(d_{k},f_{k}(\mathbf{u}_{k})\bigr)\Bigr\}\,, \tag{1}$$

where we use the shorthand $f_{k}(\mathbf{u}_{k})=f(\mathbf{u}_{k};\mathbf{w}_{k})$ . The global optimization problem to be solved at the network level is then given by the sum of the local cost functions:

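$$\underset{\mathbf{w}_{1},\ldots,\mathbf{w}_{N}\in\R^{q}}{\min}\Bigl\{J_{\text{glob}}(\mathbf{w}_{1},\ldots,\mathbf{w}_{N})\Bigr\}=\sum_{k=1}^{N}J_{k}(\mathbf{w}_{k})\,, \tag{2}$$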

where $\mathbf{w}_{k}$ is the estimate at the $k$th agent. If we assume that no relation holds between the local cost functions, then (2) reduces to a set of $N$ optimization problems that can be solved in parallel by every agent, independently of all the others. A more interesting formulation arises by assuming some form of relation among the cost functions (detailed below). In this case, the information gathered by one agent during its optimization process can potentially be used by the other agents to speed up their convergence, or even to converge to a better solution using some shared information.

The difficulty arises from the fact that each agent has direct access to its local cost function, but it has no access to the local cost functions of the other agents for all the reasons mentioned in the introduction. Depending on the relation between cost functions, we can distinguish between three classes of distributed problems:

1. Single-task problems: in this case, $\mathbf{w}^{*}_{k}=\mathbf{w}^{*},\,k=1,\ldots,N$, i.e., all the cost functions have the same minimizer, which must be attained by all agents. This is the scenario which has drawn most attention in the literature, being particularly useful in distributed machine learning problems [7], where it is common to assume that the data of interest are generated by a single underlying distribution.

2. Multitask problems: in this scenario, each local cost function possibly has a different minimizer $\mathbf{w}^{*}_{k}$. In order to make the problem interesting, we assume that these minimizers are ‘similar’ (in some sense to be properly defined) among pairs of neighboring agents, so that communicating can increase their speed of convergence and possibly counter noisy environments. (Footnote 2: Some readers might recognize that in the machine learning literature, the term ‘multitask learning’ is employed with a slightly different meaning. It refers to the problem of solving several learning tasks defined on the same (or similar) input domain(s) [26, 27]. While the setup and the objectives in the two cases do not perfectly overlap, we speculate that exploring the connections between them is of particular significance, particularly due to the increasing interest in deep neural networks [28].)

3. Clustered multitask problems: in this intermediate case, each agent belongs to one of $T$ different groups (clusters), such that all agents belonging to the same cluster have the same minimizer, and vice versa, as shown in Fig. 1. Clearly, both single-task and multitask problems can be derived as extreme cases of this class of problems, by setting $T=1$ and $T=N$, respectively.

Multitask problems can be further subdivided, depending on whether the similarity between tasks is known a priori, or whether it must be inferred from the data. An example of the former case is the algorithm in [18], while examples of the latter case can be found in [19, 20]. Inferring knowledge about the groups might require the inclusion of some decentralized clustering procedure in the learning process, which is an interesting problem in its own right [29]. For simplicity, in this chapter we will focus on the multitask formulation, but we underline that almost all multitask algorithms can be generalized to handle the clustered multitask case, e.g., see [18].

2.2 Diffusion-based algorithms

In order to describe a family of algorithms to solve the previous problem, we first need to define a model of communication between the different agents. Most of the literature focuses on the case where the communication links form a static, undirected, connected graph $\mathcal{G}$. At each time step, the $k$th agent is allowed to communicate with its set of direct neighbors $\mathcal{N}_{k}$, while it cannot send messages to agents to which it is not directly linked. (Footnote 3: The graph describes all feasible communication links. This is a relatively general formulation, since every multi-hop network can be described with an equivalent single-hop network by considering all possible paths as a direct link in the corresponding graph.) Nonetheless, the overall connectedness of the graph ensures that information can flow throughout the entire network. In this scenario, connectivity can be described by a symmetric, real-valued matrix $\mathbf{A}\in\R^{N\times N}$ such that $A_{kl}\neq 0$ only if agents $k$ and $l$ are connected, and:

$$\sum_{k=1}^{N}A_{kl}=1,\quad A_{kl}\geq 0\quad\text{for any }k,l=1,\ldots,N\,. \tag{3}$$

These weights are used by the agents to scale and combine information received from their neighbors. The previous condition ensures that each row of the matrix defines a convex combination, so that the range of the information to be combined is always preserved (formally, the condition requires that $\mathbf{A}$ be left stochastic). There are several strategies allowing agents to build such matrices, as we show later. This formulation can also be extended in several ways, most notably with the use of asynchronous formulations [30] (thus avoiding the need for a common clock throughout the network), and mixing matrices that do not respect double stochasticity [31]. Nonetheless, since this chapter is only intended as an introduction, we will focus on the simpler case detailed before.
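As an illustration of one way such a matrix could be built, the sketch below uses the Metropolis rule, a common choice in the consensus literature; it is an assumption made for this example and not necessarily the strategy adopted later in the chapter. The function name `metropolis_weights` and the four-agent line graph are hypothetical. Each agent only needs its own degree and the degrees of its neighbors, so the rule can be evaluated locally.

```python
def metropolis_weights(neighbors):
    """Build a symmetric combination matrix from an undirected graph.

    neighbors[k] is the set of agents linked to agent k, excluding k itself.
    The Metropolis rule is an assumption of this sketch.
    """
    n = len(neighbors)
    A = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in neighbors[k]:
            # The weight of a link depends only on the degrees of its endpoints.
            A[k][l] = 1.0 / (1 + max(len(neighbors[k]), len(neighbors[l])))
        # The self-weight absorbs the remainder so that the row sums to one.
        A[k][k] = 1.0 - sum(A[k])
    return A


# Hypothetical 4-agent line graph: 0 - 1 - 2 - 3.
neighbors = [{1}, {0, 2}, {1, 3}, {2}]
A = metropolis_weights(neighbors)
column_sums = [sum(A[k][l] for k in range(4)) for l in range(4)]
```

Because the resulting matrix is symmetric, its columns sum to one as well as its rows, so it satisfies the stochasticity condition stated above with nonnegative entries.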

In the case of a single agent, (2) could be solved by a simple gradient descent algorithm. In order to counteract the lack of global information, the basic idea of diffusion algorithms is to interleave local optimization steps with communication steps, where each agent combines its own estimate with those of its neighbors. In the filtering literature, this strategy is generally denoted as adapt-then-combine (ATC):

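$$\begin{aligned}\bm{\phi}_{k,n}&=\mathbf{w}_{k,n-1}-\mu_{k}\nabla J_{k}(\mathbf{w}_{k,n-1})\,,\\ \mathbf{w}_{k,n}&=\sum_{l\in\mathcal{N}_{k}}A_{lk}\,\bm{\phi}_{l,n}\,,\end{aligned} \tag{4}$$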

where $\mu_{k}$ is a (possibly time-dependent) step size and, similarly to before, we use a double subscript $(k,n)$ to denote the estimate of agent $k$ at time $n$. Practically, the gradient term in (4) can be substituted with a noisy version, for example using an instantaneous approximation computed from the current data sample. An example of a diffusion step is given in Fig. 2.
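To make the two phases concrete, here is a minimal sketch of one adapt-then-combine iteration: every agent first takes a local gradient step, then averages the intermediate estimates of its neighbors using the combination weights. The toy setup (three fully connected agents with uniform weights, a shared linear model `w_true`, and an instantaneous gradient of half the squared error) is an assumption chosen only for illustration.

```python
import random


def atc_step(w, grad, A, mu):
    """One adapt-then-combine iteration.

    w[k] is agent k's current estimate (a list of floats), grad(k, w_k)
    returns a possibly noisy gradient of J_k at w_k, and A[l][k] is the
    weight that agent k assigns to agent l (zero if they are not linked).
    """
    n = len(w)
    # Adapt: every agent takes a local gradient step.
    phi = []
    for k in range(n):
        g = grad(k, w[k])
        phi.append([wi - mu * gi for wi, gi in zip(w[k], g)])
    # Combine: every agent mixes the intermediate estimates it receives.
    return [
        [sum(A[l][k] * phi[l][i] for l in range(n)) for i in range(len(phi[k]))]
        for k in range(n)
    ]


rng = random.Random(0)
w_true = [1.0, -2.0]                      # hypothetical common model
A = [[1.0 / 3.0] * 3 for _ in range(3)]   # fully connected, uniform weights


def lms_gradient(k, w_k):
    # Instantaneous gradient of 0.5 * (d - w^T u)^2 from one fresh sample.
    u = [rng.gauss(0.0, 1.0) for _ in w_true]
    d = sum(wt * ui for wt, ui in zip(w_true, u)) + rng.gauss(0.0, 0.1)
    err = d - sum(wi * ui for wi, ui in zip(w_k, u))
    return [-err * ui for ui in u]


w = [[0.0, 0.0] for _ in range(3)]
for _ in range(2000):
    w = atc_step(w, lms_gradient, A, mu=0.05)
```

With a linear predictor and this gradient, the recursion reduces to a diffusion LMS filter; other cost functions only change the `grad` callback.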

These algorithms are particularly suitable in single-task scenarios, where their convergence properties in the convex case have been analyzed extensively [14, 15]. Interestingly, they can still have good convergence properties in the multitask case, as shown in [19]. More importantly, they provide a building block for all other formulations, as we show in the next subsection. Before that, we describe an example of a diffusion algorithm for nonlinear learning with a distributed logistic regression model.

2.2.1 An example: logistic regression over networks

Consider a binary classification problem where $d_{k}(n)\in\left\{0,1\right\}$ , i.e., each output is a single bit representing whether the corresponding input belongs to a certain class or not. We approximate the underlying relation using a logistic predictor $f(\mathbf{u})=\sigma(\mathbf{w}^{T}\mathbf{u})$ , where $\sigma(\cdot)$ is the sigmoid function:

$$\sigma(s)=\frac{1}{1+\exp\{-s\}}\,,$$

ensuring that the outputs of the model are properly scaled as valid probabilities. Each agent wishes to minimize the (regularized) expected cross-entropy over its stream of data:

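$$J_{k}(\mathbf{w}_{k})=\mathbb{E}\Bigl\{-d_{k}\cdot\log\bigl(f_{k}(\mathbf{u}_{k})\bigr)-(1-d_{k})\cdot\log\bigl(1-f_{k}(\mathbf{u}_{k})\bigr)\Bigr\}+\frac{\lambda}{2N}\left\lVert\mathbf{w}_{k}\right\rVert_{2}^{2}\,,$$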

where the factor $1/N$ in the regularization term ensures that the total penalization in (2), when summed over all agents, is equal to $\frac{\lambda}{2}$. By taking instantaneous approximations to the gradient, simple algebraic manipulations show that the adaptation step in (4) is given by:

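$$\bm{\phi}_{k,n}=\mathbf{w}_{k,n-1}+\mu_{k}\bigl(d_{k}(n)-f_{k,n-1}(\mathbf{u}_{k,n})\bigr)\mathbf{u}_{k,n}-\frac{1}{N}\mu_{k}\lambda\mathbf{w}_{k,n-1}\,.$$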

In order to show the speedup obtained by such a procedure, in Fig. 3 we plot the average accuracy (see below) obtained with a network of $20$ agents, whose connectivity is generated randomly, and where at every iteration each agent receives a randomly chosen example taken from the well-known Wisconsin Breast Cancer Database (WBCD). (Footnote 4: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) The accuracy is defined as $1$ if the agent makes a correct prediction (i.e., the sign of $f(\mathbf{u})$ agrees with $d$), and $0$ otherwise. It is computed before the adaptation step, and it is averaged with respect to the different agents and the different simulations.
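The logistic update can be sketched in a few lines. The example below does not reproduce the WBCD experiment: it uses synthetic, linearly separable data, a hypothetical ring of four agents with uniform weights of one third, and a decision threshold of 0.5 on the sigmoid output, all of which are assumptions made for illustration.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def predict(w_k, u):
    return sigmoid(sum(wi * ui for wi, ui in zip(w_k, u)))


def adapt(w_k, u, d, mu, lam, n_agents):
    # Local step: cross-entropy error term plus the scaled weight decay.
    err = d - predict(w_k, u)
    return [wi + mu * err * ui - mu * lam * wi / n_agents
            for wi, ui in zip(w_k, u)]


def combine(phi, A):
    # Each agent k mixes the intermediate estimates with weights A[l][k].
    n = len(phi)
    return [[sum(A[l][k] * phi[l][i] for l in range(n))
             for i in range(len(phi[k]))] for k in range(n)]


def draw_sample(rng):
    # Synthetic stand-in for the data stream; the last input is a bias term.
    x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    return [x1, x2, 1.0], (1 if x1 + x2 > 0 else 0)


rng = random.Random(0)
N = 4
t = 1.0 / 3.0
A = [[t, t, 0.0, t],      # ring 0 - 1 - 2 - 3 - 0 with self-loops
     [t, t, t, 0.0],
     [0.0, t, t, t],
     [t, 0.0, t, t]]
w = [[0.0, 0.0, 0.0] for _ in range(N)]
for _ in range(3000):
    phi = []
    for k in range(N):
        u, d = draw_sample(rng)
        phi.append(adapt(w[k], u, d, mu=0.1, lam=0.01, n_agents=N))
    w = combine(phi, A)

test_set = [draw_sample(rng) for _ in range(500)]
accuracy = sum((predict(w[0], u) > 0.5) == (d == 1)
               for u, d in test_set) / len(test_set)
```

Here `accuracy` is measured for a single agent on fresh samples after training, which is a simpler protocol than the running average described in the text.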

2.3 Extension to multitask learning

The general ideas presented in the previous section can be easily extended in order to be more efficient in the multitask scenario. Here, we detail a simple extension originally proposed in [18] to show one example of such extensions. We refer to the large number of works cited in the introduction for more recent proposals.

Suppose that, given two agents $k$ and $l$ , we have a way to quantify the similarity among their respective minimizers. In order to leverage this information, we can augment the original cost function in (2) with a regularization term forcing the similarity among minimizers, in terms of their Euclidean distance:

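$$J^{\text{glob}}(\mathbf{w}_{1},\ldots,\mathbf{w}_{N})=\sum_{k=1}^{N}J_{k}(\mathbf{w}_{k})+\eta\sum_{k=1}^{N}\sum_{l\neq k,\,l\in\mathcal{N}_{k}}\rho_{kl}\left\lVert\mathbf{w}_{k}-\mathbf{w}_{l}\right\rVert_{2}^{2}\,,$$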

where $\eta$ is a regularization factor, and the nonnegative coefficients $\rho_{kl}\geq 0$ quantify our a priori knowledge about the similarities. We assume that, for each agent, the weights sum to one and vanish outside its neighborhood:

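$$\sum_{l=1}^{N}\rho_{kl}=1,\quad\text{and}\quad\rho_{kl}=0\ \text{if}\ l\notin\mathcal{N}_{k},\quad\forall k\in\{1,\ldots,N\}\,.$$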

As a consequence of their definition, these regularization factors mirror the communication network of the agents. Due to this, taking a local optimization step with respect to the estimate of the $k$ th agent immediately requires a diffusion step:

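$$\mathbf{w}_{k,n}=\mathbf{w}_{k,n-1}-\mu_{k}\nabla J_{k}(\mathbf{w}_{k,n-1})-\mu_{k}\eta\sum_{l\neq k,\,l\in\mathcal{N}_{k}}\frac{\rho_{kl}+\rho_{lk}}{2}\bigl(\mathbf{w}_{k,n-1}-\mathbf{w}_{l,n-1}\bigr)\,,$$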

Since the estimates are already exchanged, the previous update step can be implemented without the need for additional combination steps like in the previous section. As an example, by considering a simple linear predictor $f_{k}(\mathbf{u})=\mathbf{w}_{k}^{T}\mathbf{u}$ and instantaneous approximations to the gradient, we obtain the update rule for the multitask diffusion LMS presented in [18]:

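$$\mathbf{w}_{k,n}=\mathbf{w}_{k,n-1}+\mu_{k}\bigl(d_{k}(n)-\mathbf{w}_{k,n-1}^{T}\mathbf{u}_{k,n}\bigr)\mathbf{u}_{k,n}-\mu_{k}\eta\sum_{l\neq k,\,l\in\mathcal{N}_{k}}\frac{\rho_{kl}+\rho_{lk}}{2}\bigl(\mathbf{w}_{k,n-1}-\mathbf{w}_{l,n-1}\bigr)\,.$$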

If we assume that the mixing weights are symmetric, $\frac{\left(\rho_{kl}+\rho_{lk}\right)}{2}$ simplifies to $\rho_{kl}$ . It is possible to obtain asymmetric regularization terms by considering a game-theoretical formulation of the optimization problem, see the discussion in [18].
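The effect of the coupling term can be seen in a small sketch of the multitask LMS update: each agent takes an LMS step towards its own data and is then pulled towards its neighbors' previous estimates with strength proportional to $\eta$. The two-agent network, the optima `[1.0, 0.0]` and `[1.2, 0.0]`, the noise level, and the step size are hypothetical values chosen for illustration.

```python
import math
import random


def multitask_lms_step(w, samples, neighbors, rho, mu, eta):
    """samples[k] = (u, d) for agent k; all agents update from the old w."""
    new_w = []
    for k, (u, d) in enumerate(samples):
        err = d - sum(wi * ui for wi, ui in zip(w[k], u))
        # Local LMS step on the agent's own sample.
        w_k = [wi + mu * err * ui for wi, ui in zip(w[k], u)]
        # Regularization step pulling agent k towards each neighbor l.
        for l in neighbors[k]:
            c = mu * eta * (rho[k][l] + rho[l][k]) / 2.0
            w_k = [wi - c * (wk_old - wl_old)
                   for wi, wk_old, wl_old in zip(w_k, w[k], w[l])]
        new_w.append(w_k)
    return new_w


def run(eta, n_iter=5000, seed=1):
    rng = random.Random(seed)
    optima = [[1.0, 0.0], [1.2, 0.0]]   # similar but distinct tasks
    neighbors = [[1], [0]]
    rho = [[0.0, 1.0], [1.0, 0.0]]      # each row sums to one
    w = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n_iter):
        samples = []
        for w_opt in optima:
            u = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
            d = sum(a * b for a, b in zip(w_opt, u)) + rng.gauss(0.0, 0.05)
            samples.append((u, d))
        w = multitask_lms_step(w, samples, neighbors, rho, mu=0.02, eta=eta)
    return w


def gap(w):
    # Euclidean distance between the two agents' estimates.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w[0], w[1])))


gap_independent = gap(run(eta=0.0))   # no cooperation
gap_coupled = gap(run(eta=1.0))       # estimates pulled towards each other
```

With $\eta=0$ the agents behave as independent LMS filters; increasing $\eta$ shrinks the distance between their estimates, trading bias with respect to each local optimum for shared information.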

3 Existing approaches to nonlinear distributed filtering

In this section we describe three approaches to extend the previous formulation to nonlinear models in an efficient way. We underline that all these algorithms have been devised mostly for the case of single-task networks. Further extensions to the multitask scenario are the topic of the next section.

3.1 Expansion over random bases

One immediate idea is to project the original input vector $\mathbf{u}$ to a high-dimensional space via some fixed function $\mathbf{h}(\mathbf{u}):\R^{M}\rightarrow\R^{B}$ before using it with a linear predictor, where $M$ is the dimensionality of the input vector $\mathbf{u}$, and $B$ is a parameter which (in general) can be chosen by the user. In the distributed case, this requires only a small communication overhead in the beginning for the agents to agree on a specific projection function. Any distributed linear algorithm, such as the diffusion LMS or the diffusion RLS, can then be used.

Generally speaking, deterministic mappings (such as those mentioned in Chapters 2 and 3) are not efficient, because their size might grow exponentially with respect to $M$ . A different idea is to use basis functions whose parameters are assigned stochastically, e.g. a parameterized sigmoid:

$$h_{i}(\mathbf{u})=\frac{1}{1+\exp\{-\mathbf{a}_{i}^{T}\mathbf{u}-b_{i}\}}\,,$$

where $\mathbf{a}_{i}$ and $b_{i}$ might be extracted randomly from some uniform distribution (whose range is generally chosen in order to provide a good accuracy, see the discussion in [32]). Interestingly, it is possible to show that the resulting estimator (called a random vector functional-link (RVFL) network) is a universal approximator for continuous functions on compact sets, provided that $B$ is chosen large enough [33]. It is also possible to interpret it as a degenerate case of the echo state network described in Chapter 12, where connections between different nodes have been removed. The idea of using RVFL networks in a distributed context was proposed in [34] for the batch case, and in [35] for the online case with DF algorithms.
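A random sigmoid expansion of this kind could be sketched as follows. Sharing a random seed is one hypothetical way for the agents to agree on the same projection with a single small message; the helper name `make_random_sigmoid_map` and the sampling range `scale` are assumptions of this example.

```python
import math
import random


def make_random_sigmoid_map(input_dim, n_features, seed, scale=1.0):
    """Return h(u), a fixed random expansion with sigmoid basis functions.

    The parameters a_i and b_i are drawn uniformly in [-scale, scale];
    agents that share the same seed obtain the same mapping.
    """
    rng = random.Random(seed)
    a = [[rng.uniform(-scale, scale) for _ in range(input_dim)]
         for _ in range(n_features)]
    b = [rng.uniform(-scale, scale) for _ in range(n_features)]

    def h(u):
        return [1.0 / (1.0 + math.exp(-(sum(ai * ui for ai, ui in zip(a_i, u)) + b_i)))
                for a_i, b_i in zip(a, b)]

    return h


# Two agents build their mapping independently from the agreed seed.
h_agent_0 = make_random_sigmoid_map(input_dim=3, n_features=10, seed=42)
h_agent_1 = make_random_sigmoid_map(input_dim=3, n_features=10, seed=42)
features_0 = h_agent_0([0.5, -1.0, 2.0])
features_1 = h_agent_1([0.5, -1.0, 2.0])
```

The expanded vector then replaces the raw input in any linear diffusion filter, so only the $B$ output weights need to be adapted and exchanged.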

A different approach is to design a feature mapping $\mathbf{h}(\cdot)$ approximating a specific kernel function $\mathcal{K}(\cdot,\cdot)$ chosen by the user (Footnote 5: We refer to Chapters 6-8 for introductory material on kernel filters.):

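$$\mathcal{K}(\mathbf{u}_{1},\mathbf{u}_{2})\approx\langle\mathbf{h}(\mathbf{u}_{1}),\mathbf{h}(\mathbf{u}_{2})\rangle\,.$$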

This idea was popularized by [36] for approximating shift-invariant kernels (e.g., the Gaussian kernel) in large-scale applications of kernel methods. In particular, it is possible to show that this class of kernels can be easily approximated with very simple stochastic mappings. [37] was the first to apply this idea explicitly to kernel filters, and similar algorithms were independently reintroduced in [38]. Since Chapter 8 is entirely devoted to this idea, we will not go further into it. We refer the interested reader to [32] for a recent overview on random feature methods.
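As a concrete instance of such a stochastic mapping, the sketch below uses the well-known random Fourier feature construction for the Gaussian kernel: frequencies are drawn from a Gaussian distribution with standard deviation equal to the inverse bandwidth, and phases are drawn uniformly. Whether this matches the exact mapping used in the cited works is an assumption; the function names and test points are hypothetical.

```python
import math
import random


def make_rff_map(input_dim, n_features, bandwidth, seed):
    """Random Fourier features for exp(-||u1 - u2||^2 / (2 * bandwidth^2))."""
    rng = random.Random(seed)
    omega = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(input_dim)]
             for _ in range(n_features)]
    phase = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def h(u):
        return [scale * math.cos(sum(o * ui for o, ui in zip(om, u)) + p)
                for om, p in zip(omega, phase)]

    return h


def gaussian_kernel(u1, u2, bandwidth):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u1, u2))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


h = make_rff_map(input_dim=2, n_features=4000, bandwidth=1.0, seed=0)
u1, u2 = [0.3, -0.5], [0.1, 0.4]
# The inner product of the features approximates the kernel value.
approx = sum(a * b for a, b in zip(h(u1), h(u2)))
exact = gaussian_kernel(u1, u2, bandwidth=1.0)
```

The approximation error shrinks roughly as the inverse square root of the number of features, so the feature count trades accuracy against the size of the vectors that agents must adapt and exchange.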

3.2 Distributed kernel filters

An alternative line of research is devoted to distributed strategies for kernel filters, working directly on some reproducing kernel Hilbert space (RKHS), instead of approximating the kernel function as in the previous section. As we stated in the introduction, several distributed algorithms for kernel ridge regression were devised in the context of WSNs [4], followed by algorithms for the distributed optimization of SVMs [39, 7]. Any approach to dealing with kernels faces the challenge of working with a kernel-based model that depends explicitly on all the data in the training set. A naïve distributed implementation would thus require exchanging all the local datasets between the agents, which can become infeasible.

In an online context, this is made worse by the growing nature of the kernel model [40]. This is a fundamental drawback underlying any kernel filter algorithm [41, 42, 43]. An initial investigation in developing a fully distributed version of the kernel LMS (KLMS) was made in [44], where the basic idea is to consider diffusion algorithms directly in a functional form. In particular, let us assume that the data received by the $k$th agent satisfies a model of the form:

$$d_{k}(n)=\psi_{k}^{o}(\mathbf{u}_{k,n})+\nu_{k}(n)\,,$$

where $\psi_{k}^{o}$ belongs to an RKHS $\mathcal{H}$, while $\nu_{k}(n)$ is a zero-mean white noise with variance $\sigma_{k}^{2}$. Restricting our attention to a generic single-task network, we have:

$$\psi_{k}^{o}=\psi^{o}\quad\forall k\in\{1,\ldots,N\}\,.$$

Considering the classical squared error function, the gradient of the local cost functions can now be computed in terms of their Fréchet derivatives. (Footnote 6: A functional derivative is needed because the dimensionality of $\mathcal{H}$ can be infinite. See [45] for an introduction to Fréchet derivatives in the context of kernel methods, and [46] for an introductory textbook on functional analysis.) The gradient is given by:

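$$\nabla J_{k}(\psi_{k})=-2\,\mathbb{E}\Bigl\{\bigl(d_{k}-\psi_{k}(\mathbf{u}_{k})\bigr)\kappa(\cdot,\mathbf{u}_{k})\Bigr\}\,,$$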

where $\kappa$ is the kernel function associated with the RKHS. Considering instantaneous approximations for the expectation as was done earlier, we arrive at a functional equivalent of the ATC diffusion framework:

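$$\begin{aligned}\delta_{k,n}&=\psi_{k,n-1}+\mu_{k}\bigl(d_{k}(n)-\psi_{k,n-1}(\mathbf{u}_{k,n})\bigr)\kappa(\cdot,\mathbf{u}_{k,n})\,,\\ \psi_{k,n}&=\sum_{l\in\mathcal{N}_{k}}A_{lk}\,\delta_{l,n}\,.\end{aligned}$$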

Although this formulation is extremely general, one still has to solve the problem of the growing structure of the kernel functions. The idea pursued in [44] is to assume some shared dictionary $\mathcal{D}$ among nodes, whose selection is (at the moment) an open research question. Using this approximation, we can rewrite the desired function as:

[TABLE]

where $\mathbf{k}_{k,n}$ is the vector of kernel values computed between the current input vector $\mathbf{u}_{k,n}$ and the shared dictionary $\mathcal{D}$. Each function $\psi_{k,n}$ is now parameterized by the set of linear coefficients $\bm{\beta}_{k,n}$. Note that we consider a simplified formulation with respect to [44], where two combination steps are used, and that we use the same symbol $\bm{\delta}_{k}$ for the result of the adaptation step as in (16), but in boldface to underline that it is now a vector-valued quantity. With these conventions, the previous algorithm can be rewritten as:

[TABLE]
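To make the dictionary-based recursion concrete, the following is a minimal NumPy sketch of one adapt-then-combine iteration. It assumes the usual ATC form: each agent first takes an LMS-like step on its coefficient vector using the kernel vector $\mathbf{k}_{k,n}$, and then averages the intermediate vectors $\bm{\delta}_{l}$ of its neighbors with the mixing weights $A_{lk}$. The function names, the Gaussian kernel, and the array layout are illustrative assumptions, not the formulation of [44].

```python
import numpy as np

def gaussian_kernel_vector(u, dictionary, gamma):
    """Kernel values between one input u (d,) and every dictionary row (M, d)."""
    sq_dist = np.sum((dictionary - u) ** 2, axis=1)
    return np.exp(-gamma * sq_dist)

def diffusion_klms_step(beta, inputs, desired, dictionary, A, mu, gamma):
    """One ATC iteration for all N agents (illustrative sketch).

    beta:    (N, M) current coefficients, one row per agent
    inputs:  (N, d) current input vector of each agent
    desired: (N,)   current desired output of each agent
    A:       (N, N) mixing matrix, A[l, k] = weight agent k gives to agent l
    mu:      (N,)   local step sizes
    """
    n_agents = beta.shape[0]
    delta = np.empty_like(beta)
    for k in range(n_agents):
        k_vec = gaussian_kernel_vector(inputs[k], dictionary, gamma)
        error = desired[k] - beta[k] @ k_vec          # instantaneous local error
        delta[k] = beta[k] + mu[k] * error * k_vec    # adaptation step
    # Combination step: beta_k = sum_l A[l, k] * delta_l
    return A.T @ delta
```

Because every agent stores only an $M$-dimensional vector, the amount of data exchanged per iteration is fixed by the dictionary size and does not grow with time.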

The idea of preselecting a dictionary is not new in the kernel literature. In fact, one of the earliest algorithms for distributed SVMs [39] exploited a similar idea, which is termed semi-parametric SVM. In [47], a fixed dictionary is used to analyze the convergence behavior of the KLMS algorithm. A derivation of the functional diffusion KLMS algorithm when removing the fixed dictionary constraint is given in [48]. Another extension is presented in [49], where a set of consensus constraints is included in the problem to ensure convergence and speed up the algorithm.

3.3 Diffusion spline filters

Another possibility for nonlinear learning over networks is given by considering spline adaptive filters (SAFs), which are the topic of Chapter 5. The idea of using SAFs in a distributed environment was recently introduced in [50]. In the following we briefly describe the distributed algorithm. The interested readers can refer to [51, 52] for introductory material on the SAF model.

Let us assume that the data are generated according to a restricted Wiener model given by:

[TABLE]

where $f_{k}^{o}$ is any smooth nonlinear function, and $\nu_{k}(n)$ is a noise term. A SAF mimics this architecture, where the nonlinear term is approximated via spline interpolation over a set of $Q$ fixed control points that are adapted during learning. Hence, in a distributed scenario each agent has estimates of the local part of the filter, $\mathbf{w}_{k,n}$, and of the aforementioned control points, $\mathbf{q}_{k,n}$, as shown schematically in Fig. 4.

Consider a single-task scenario, where each agent tries to minimize the expected squared loss. Following [50], we consider a combine-then-adapt (CTA) scheme, where the combination is performed before the adaptation. The two steps are applied to both sets of parameters $\mathbf{w}_{k,n}$ and $\mathbf{q}_{k,n}$ simultaneously. However, by exploiting the SAF structure we can avoid exchanging the full weight vector $\mathbf{q}_{k,n}$, as described in the following.

The combination step starts with the agents exchanging their current estimates of the linear weights, as:

[TABLE]

The new weights are used to compute the output of the linear part of the filter, denoted as $s_{k}(n)=\bm{\psi}_{k,n-1}^{T}\mathbf{u}_{k,n}$. We use $i$ to denote the index of the closest control point to $s_{k}(n)$ in our set of fixed control points. As described in Chapter 5, the final output of a Wiener SAF depends only on the $i$th control point and its $P$ right neighbors, with $P$ being the order of interpolation. Let us denote by $\mathbf{q}_{i,k,n-1}$ the set of such ‘active’ control points for agent $k$, which are called the ‘span’ of the filter (see Chapter 5 for more details). We use a third subscript to denote dependence with respect to the span. The second combination step is performed only with respect to the current span:

[TABLE]

In the case of cubic interpolation, each $\bm{\xi}_{k,n-1}$ has dimensionality $4$, making its exchange extremely efficient, with only a fixed overhead with respect to a classical diffusion LMS. Practically, every agent sends its current span index $i$ to its neighbors, and receives back the vectors $\mathbf{q}_{i,l,n-1}$. For simplicity, the mixing weights $A_{lk}$ in the two diffusion steps are assumed identical.
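The span-restricted combination can be sketched in a few lines of NumPy. The snippet below assumes that the control points of all agents are stored as the rows of a single array (a simulation convenience; in a real network each agent would only receive the requested slices from its neighbors), and the function name `combine_span` is hypothetical.

```python
import numpy as np

def combine_span(control_points, A, k, i, order=3):
    """Combine only the active span of agent k (illustrative sketch).

    control_points: (N, Q) array, row l holds the control points of agent l
    A:              (N, N) mixing matrix, A[l, k] = weight agent k gives to
                    agent l (zero when l is not a neighbor of k)
    i:              span index computed by agent k from its linear output
    order:          interpolation order P (3 for cubic splines)
    """
    # Each neighbor only needs to send these order + 1 values.
    spans = control_points[:, i:i + order + 1]
    # xi_k = sum_l A[l, k] * q_{i, l}
    return A[:, k] @ spans
```

With cubic interpolation the returned vector has four entries regardless of the total number $Q$ of control points, which is the fixed overhead mentioned above.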

Next, we proceed to the adaptation step. The complete SAF output given the new span is obtained as (again following the general rules of Chapter 5):

[TABLE]

where the vector $\mathbf{u}$ is constructed by taking powers up to a fixed order of the normalized value $\frac{s_{k}(n)}{\Delta x}-\left\lfloor\frac{s_{k}(n)}{\Delta x}\right\rfloor$, and $\Delta x$ is the sampling precision of the spline. $\mathbf{B}$ is the spline matrix, e.g., the Catmull-Rom (CR) spline given by:

[TABLE]
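As an illustration of how the spline output is evaluated, the sketch below uses the standard Catmull-Rom basis matrix for cubic interpolation. The indexing convention is an assumption made for this example only: the control-point abscissas are taken as $x_{j}=(j-(Q-1)/2)\Delta x$ with $Q$ odd, and the span is chosen so that $s$ falls between its second and third entries. Chapter 5 may use a different but equivalent convention, and inputs outside the grid are simply clipped to the outermost span.

```python
import numpy as np

# Standard Catmull-Rom basis matrix (cubic interpolation).
B_CR = 0.5 * np.array([
    [-1.0,  3.0, -3.0,  1.0],
    [ 2.0, -5.0,  4.0, -1.0],
    [-1.0,  0.0,  1.0,  0.0],
    [ 0.0,  2.0,  0.0,  0.0],
])

def spline_output(s, q, delta_x):
    """Evaluate the spline nonlinearity at s (illustrative sketch).

    s:       output of the linear part of the filter
    q:       (Q,) control-point values, Q odd, abscissas centred at zero
    delta_x: spacing between consecutive control-point abscissas
    Returns the spline output and the span index i that was used.
    """
    n_points = q.shape[0]
    cell = np.floor(s / delta_x)
    u = s / delta_x - cell                        # normalized value in [0, 1)
    i = int(cell) + (n_points - 1) // 2 - 1       # assumed span convention
    i = min(max(i, 0), n_points - 4)              # clip to a valid span
    u_vec = np.array([u ** 3, u ** 2, u, 1.0])    # powers of u
    return u_vec @ B_CR @ q[i:i + 4], i

# Identity initialization: control-point values equal to their abscissas.
delta_x = 0.2
q_identity = (np.arange(21) - 10) * delta_x
y, span_index = spline_output(0.37, q_identity, delta_x)
```

With the identity initialization the spline reproduces its input inside the grid, since Catmull-Rom interpolation is exact on linear data.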

Adaptation is made by performing two parallel gradient descent steps:

[TABLE]

where $e_{k,n}$ is the instantaneous local error, and $\varphi^{\prime}(s_{k}(n))$ is the spline derivative with respect to the linear weights. Note that the diffusion LMS can be obtained as a special case, where each node initializes its nonlinearity as the identity, and the step size of the nonlinear part is set to zero.

4 A distributed kernel filter for multitask problems

As we saw in the previous section, several ideas have been proposed to model nonlinear systems in a distributed fashion, but almost none of them is framed for the multitask scenario. As a first step towards this line of research, in this section we briefly combine some of the previous ideas to devise an efficient kernel-based diffusion algorithm for multitask networks. In a nutshell, we combine the multitask diffusion LMS presented in Section 2.3 with the functional diffusion KLMS of Section 3.2. To this end, consider again the data model in (14), where we assumed that all the minimizers are the same across the agents. More generally, we can consider the case where two functions $\psi_{k}^{o}$ and $\psi_{l}^{o}$ are assumed to be ‘close’ in the sense of the norm $\left\lVert\cdot\right\rVert_{\mathcal{H}}$ of the RKHS, whenever the corresponding agents are spatial neighbors:

[TABLE]

where $\mathcal{N}_{k}$ denotes the set of neighbors of $k$, and $\sim$ denotes similarity. To recover the unknown functions, and leveraging the basic idea described in Section 2.3, we aim at minimizing the following global cost function in a decentralized fashion:

[TABLE]

where $\eta>0$ is a regularization factor, and the nonnegative coefficients $\rho_{kl}\geq 0$ weight the similarity between different functions. Once again, we assume that, for each agent, the weights are positive and sum to one:

[TABLE]

Thus, each agent is interested in minimizing the local expected mean-squared error, under suitable proximity constraints on its function and the functions of its neighbors. The previous problem decomposes as a sum of local cost functions defined as:

[TABLE]

Each local cost function is independent of the estimates of agents that are not in its immediate neighborhood. Taking the Fréchet derivative of (31) gives us:

[TABLE]

where $\kappa(\cdot,\cdot)$ is the reproducing kernel associated with $\mathcal{H}$. For simplicity, we assume that the mixing weights $\rho_{kl}$ are symmetrical (see the discussion at the end of Section 2.3). Making an instantaneous approximation for the expectation gives us the following local update rule in functional form at time instant $n$:

[TABLE]

where the factor $2$ has been included in the step size $\mu_{k}$. If $\mathcal{H}$ is taken as the space of linear predictors over $\mathbf{u}_{k,n}$, then (33) reduces to the diffusion LMS for multitask networks presented earlier. In order to have a feasible implementation, once again we assume a shared dictionary $\mathcal{D}$ among agents, so that (33) reduces to:

[TABLE]
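A minimal NumPy sketch of the resulting coefficient update is given below. It assumes that, in coefficient form, the recursion consists of an LMS-like data term plus a regularization term that pulls $\bm{\beta}_{k}$ towards the coefficients of its neighbors, scaled by $\eta\mu_{k}\rho_{kl}$. The exact scaling constants are an assumption of this example, and the function name is hypothetical.

```python
import numpy as np

def multitask_dklms_step(beta, kernel_vecs, desired, rho, mu, eta):
    """One multitask update for all N agents (illustrative sketch).

    beta:        (N, M) current coefficients, one row per agent
    kernel_vecs: (N, M) kernel vector of each agent's input vs. the dictionary
    desired:     (N,)   current desired outputs
    rho:         (N, N) similarity weights, rho[k, l] = 0 for non-neighbors
    mu:          (N,)   local step sizes
    eta:         regularization factor (eta > 0)
    """
    # Instantaneous local errors e_k = d_k - beta_k^T k_k
    errors = desired - np.sum(beta * kernel_vecs, axis=1)
    data_term = (mu * errors)[:, None] * kernel_vecs
    # Proximity term: sum_l rho[k, l] * (beta_l - beta_k)
    pull = rho @ beta - rho.sum(axis=1, keepdims=True) * beta
    return beta + data_term + eta * mu[:, None] * pull
```

Setting `eta` to zero removes the coupling between agents, so that each agent runs an independent dictionary-based KLMS; larger values push neighboring estimates closer together.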

5 Experimental evaluation

5.1 Experiment setup

In this section, we evaluate the proposed method on a simulated multitask nonlinear problem. The output at each agent is given by the following equation:

[TABLE]

which is composed of a common nonlinear part $f(\cdot)$, a local linear part $\mathbf{w}_{k}^{T}\mathbf{u}_{k,n}$, and a local noise of variance $\sigma_{k}^{2}$. In particular, we considered a three-dimensional input vector $\mathbf{u}=\left[u_{1},u_{2},u_{3}\right]^{T}$, with the following nonlinearity:

[TABLE]

where $a$ and $b$ were generated from a normal distribution, similarly to the local coefficient vectors $\mathbf{w}_{k}$. Noise variances were generated uniformly for each agent in the interval $\left[0,0.3\right]$. We added an additional level of diversity over the network by randomly assigning the learning rates to the agents from the uniform distribution over the interval $\left[0,0.1\right]$. We considered a network of $9$ agents, whose connectivity was randomly assigned such that each agent is connected on average to one fifth of the other agents, with the requirement that the overall graph is connected. The resulting network connectivity, an example of desired output, and a plot of the noise variances and learning rates are all shown in Fig. 5.

We trained the network over a sequence of $1000$ time instants, with white Gaussian inputs with zero mean and unitary variance. The mixing matrix was chosen according to the max-degree heuristic:

[TABLE]

where $\text{deg}_{k}$ is the degree of node $k$, and $\text{deg}_{\max}$ is the maximum degree of the network (the degree of a node is the cardinality of the set of its direct neighbors). Each experiment was averaged over $500$ different runs, by keeping fixed the assignments shown in Fig. 5.
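For reference, the max-degree rule can be sketched as follows. The snippet assumes the commonly used form of the heuristic, in which every edge receives the weight $1/(\text{deg}_{\max}+1)$ and each diagonal entry is set to $1-\text{deg}_{k}/(\text{deg}_{\max}+1)$, and it expects a symmetric adjacency matrix with a zero diagonal. The function name is hypothetical.

```python
import numpy as np

def max_degree_weights(adjacency):
    """Mixing matrix from the max-degree heuristic (illustrative sketch).

    adjacency: (N, N) symmetric 0/1 matrix with zero diagonal.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    degrees = adjacency.sum(axis=1)
    deg_max = degrees.max()
    # Same weight on every edge of the graph.
    A = adjacency / (deg_max + 1.0)
    # Self-weights chosen so that the weights of every agent sum to one.
    A[np.diag_indices_from(A)] = 1.0 - degrees / (deg_max + 1.0)
    return A
```

Under these assumptions the resulting matrix is symmetric and nonnegative, and each of its rows and columns sums to one.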

5.2 Results and discussion

We compared the performance of a standard diffusion LMS (D-LMS), a multitask D-LMS as described in Section 2.3 (D-MT-LMS), the diffusion KLMS described in Section 3.2, and the proposed multitask D-KLMS introduced in Section 4 (D-MT-KLMS). For the kernel algorithms, we used a Gaussian kernel:

[TABLE]

where $\gamma$ was chosen as the inverse of the dimensionality of $\mathbf{u}$, which was found to provide good accuracy. For the multitask algorithms, the regularization coefficients were selected uniformly as:

[TABLE]

while the regularization factor was set to $\eta=0.01$. For the kernel algorithms, we fixed a priori a shared dictionary of size $100$ with randomly extracted elements. The average MSE in dB across all runs is shown in Fig. 6.
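The kernel-related settings above can be reproduced along the following lines. The sketch assumes the parameterization $\kappa(\mathbf{u},\mathbf{v})=\exp(-\gamma\lVert\mathbf{u}-\mathbf{v}\rVert^{2})$ for the Gaussian kernel, and it draws the $100$ fixed elements from the same zero-mean, unit-variance Gaussian distribution as the inputs. Both choices, as well as the random seed, are assumptions of this example.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    """Gram matrix exp(-gamma * ||x - y||^2) between rows of X and rows of Y."""
    sq_dist = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Guard against tiny negative values caused by round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

rng = np.random.default_rng(0)            # assumed seed, for repeatability
dim, n_elements = 3, 100
gamma = 1.0 / dim                         # inverse of the input dimensionality
eta = 0.01                                # regularization factor from the text

fixed_elements = rng.standard_normal((n_elements, dim))
u = rng.standard_normal((1, dim))         # one input sample
k_vec = gaussian_kernel(u, fixed_elements, gamma)[0]
```

Each agent then works with a coefficient vector of the same length as `k_vec`, so the per-iteration cost grows linearly with the number of fixed elements.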

As expected, the D-LMS was the poorest performing algorithm, due to the doubly incorrect assumptions that the agents share the same minimizer, and that the underlying function is linear. By relaxing one of the two assumptions, D-MT-LMS performed better, with an accuracy that is comparable to D-KLMS. Clearly, D-MT-KLMS was the best algorithm in this case, showing that it can be an effective solution for nonlinear multitask problems.

In Fig. 7 we show the MSE evolution for three representative agents. As expected, their performance is different depending on the selected learning rate and amount of noise, but the multitask algorithm is able to effectively combine the learning curves to obtain the average behavior as in the purple line of Fig. 6. Finally, in Fig. 8 we show the average MSE evolution when varying the size of the dictionary. Clearly, increasing the size improves the accuracy (up to a given upper bound), at the cost of a larger computational burden.

6 Discussion and open problems

Distributed inference is a fundamental tool given today’s technological trends. In the adaptive filtering community, many classical algorithms can be readily extended to the distributed scenario by exploiting diffusion principles, where local adaptation steps are interleaved with communication steps between neighbors. The resulting algorithms are both computationally efficient and deployable over a large set of scenarios. In this chapter, we reviewed the basic tools of this field, and we briefly surveyed some of the nonlinear extensions that have been proposed.

An important distinction can be made between single-task problems, where all agents share the same minimizer, and multitask problems, where the minimizers can be different but it is known that they share some similarities. We underlined how very little work has been done on the nonlinear multitask case, and we proposed a simple kernel-based diffusion algorithm to this end. Many extensions over the basic setup of this chapter are possible, most notably a way to remove the assumption of a shared dictionary, an adaptive way to build the regularization coefficients, a theoretical analysis of the algorithm, or additional extensions towards asynchronous networks. Finally, we can consider mixing multitask networks with multi-objective algorithms [53], such that each agent is interested in minimizing multiple objectives simultaneously.

Acknowledgments

The work of Simone Scardapane was supported in part by Italian MIUR, “Progetti di Ricerca di Rilevante Interesse Nazionale”, GAUChO project, under Grant 2015YPXH4W_004. The work of Jie Chen was supported in part by the National Natural Science Foundation of China (NSFC grant 61671382).

Bibliography

[1] V. Cevher, S. Becker, M. Schmidt, Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics, IEEE Signal Processing Magazine 31 (5) (2014) 32–43.
[2] A. Sandryhaila, J. M. F. Moura, Big data analysis with signal processing on graphs: Representation and processing of massive data sets with irregular structure, IEEE Signal Processing Magazine 31 (5) (2014) 80–90.
[3] P. Di Lorenzo, S. Barbarossa, P. Banelli, S. Sardellitti, Adaptive least mean squares estimation of graph signals, IEEE Transactions on Signal and Information Processing over Networks 2 (4) (2016) 555–568.
[4] J. B. Predd, S. B. Kulkarni, H. V. Poor, Distributed learning in wireless sensor networks, IEEE Signal Processing Magazine 23 (4) (2006) 56–69.
[5] J. Tsitsiklis, D. Bertsekas, M. Athans, Distributed asynchronous deterministic and stochastic gradient optimization algorithms, IEEE Transactions on Automatic Control 31 (9) (1986) 803–812.
[6] A. Lazarevic, Z. Obradovic, The distributed boosting algorithm, in: Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2001, pp. 311–316.
[7] P. A. Forero, A. Cano, G. B. Giannakis, Consensus-based distributed support vector machines, Journal of Machine Learning Research 11 (May) (2010) 1663–1707.
[8] S. Scardapane, R. Fierimonte, P. Di Lorenzo, M. Panella, A. Uncini, Distributed semi-supervised support vector machines, Neural Networks 80 (2016) 43–52.