Audiovisual Speaker Tracking using Nonlinear Dynamical Systems with Dynamic Stream Weights

Christopher Schymura; Dorothea Kolossa

arXiv:1903.06031·cs.CV·March 15, 2019


TL;DR

This paper introduces a flexible framework that integrates dynamic stream weights into nonlinear dynamical systems for audiovisual speaker tracking, enhancing data fusion and improving tracking accuracy under varying sensor reliability.

Contribution

It extends nonlinear dynamical systems with dynamic stream weights, proposes a recursive Gaussian filtering approach, and introduces a convex optimization method for estimating oracle weights, adaptable to various applications.

Findings

1. Improved speaker tracking performance over existing methods.

2. Effective dynamic weighting of audiovisual streams based on sensor reliability.

3. Framework is application-independent and adaptable.


Tables

Table 1. Circular root mean squared errors in degrees with corresponding standard deviations, obtained by the proposed Bayesian filtering framework using oracle dynamic stream weights and the extended Kalman filter baseline methods. Stars ($\star$) indicate a statistically significant improvement of the ODSW-EKFs over the EKF baseline with $p<0.05$.

Method	Undistorted	SNR $0 dB$	SNR $15 dB$	SNR $30 dB$	Image rotation $10^{\circ}$	Image rotation $15^{\circ}$	Image rotation $20^{\circ}$
KAVLoC ( $N = 70$ )
EKF (Audio)	$5.34 \pm 1.80$	$11.53 \pm 5.02$	$5.74 \pm 1.90$	$5.30 \pm 1.79$	$5.34 \pm 1.80$	$5.34 \pm 1.80$	$5.34 \pm 1.80$
EKF (Video)	$5.19 \pm 4.12$	$5.19 \pm 4.12$	$5.19 \pm 4.12$	$5.19 \pm 4.12$	$6.09 \pm 1.41$	$7.36 \pm 1.84$	$9.51 \pm 3.82$
EKF (Audiovisual)	$4.77 \pm 2.89$	$6.69 \pm 3.30$	$4.99 \pm 2.86$	$4.78 \pm 2.91$	$5.06 \pm 1.62$	$5.58 \pm 1.49$	$6.39 \pm 1.68$
ODSW-EKF (Gaussian)	$4.25 \pm 1.67$	$4.37 \pm 1.74$ ^⋆	$4.27 \pm 1.67$	$4.25 \pm 1.74$	$4.74 \pm 1.33$	$5.00 \pm 1.34$	$5.35 \pm 1.54$ ^⋆
ODSW-EKF (Dirichlet)	$4.15 \pm 1.38$	$4.28 \pm 1.47$ ^⋆	$4.18 \pm 1.39$	$4.15 \pm 1.41$	$4.81 \pm 1.34$	$5.07 \pm 1.38$	$5.40 \pm 1.65$ ^⋆
NAVLoC ( $N = 400$ )
EKF (Audio)	$10.86 \pm 3.99$	$10.65 \pm 3.36$	$10.79 \pm 3.86$	$10.85 \pm 3.96$	$10.86 \pm 3.99$	$10.86 \pm 3.99$	$10.86 \pm 3.99$
EKF (Video)	$8.82 \pm 0.70$	$8.82 \pm 0.70$	$8.82 \pm 0.70$	$8.82 \pm 0.70$	$9.12 \pm 1.53$	$9.86 \pm 1.83$	$9.81 \pm 1.30$
EKF (Audiovisual)	$9.54 \pm 2.82$	$9.40 \pm 2.44$	$9.49 \pm 2.74$	$9.53 \pm 2.81$	$9.94 \pm 3.40$	$10.50 \pm 3.90$	$10.65 \pm 3.98$
ODSW-EKF (Gaussian)	$8.83 \pm 0.81$ ^⋆	$8.82 \pm 0.76$ ^⋆	$8.83 \pm 0.81$ ^⋆	$8.83 \pm 0.82$ ^⋆	$9.00 \pm 1.79$ ^⋆	$9.72 \pm 2.82$ ^⋆	$10.16 \pm 3.31$
ODSW-EKF (Dirichlet)	$8.83 \pm 0.81$ ^⋆	$8.81 \pm 0.76$ ^⋆	$8.82 \pm 0.81$ ^⋆	$8.83 \pm 0.82$ ^⋆	$8.99 \pm 1.79$ ^⋆	$9.72 \pm 2.82$ ^⋆	$10.16 \pm 3.31$
MVAD ( $N = 6$ )
EKF (Audio)	$5.37 \pm 2.53$	$10.33 \pm 5.69$	$10.73 \pm 6.65$	$4.81 \pm 1.84$	$5.37 \pm 2.53$	$5.37 \pm 2.53$	$5.37 \pm 2.53$
EKF (Video)	$1.81 \pm 1.69$	$1.81 \pm 1.69$	$1.81 \pm 1.69$	$1.81 \pm 1.69$	$4.33 \pm 2.41$	$3.89 \pm 1.37$	$4.89 \pm 1.81$
EKF (Audiovisual)	$2.32 \pm 1.50$	$2.98 \pm 1.53$	$2.34 \pm 1.59$	$2.33 \pm 1.55$	$4.31 \pm 2.06$	$4.16 \pm 0.74$	$5.00 \pm 1.41$
ODSW-EKF (Gaussian)	$1.71 \pm 1.65$	$1.76 \pm 1.64$	$1.72 \pm 1.65$	$1.71 \pm 1.65$	$3.99 \pm 2.58$	$3.87 \pm 1.36$	$5.04 \pm 2.12$
ODSW-EKF (Dirichlet)	$1.71 \pm 1.65$	$1.77 \pm 1.65$	$1.72 \pm 1.65$	$1.71 \pm 1.66$	$4.00 \pm 2.58$	$3.93 \pm 1.42$	$5.04 \pm 2.12$

Table 2. Circular root mean squared errors in degrees with standard deviations obtained using different audiovisual speaker tracking algorithms. Values in a column suffixed with different superscript letters are significantly different from each other at $p<0.05$.

	KAVLoC	NAVLoC	MVAD
EKF	$6.16 \pm {1.73}^{a}$	$10.08 \pm {3.37}^{a}$	$4.05 \pm {0.61}^{a}$
Gehrig et al. [30]	$6.42 \pm {1.56}^{a}$	$10.37 \pm {1.84}^{a}$	$4.59 \pm {0.63}^{a}$
Gerlach et al. [31]	$6.22 \pm {4.46}^{a}$	$15.20 \pm {4.68}^{b}$	$2.85 \pm {0.53}^{b}$
Qian et al. [32]	$6.21 \pm {2.86}^{a}$	$10.17 \pm {7.22}^{a}$	$3.93 \pm {0.37}^{b}$
ODSW-EKF (Dirichlet)	$5.09 \pm {1.27}^{b}$	$9.32 \pm {1.99}^{c}$	$3.64 \pm {0.91}^{b}$
DSW-EKF	$6.12 \pm {1.58}^{a}$	$9.76 \pm {1.99}^{a}$	$3.64 \pm {0.64}^{b}$

Equations

$$\boldsymbol{x}_{k}=f(\boldsymbol{x}_{k-1})+\boldsymbol{v}_{k} \tag{1}$$

$$\boldsymbol{y}_{m,k}=h_{m}(\boldsymbol{x}_{k})+\boldsymbol{w}_{m,k} \tag{2}$$

$$p(\mathcal{X}_{0:k},\,\mathcal{Y}_{1,1:k},\,\ldots,\,\mathcal{Y}_{M,1:k})=p(\boldsymbol{x}_{0})\prod_{k'=1}^{k}p(\boldsymbol{x}_{k'}\,|\,\boldsymbol{x}_{k'-1})\prod_{m=1}^{M}p(\boldsymbol{y}_{m,k'}\,|\,\boldsymbol{x}_{k'})^{\lambda_{m,k'}} \tag{3}$$

$$p(\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k-1},\,\ldots,\,\mathcal{Y}_{M,1:k-1})=\int p(\boldsymbol{x}_{k}\,|\,\boldsymbol{x}_{k-1})\,p(\boldsymbol{x}_{k-1}\,|\,\mathcal{Y}_{1,1:k-1},\,\ldots,\,\mathcal{Y}_{M,1:k-1})\,\mathrm{d}\boldsymbol{x}_{k-1} \tag{4}$$

$$p(\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k},\,\ldots,\,\mathcal{Y}_{M,1:k})\propto p(\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k-1},\,\ldots,\,\mathcal{Y}_{M,1:k-1})\prod_{m=1}^{M}p(\boldsymbol{y}_{m,k}\,|\,\boldsymbol{x}_{k})^{\lambda_{m,k}} \tag{5}$$

$$f(\boldsymbol{x}_{k-1})\approx f(\hat{\boldsymbol{x}}_{k-1})+\boldsymbol{F}(\hat{\boldsymbol{x}}_{k-1})\boldsymbol{\delta}_{k-1} \tag{6}$$

$$h_{m}(\boldsymbol{x}_{k})\approx h_{m}(\hat{\boldsymbol{x}}_{k})+\boldsymbol{H}_{m}(\hat{\boldsymbol{x}}_{k})\boldsymbol{\delta}_{k} \tag{7}$$

$$p(\boldsymbol{x}_{k}\,|\,\boldsymbol{x}_{k-1})=\mathcal{N}\Big(\boldsymbol{x}_{k}\,|\,f(\hat{\boldsymbol{x}}_{k-1})+\boldsymbol{F}_{k-1}\boldsymbol{\delta}_{k-1},\,\boldsymbol{Q}\Big) \tag{8}$$

$$p(\boldsymbol{y}_{m,k}\,|\,\boldsymbol{x}_{k})=\mathcal{N}\Big(\boldsymbol{y}_{m,k}\,|\,h_{m}(\hat{\boldsymbol{x}}_{k})+\boldsymbol{H}_{m,k}\boldsymbol{\delta}_{k},\,\boldsymbol{R}_{m}\Big) \tag{9}$$

$$p(\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k},\,\ldots,\,\mathcal{Y}_{M,1:k})=\mathcal{N}\Big(\boldsymbol{x}_{k}\,|\,\hat{\boldsymbol{x}}_{k},\,\hat{\boldsymbol{\Sigma}}_{k}\Big) \tag{10}$$

$$\begin{aligned}\log\Big\{p(\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k},\,\ldots,\,\mathcal{Y}_{M,1:k})\Big\}\propto{}&(\boldsymbol{x}_{k}-\hat{\boldsymbol{x}}_{k|k-1})^{\mathrm{T}}\hat{\boldsymbol{\Sigma}}_{k|k-1}^{-1}(\boldsymbol{x}_{k}-\hat{\boldsymbol{x}}_{k|k-1})\\&+\sum_{m=1}^{M}\lambda_{m,k}\Big[\Big(\boldsymbol{y}_{m,k}-h_{m}(\hat{\boldsymbol{x}}_{k})-\boldsymbol{H}_{m,k}(\boldsymbol{x}_{k}-\hat{\boldsymbol{x}}_{k})\Big)^{\mathrm{T}}\\&\times\boldsymbol{R}_{m}^{-1}\Big(\boldsymbol{y}_{m,k}-h_{m}(\hat{\boldsymbol{x}}_{k})-\boldsymbol{H}_{m,k}(\boldsymbol{x}_{k}-\hat{\boldsymbol{x}}_{k})\Big)\Big]\end{aligned} \tag{11}$$

$$\begin{aligned}\frac{\partial}{\partial\boldsymbol{x}_{k}}\log\Big\{p(\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k},\,\ldots,\,\mathcal{Y}_{M,1:k})\Big\}\propto{}&\hat{\boldsymbol{\Sigma}}_{k|k-1}^{-1}(\hat{\boldsymbol{x}}_{k}-\hat{\boldsymbol{x}}_{k|k-1})\\&+\sum_{m=1}^{M}\lambda_{m,k}\Big[\boldsymbol{H}_{m,k}^{\mathrm{T}}\boldsymbol{R}_{m}^{-1}\Big(h_{m}(\hat{\boldsymbol{x}}_{k|k-1})-\boldsymbol{y}_{m,k}\Big)\\&+\boldsymbol{H}_{m,k}^{\mathrm{T}}\boldsymbol{R}_{m}^{-1}\boldsymbol{H}_{m,k}\Big(\hat{\boldsymbol{x}}_{k}-\hat{\boldsymbol{x}}_{k|k-1}\Big)\Big]\end{aligned} \tag{12}$$

$$\frac{\partial^{2}}{\partial\boldsymbol{x}_{k}^{2}}\log\Big\{p(\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k},\,\ldots,\,\mathcal{Y}_{M,1:k})\Big\}\propto\hat{\boldsymbol{\Sigma}}_{k|k-1}^{-1}+\sum_{m=1}^{M}\lambda_{m,k}\Big[\boldsymbol{H}_{m,k}^{\mathrm{T}}\boldsymbol{R}_{m}^{-1}\boldsymbol{H}_{m,k}\Big] \tag{13}$$

$$\hat{\boldsymbol{\Sigma}}_{k}=\Big(\boldsymbol{I}-\sum_{m=1}^{M}\lambda_{m,k}\boldsymbol{K}_{m,k}\boldsymbol{H}_{m,k}\Big)\hat{\boldsymbol{\Sigma}}_{k|k-1} \tag{14}$$

$$\boldsymbol{K}_{m,k}=\hat{\boldsymbol{\Sigma}}_{k}\boldsymbol{H}_{m,k}^{\mathrm{T}}\boldsymbol{R}_{m}^{-1} \tag{15}$$

$$\begin{bmatrix}\boldsymbol{R}_{1}+\lambda_{1,k}\boldsymbol{H}_{1,k}\hat{\boldsymbol{\Sigma}}_{k|k-1}\boldsymbol{H}_{1,k}^{\mathrm{T}}&\cdots&\lambda_{M,k}\boldsymbol{H}_{1,k}\hat{\boldsymbol{\Sigma}}_{k|k-1}\boldsymbol{H}_{M,k}^{\mathrm{T}}\\\vdots&\ddots&\vdots\\\lambda_{1,k}\boldsymbol{H}_{M,k}\hat{\boldsymbol{\Sigma}}_{k|k-1}\boldsymbol{H}_{1,k}^{\mathrm{T}}&\cdots&\boldsymbol{R}_{M}+\lambda_{M,k}\boldsymbol{H}_{M,k}\hat{\boldsymbol{\Sigma}}_{k|k-1}\boldsymbol{H}_{M,k}^{\mathrm{T}}\end{bmatrix}\begin{bmatrix}\boldsymbol{K}_{1,k}^{\mathrm{T}}\\\vdots\\\boldsymbol{K}_{M,k}^{\mathrm{T}}\end{bmatrix}=\begin{bmatrix}\boldsymbol{H}_{1,k}\hat{\boldsymbol{\Sigma}}_{k|k-1}\\\vdots\\\boldsymbol{H}_{M,k}\hat{\boldsymbol{\Sigma}}_{k|k-1}\end{bmatrix} \tag{16}$$

$$\Big[\boldsymbol{R}+\boldsymbol{U}_{k}\boldsymbol{W}_{k}\boldsymbol{U}_{k}^{\mathrm{T}}\Big]\boldsymbol{K}_{k}=\boldsymbol{B}_{k}\hat{\boldsymbol{\Sigma}}_{k|k-1} \tag{17}$$

$$\boldsymbol{L}_{k}=\begin{bmatrix}\lambda_{1,k}&\cdots&\lambda_{M,k}\\\vdots&\ddots&\vdots\\\lambda_{1,k}&\cdots&\lambda_{M,k}\end{bmatrix} \tag{18}$$

$$\boldsymbol{K}_{k}=\Big(\boldsymbol{R}^{-1}-\boldsymbol{R}^{-1}\boldsymbol{U}_{k}\boldsymbol{\Gamma}_{k}\boldsymbol{U}_{k}^{\mathrm{T}}\boldsymbol{R}^{-1}\Big)\boldsymbol{B}_{k}\hat{\boldsymbol{\Sigma}}_{k|k-1} \tag{19}$$

$$\hat{\boldsymbol{x}}_{k}=\hat{\boldsymbol{x}}_{k|k-1}+\sum_{m=1}^{M}\lambda_{m,k}\boldsymbol{K}_{m,k}\Big(\boldsymbol{y}_{m,k}-h_{m}(\hat{\boldsymbol{x}}_{k|k-1})\Big) \tag{20}$$

Alg. 1, prediction step:

$$\hat{\boldsymbol{x}}_{k|k-1}=f(\hat{\boldsymbol{x}}_{k-1}),\qquad\hat{\boldsymbol{\Sigma}}_{k|k-1}=\boldsymbol{F}_{k-1}\hat{\boldsymbol{\Sigma}}_{k-1}\boldsymbol{F}_{k-1}^{\mathrm{T}}+\boldsymbol{Q}$$

Alg. 1, update step:

$$\tilde{\boldsymbol{y}}_{m,k}=\boldsymbol{y}_{m,k}-h_{m}(\hat{\boldsymbol{x}}_{k|k-1}),\qquad\hat{\boldsymbol{x}}_{k}=\hat{\boldsymbol{x}}_{k|k-1}+\sum_{m=1}^{M}\lambda_{m,k}\boldsymbol{K}_{m,k}\tilde{\boldsymbol{y}}_{m,k},\qquad\hat{\boldsymbol{\Sigma}}_{k}=\Big(\boldsymbol{I}-\sum_{m=1}^{M}\lambda_{m,k}\boldsymbol{K}_{m,k}\boldsymbol{H}_{m,k}\Big)\hat{\boldsymbol{\Sigma}}_{k|k-1}$$

Eq. (21), left-hand side only (truncated in the source): $p(\mathcal{X}_{1:K},\,\mathcal{Y}_{1,1:K},\,\ldots$


Code & Models

Repositories

rub-ksv/avtrack (official)


Full text

Audiovisual Speaker Tracking using Nonlinear Dynamical Systems with Dynamic Stream Weights

Christopher Schymura and Dorothea Kolossa

C. Schymura and D. Kolossa are with the Cognitive Signal Processing Group, Institute of Communication Acoustics, Faculty of Electrical Engineering and Information Technology, Ruhr University Bochum, 44801 Bochum, Germany.

Abstract

Data fusion plays an important role in many technical applications that require efficient processing of multimodal sensory observations. A prominent example is audiovisual signal processing, which has gained increasing attention in automatic speech recognition, speaker localization and related tasks. If appropriately combined with acoustic information, additional visual cues can help to improve the performance in these applications, especially under adverse acoustic conditions. A dynamic weighting of acoustic and visual streams based on instantaneous sensor reliability measures is an efficient approach to data fusion in this context. This paper presents a framework that extends the well-established theory of nonlinear dynamical systems with the notion of dynamic stream weights for an arbitrary number of sensory observations. It comprises a recursive state estimator based on the Gaussian filtering paradigm, which incorporates dynamic stream weights into a framework closely related to the extended Kalman filter. Additionally, a convex optimization approach to estimate oracle dynamic stream weights in fully observed dynamical systems utilizing a Dirichlet prior is presented. This serves as a basis for a generic parameter learning framework of dynamic stream weight estimators. The proposed system is application-independent and can be easily adapted to specific tasks and requirements. A study using audiovisual speaker tracking tasks is considered as an exemplary application in this work. An improved tracking performance of the dynamic stream weight-based estimation framework over state-of-the-art methods is demonstrated in the experiments.

I Introduction

Effective fusion of signals acquired from different sensory modalities is an important aspect of many technical applications. With the advent of emerging technologies like autonomous driving [1, 2], assistive robotics [3, 4], smart home environments [5, 6, 7] and automatic speech recognition (ASR) [8, 9, 10], significant progress has been made in the development of algorithms for multisensor data fusion (MDF) [11]. The fundamental problem that MDF tries to solve is how signals obtained from different sensors can be combined to maximize information gain for the variables of interest [12], e.g. the location of a speaker or the transcription of a spoken sentence.

Many successful MDF algorithms belong to the class of probabilistic fusion methods, often also denoted as Bayesian fusion. An exceptional example in this regard is the well-established Kalman filter (KF) [13]. Even though it relies on a linear Gaussian model, the KF has gained tremendous success in a wide range of applications, due to its mathematical intuitiveness and computational efficiency [11, 14]. Since it was first introduced, many extensions to the KF have been proposed, most notably the extended Kalman filter (EKF) [15] and the unscented Kalman filter (UKF) [16], which both overcome the linearity constraint. Additionally, particle filters (PFs) were introduced as a framework to cope with nonlinear systems affected by non-Gaussian noise [17].

A common property of the aforementioned Bayesian fusion techniques is that they handle the actual sensor fusion implicitly. For instance, the standard KF maintains a joint observation noise covariance matrix describing the noise characteristics of all sensors. This is an efficient approach for MDF, as long as the reliabilities of individual sensors do not change over time. To overcome this constraint, adaptive KFs have been proposed [18], which maintain and update an estimate of the observation noise covariance matrix at each time step. However, this adaptation process requires a sufficient number of past observations to achieve reliable estimates of the observation noise covariance matrix. An optimal adaptation framework would instead allow the individual contributions of each sensor to be weighted instantaneously.

This was recently proposed in [19], which presented an inference scheme for continuous state spaces in the regime of linear dynamical systems (DSs) that incorporated dynamic stream weights (DSWs) into the estimation process. DSWs are weighting factors that control the contribution of the individual sensory observations to the estimation process at each time step. The work was inspired by previous applications of DSWs, which were initially proposed in the context of audiovisual ASR. Pioneering work in this regard was conducted in [20] for probabilistic inference incorporating DSWs into hidden Markov models (HMMs). Compared to conventional Bayesian fusion techniques, this allows the estimation process to be adapted rapidly. However, predicting DSWs requires the availability of instantaneous sensor reliability measures, which depend on sensor type and the desired application. Research on appropriate reliability measures has already been conducted in the context of audiovisual ASR [21, 22]. A standard approach is to compute oracle dynamic stream weights (ODSWs) from training data with available ground-truth information and to subsequently perform supervised training of, e.g., a regression function or a neural network [20, 22, 10].

This paper introduces an extended framework for MDF using nonlinear DSs based on the method initially proposed in [19]. This method was restricted to a linear Gaussian state space representation and two sensory modalities. This paper proposes a general recursive inference method for state estimation that can be applied to nonlinear DSs with Gaussian noise, DSWs and an arbitrary number of observations. Furthermore, a means to obtain ODSWs from fully observed DSs based on a Dirichlet prior imposed on the stream weights is presented. Compared to the previously introduced ODSW estimators using a Gaussian prior [20, 19], this novel approach provides a clear and intuitive probabilistic interpretation. Additionally, a generic learning scheme for DSW prediction models using application-specific reliability measures is presented. It allows a broad class of models to be trained, whose sole restrictions are differentiability with respect to the function parameters and a softmax output function. Hence, nonlinear models such as deep neural networks (DNNs) are naturally supported as potential DSW estimators.

The application of audiovisual speaker tracking is considered in this study. It is well suited for evaluating Bayesian MDF approaches with continuous state spaces, as the variables of interest are encoded as either Cartesian coordinates or direction-of-arrival (DoA) values. Furthermore, speaker localization scenarios involve highly dynamic components if speakers are moving within the environment and have to cope with various types of disturbances. This includes background noise and reverberation affecting the acoustic signals, as well as changing lighting conditions and occlusion affecting the video signals. It should be noted that the entire framework proposed in this paper is not restricted to a particular application, but should rather be considered as a generic approach to Bayesian MDF incorporating DSWs.

A variety of related models for Bayesian fusion have been previously proposed for many technical applications. To put the work presented in this study into context, related existing approaches will be briefly reviewed in the following.

Early work on MDF introduced a special class of DSs that utilize standard KFs and their extensions for Bayesian fusion. These systems are referred to as distributed dynamical systems (DDSs) [23, 24, 25]. They provide a natural extension to the DS paradigm by incorporating multiple independent sensors with distinct observation models and noise characteristics. Prominent application domains for DDSs are wireless sensor networks [26, 27] and multiagent systems [28]. The mathematical foundations of DDSs provide a generic framework for modeling systems with multimodal sensory input. For instance, the work reported in [26] proposes a distributed KF for state estimation in wireless sensor networks. It is based on decomposing the standard KF into a set of so-called micro KFs for each of the individual sensory observations. This results in a network of KFs which is capable of collectively estimating the system state. A theoretical analysis of similar inference algorithms for DDSs is given in [29].

Besides the many contributions in the field of wireless sensor networks, further successful approaches to Bayesian MDF have been proposed in other technical fields. Focusing on the domain of audiovisual signal processing, a variety of algorithms for audiovisual speaker tracking are available. For instance, the framework described in [30] uses an EKF with a joint audiovisual observation vector to localize and track speakers during recorded seminars. It does not incorporate a distributed architecture of the underlying DS, but rather handles data fusion implicitly during the recursive update step. Another approach was introduced in [31], where a PF was utilized to localize and track speakers in domestic environments. The framework provided explicit control over the individual contributions of acoustic and visual observations via exponential weighting parameters, which were determined a priori using a grid search. A recently proposed algorithm for speaker tracking has explicitly considered sensor reliability measures within a particle filtering framework [32]. This work utilized the peak value of the acoustic global coherence field and the correlation between a color-histogram template and the detected face as features, which affected the weighting and resampling step of the particle filter.

A noteworthy study that fits nicely into the context of this work is the framework based on distributed multi-sensor, multi-target tracking presented in [33]. It proposes a recursive Bayesian filter that assigns weights to sensory observations based on exponential mixture densities. This representation of the filtering distribution is mathematically similar to the framework proposed in this paper. However, the weighting scheme serves a different purpose, namely, optimizing track-to-track fusion in multi-object distributions. This stands in contrast to the present study, where the weighting is applied instantaneously rather than using fixed weighting factors.

In this regard, the framework presented in this paper is most closely related to approaches developed for discrete state spaces based on HMMs in audiovisual ASR [20, 22, 21, 10]. The primary contribution of the present work is the introduction of DSW-based MDF into continuous DSs with nonlinear dynamics and observations, which is a natural extension of the initial work reported in [19]. Additionally, a novel approach for computing ODSWs incorporating a Dirichlet prior is derived and integrated into a generic learning framework that allows DSW prediction models to be trained.

II Dynamical System Description

This section presents an extension of the state estimation framework proposed in [19] to nonlinear DSs with DSWs and an arbitrary number of independent observations. A generic state estimation algorithm based on the Gaussian filter paradigm is derived and its relation to the standard EKF is discussed. A structural overview of the proposed system is depicted in Fig. 1. It illustrates the relation between all system components that will be described in the following sections.

II-A Nonlinear system model

Consider an autonomous, discrete-time nonlinear DS with Gaussian noise and $M$ independent observations

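$$\boldsymbol{x}_{k}=f(\boldsymbol{x}_{k-1})+\boldsymbol{v}_{k}, \tag{1}$$

$$\boldsymbol{y}_{m,k}=h_{m}(\boldsymbol{x}_{k})+\boldsymbol{w}_{m,k}, \tag{2}$$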

where $\boldsymbol{x}_{k}\in\mathbb{R}^{D_{x}}$ denotes the state vector at discrete time step $k$ and $\boldsymbol{y}_{m,k}$ represents the $m$-th observation vector with $m=1,\,\ldots,\,M$. The system dynamics are governed by the state transition function $f(\boldsymbol{x}_{k-1})$ and zero-mean Gaussian noise $\boldsymbol{v}_{k}\sim\mathcal{N}(\boldsymbol{0},\,\boldsymbol{Q})$ with covariance matrix $\boldsymbol{Q}\in\mathbb{R}^{D_{x}\times D_{x}}$. The state-to-observation transformations are described by $M$ observation functions $h_{m}(\boldsymbol{x}_{k})$, which are affected by zero-mean Gaussian noise terms $\boldsymbol{w}_{m,k}\sim\mathcal{N}(\boldsymbol{0},\,\boldsymbol{R}_{m})$ with covariance matrices $\boldsymbol{R}_{m}\in\mathbb{R}^{D_{y,m}\times D_{y,m}}$. Autonomous DSs are considered in this work as they are widely used in localization and tracking applications. However, an extension of the proposed methods to DSs with external input is generally possible.

Following the approach proposed in [19], the incorporation of DSWs $\lambda_{m,k}$ allows the joint likelihood function of the DS described by Eqs. (1)–(2) up to time step $k$ to be expressed as

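$$p(\mathcal{X}_{0:k},\,\mathcal{Y}_{1,1:k},\,\ldots,\,\mathcal{Y}_{M,1:k})=p(\boldsymbol{x}_{0})\prod_{k'=1}^{k}p(\boldsymbol{x}_{k'}\,|\,\boldsymbol{x}_{k'-1})\prod_{m=1}^{M}p(\boldsymbol{y}_{m,k'}\,|\,\boldsymbol{x}_{k'})^{\lambda_{m,k'}}, \tag{3}$$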

where $\mathcal{X}_{0:k}=\{\boldsymbol{x}_{0},\,\ldots,\boldsymbol{x}_{k}\}$ and $\mathcal{Y}_{m,1:k}=\{\boldsymbol{y}_{m,1},\,\ldots,\boldsymbol{y}_{m,k}\}$ are the corresponding sequences (also referred to as trajectories) of state and observation vectors. The DSWs in Eq. (3) must satisfy the constraint $\sum_{m=1}^{M}\lambda_{m,k}=1\;\forall\,k$.
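As a minimal sketch of the observation factor of Eq. (3) at a single time step, the following Python snippet evaluates $\sum_{m}\lambda_{m,k}\log p(\boldsymbol{y}_{m,k}\,|\,\boldsymbol{x}_{k})$ for Gaussian observation noise. The function names, the two scalar example streams and the additional non-negativity check on the weights are illustrative assumptions and are not taken from the paper or its repository.

```python
import numpy as np


def gaussian_logpdf(y, mean, cov):
    """Log-density of a multivariate Gaussian N(y | mean, cov)."""
    diff = np.atleast_1d(np.asarray(y, dtype=float) - mean)
    cov = np.atleast_2d(cov)
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (diff.size * np.log(2.0 * np.pi) + logdet + maha)


def weighted_obs_loglik(x, ys, hs, Rs, weights):
    """Stream-weighted log-likelihood sum_m lambda_m * log p(y_m | x)."""
    weights = np.asarray(weights, dtype=float)
    # Sum-to-one is the constraint from the text; non-negativity is an
    # additional assumption made for this sketch.
    if np.any(weights < 0.0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("stream weights must be non-negative and sum to one")
    return sum(w * gaussian_logpdf(y, h(x), R)
               for w, y, h, R in zip(weights, ys, hs, Rs))


# Hypothetical example: both streams observe the azimuth x[0] directly.
x = np.array([0.3, 0.0])
hs = [lambda s: s[:1], lambda s: s[:1]]        # audio, video
Rs = [np.array([[0.04]]), np.array([[0.01]])]  # audio is assumed noisier
ys = [np.array([0.45]), np.array([0.31])]

audio_only = weighted_obs_loglik(x, ys, hs, Rs, [1.0, 0.0])
balanced = weighted_obs_loglik(x, ys, hs, Rs, [0.5, 0.5])
```

Setting one weight to one and the others to zero recovers the log-likelihood of a single stream, which is the sense in which the weights control each stream's contribution.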

II-B State estimation

A Gaussian filter to infer the state of the DS can be derived by marginalizing out the previous states in the joint likelihood function in Eq. (3). This yields the well-known prediction and update steps of the Bayes filter [34, Chap. 2], given by

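$$p(\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k-1},\,\ldots,\,\mathcal{Y}_{M,1:k-1})=\int p(\boldsymbol{x}_{k}\,|\,\boldsymbol{x}_{k-1})\,p(\boldsymbol{x}_{k-1}\,|\,\mathcal{Y}_{1,1:k-1},\,\ldots,\,\mathcal{Y}_{M,1:k-1})\,\mathrm{d}\boldsymbol{x}_{k-1} \tag{4}$$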

and

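$$p(\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k},\,\ldots,\,\mathcal{Y}_{M,1:k})\propto p(\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k-1},\,\ldots,\,\mathcal{Y}_{M,1:k-1})\prod_{m=1}^{M}p(\boldsymbol{y}_{m,k}\,|\,\boldsymbol{x}_{k})^{\lambda_{m,k}}. \tag{5}$$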

Assuming that the first derivatives of the state transition function and the observation functions in Eqs. (1)–(2) exist, a first-order Taylor series expansion about the estimated state posterior mean $\hat{\boldsymbol{x}}_{k}=E\{\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k},\,\ldots,\,\mathcal{Y}_{M,1:k}\}$ can be expressed as

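$$f(\boldsymbol{x}_{k-1})\approx f(\hat{\boldsymbol{x}}_{k-1})+\boldsymbol{F}(\hat{\boldsymbol{x}}_{k-1})\boldsymbol{\delta}_{k-1} \tag{6}$$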

and

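$$h_{m}(\boldsymbol{x}_{k})\approx h_{m}(\hat{\boldsymbol{x}}_{k})+\boldsymbol{H}_{m}(\hat{\boldsymbol{x}}_{k})\boldsymbol{\delta}_{k}, \tag{7}$$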

with $\boldsymbol{\delta}_{k}=\boldsymbol{x}_{k}-\hat{\boldsymbol{x}}_{k}$, where $\boldsymbol{F}(\hat{\boldsymbol{x}}_{k-1})\in\mathbb{R}^{D_{x}\times D_{x}}$ is the Jacobian of the state transition function and $\boldsymbol{H}_{m}(\hat{\boldsymbol{x}}_{k})\in\mathbb{R}^{D_{y,m}\times D_{x}}$ is the Jacobian of the $m$-th observation function. This approach is equivalent to the derivation of the EKF [15]. For notational convenience, the explicit dependency of the Jacobians on the state will be omitted in the following sections, according to $\boldsymbol{F}(\hat{\boldsymbol{x}}_{k-1})\equiv\boldsymbol{F}_{k-1}$ and $\boldsymbol{H}_{m}(\hat{\boldsymbol{x}}_{k})\equiv\boldsymbol{H}_{m,k}$. With this linearization, the probability density functions (PDFs) in Eqs. (4) and (5) can be expressed as

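$$p(\boldsymbol{x}_{k}\,|\,\boldsymbol{x}_{k-1})=\mathcal{N}\Big(\boldsymbol{x}_{k}\,|\,f(\hat{\boldsymbol{x}}_{k-1})+\boldsymbol{F}_{k-1}\boldsymbol{\delta}_{k-1},\,\boldsymbol{Q}\Big), \tag{8}$$

$$p(\boldsymbol{y}_{m,k}\,|\,\boldsymbol{x}_{k})=\mathcal{N}\Big(\boldsymbol{y}_{m,k}\,|\,h_{m}(\hat{\boldsymbol{x}}_{k})+\boldsymbol{H}_{m,k}\boldsymbol{\delta}_{k},\,\boldsymbol{R}_{m}\Big) \tag{9}$$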

and

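$$p(\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k},\,\ldots,\,\mathcal{Y}_{M,1:k})=\mathcal{N}\Big(\boldsymbol{x}_{k}\,|\,\hat{\boldsymbol{x}}_{k},\,\hat{\boldsymbol{\Sigma}}_{k}\Big), \tag{10}$$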

where $\hat{\boldsymbol{\Sigma}}_{k}$ is the estimated state posterior covariance matrix, which needs to be updated conjointly with the estimated state posterior mean $\hat{\boldsymbol{x}}_{k}$ at each time step. This update will be performed recursively via the prediction and update steps of the Gaussian filter.

The prediction step is obtained by inserting Eqs. (8) and (10) into Eq. (4), taking the first and second derivative and solving for the predicted state mean $\hat{\boldsymbol{x}}_{k|k-1}$ and the predicted state covariance matrix $\hat{\boldsymbol{\Sigma}}_{k|k-1}$ . The resulting equations given in Alg. 1 are identical to the prediction step of the EKF. Hence, the derivation is omitted here, cf. [34, Chap. 3] for details.
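The prediction step can be sketched in a few lines of Python. This is a minimal illustration of the standard EKF prediction referred to above, not code from the paper's repository; the function name `ekf_predict`, the constant-velocity motion model and all numerical values are assumptions chosen for the example.

```python
import numpy as np


def ekf_predict(x_est, P_est, f, F_jac, Q):
    """Propagate the mean through f and the covariance through the Jacobian of f."""
    F = F_jac(x_est)                 # Jacobian evaluated at the previous estimate
    x_pred = f(x_est)                # predicted state mean
    P_pred = F @ P_est @ F.T + Q     # predicted state covariance
    return x_pred, P_pred


# Hypothetical example: state = [azimuth (rad), angular velocity (rad/s)].
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])


def f(x):
    """Constant-velocity transition (linear, so its Jacobian is A)."""
    return A @ x


def F_jac(x):
    return A


Q = np.diag([1e-4, 1e-3])
x_pred, P_pred = ekf_predict(np.array([0.5, 0.2]), 0.01 * np.eye(2), f, F_jac, Q)
```

A linear model is used here only to keep the example easy to check by hand; any differentiable `f` with a matching Jacobian function fits the same interface.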

For the derivation of the update step, Eqs. (9) and (10) are inserted into the log-likelihood form of Eq. (5), yielding

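$$\begin{aligned}\log\Big\{p(\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k},\,\ldots,\,\mathcal{Y}_{M,1:k})\Big\}\propto{}&(\boldsymbol{x}_{k}-\hat{\boldsymbol{x}}_{k|k-1})^{\mathrm{T}}\hat{\boldsymbol{\Sigma}}_{k|k-1}^{-1}(\boldsymbol{x}_{k}-\hat{\boldsymbol{x}}_{k|k-1})\\&+\sum_{m=1}^{M}\lambda_{m,k}\Big[\Big(\boldsymbol{y}_{m,k}-h_{m}(\hat{\boldsymbol{x}}_{k})-\boldsymbol{H}_{m,k}(\boldsymbol{x}_{k}-\hat{\boldsymbol{x}}_{k})\Big)^{\mathrm{T}}\\&\times\boldsymbol{R}_{m}^{-1}\Big(\boldsymbol{y}_{m,k}-h_{m}(\hat{\boldsymbol{x}}_{k})-\boldsymbol{H}_{m,k}(\boldsymbol{x}_{k}-\hat{\boldsymbol{x}}_{k})\Big)\Big].\end{aligned} \tag{11}$$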

Taking the first and second derivative of Eq. (11), where the system state $\boldsymbol{x}_{k}$ has been substituted with the estimated state posterior mean $\hat{\boldsymbol{x}}_{k}$ , results in the expressions

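$$\begin{aligned}\frac{\partial}{\partial\boldsymbol{x}_{k}}\log\Big\{p(\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k},\,\ldots,\,\mathcal{Y}_{M,1:k})\Big\}\propto{}&\hat{\boldsymbol{\Sigma}}_{k|k-1}^{-1}(\hat{\boldsymbol{x}}_{k}-\hat{\boldsymbol{x}}_{k|k-1})\\&+\sum_{m=1}^{M}\lambda_{m,k}\Big[\boldsymbol{H}_{m,k}^{\mathrm{T}}\boldsymbol{R}_{m}^{-1}\Big(h_{m}(\hat{\boldsymbol{x}}_{k|k-1})-\boldsymbol{y}_{m,k}\Big)\\&+\boldsymbol{H}_{m,k}^{\mathrm{T}}\boldsymbol{R}_{m}^{-1}\boldsymbol{H}_{m,k}\Big(\hat{\boldsymbol{x}}_{k}-\hat{\boldsymbol{x}}_{k|k-1}\Big)\Big]\end{aligned} \tag{12}$$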

and

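$$\frac{\partial^{2}}{\partial\boldsymbol{x}_{k}^{2}}\log\Big\{p(\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k},\,\ldots,\,\mathcal{Y}_{M,1:k})\Big\}\propto\hat{\boldsymbol{\Sigma}}_{k|k-1}^{-1}+\sum_{m=1}^{M}\lambda_{m,k}\Big[\boldsymbol{H}_{m,k}^{\mathrm{T}}\boldsymbol{R}_{m}^{-1}\boldsymbol{H}_{m,k}\Big]. \tag{13}$$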

The second derivative in Eq. (13) is the curvature of the quadratic function in Eq. (11), whose inverse is the covariance matrix of the state posterior $p(\boldsymbol{x}_{k}\,|\,\mathcal{Y}_{1,1:k},\,\ldots,\,\mathcal{Y}_{M,1:k})$ , cf. [34, Chap. 3]. Therefore, a closed-form expression

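$$\hat{\boldsymbol{\Sigma}}_{k}=\Big(\boldsymbol{I}-\sum_{m=1}^{M}\lambda_{m,k}\boldsymbol{K}_{m,k}\boldsymbol{H}_{m,k}\Big)\hat{\boldsymbol{\Sigma}}_{k|k-1} \tag{14}$$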

for the estimated state posterior covariance matrix can be obtained, where

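$$\boldsymbol{K}_{m,k}=\hat{\boldsymbol{\Sigma}}_{k}\boldsymbol{H}_{m,k}^{\mathrm{T}}\boldsymbol{R}_{m}^{-1} \tag{15}$$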

is defined as the Kalman gain corresponding to the $m$-th observation. To resolve the dependency of Eq. (15) on the estimated state posterior covariance matrix $\hat{\boldsymbol{\Sigma}}_{k}$, Eq. (14) is inserted into Eq. (15). An analytic solution for the individual Kalman gains $\boldsymbol{K}_{m,k}$ can then be derived by solving the system of linear matrix equations shown in Eq. (16). This system can be expressed as

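$$\Big[\boldsymbol{R}+\boldsymbol{U}_{k}\boldsymbol{W}_{k}\boldsymbol{U}_{k}^{\mathrm{T}}\Big]\boldsymbol{K}_{k}=\boldsymbol{B}_{k}\hat{\boldsymbol{\Sigma}}_{k|k-1}, \tag{17}$$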

where $\boldsymbol{R}=\mathrm{blkdiag}(\boldsymbol{R}_{1},\,\ldots,\,\boldsymbol{R}_{M})$ is a block-diagonal matrix composed of all corresponding observation noise covariance matrices, $\boldsymbol{U}_{k}=\mathrm{blkdiag}(\boldsymbol{H}_{1,k},\,\ldots,\,\boldsymbol{H}_{M,k})$ comprises all observation Jacobians, $\boldsymbol{W}_{k}=\boldsymbol{L}_{k}\otimes\hat{\boldsymbol{\Sigma}}_{k|k-1}$ with

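$$\boldsymbol{L}_{k}=\begin{bmatrix}\lambda_{1,k}&\cdots&\lambda_{M,k}\\\vdots&\ddots&\vdots\\\lambda_{1,k}&\cdots&\lambda_{M,k}\end{bmatrix} \tag{18}$$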

and $\boldsymbol{B}_{k}=\begin{bmatrix}\boldsymbol{H}_{1,k}^{\mathrm{T}}&\cdots&\boldsymbol{H}_{M,k}^{\mathrm{T}}\end{bmatrix}^{\mathrm{T}}$. The Kalman gain solution matrix $\boldsymbol{K}_{k}=\begin{bmatrix}\boldsymbol{K}_{1,k}^{\mathrm{T}}&\cdots&\boldsymbol{K}_{M,k}^{\mathrm{T}}\end{bmatrix}^{\mathrm{T}}$ contains all Kalman gains associated with the individual observations. A solution of Eq. (17) is obtained via the binomial inverse theorem [35], which remains applicable although $\boldsymbol{W}_{k}$ is always singular for $M>1$ (shown in detail in App. A). This yields

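$$\boldsymbol{K}_{k}=\Big(\boldsymbol{R}^{-1}-\boldsymbol{R}^{-1}\boldsymbol{U}_{k}\boldsymbol{\Gamma}_{k}\boldsymbol{U}_{k}^{\mathrm{T}}\boldsymbol{R}^{-1}\Big)\boldsymbol{B}_{k}\hat{\boldsymbol{\Sigma}}_{k|k-1}, \tag{19}$$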

with $\boldsymbol{\Gamma}_{k}=\boldsymbol{W}_{k}(\boldsymbol{I}+\boldsymbol{U}_{k}^{\mathrm{T}}\boldsymbol{R}^{-1}\boldsymbol{U}_{k}\boldsymbol{W}_{k})^{-1}$ , which allows an efficient computation of the Kalman gains at each step, as the inverse of the observation noise block-diagonal covariance matrix can be precomputed.
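The closed-form gain expression can be checked numerically against a direct solve of Eq. (17). The sketch below assembles $\boldsymbol{R}$, $\boldsymbol{U}_{k}$, $\boldsymbol{W}_{k}$ and $\boldsymbol{B}_{k}$ as defined above, with $\boldsymbol{B}_{k}$ taken as the vertical stack of the observation Jacobians. The helper names and the two-stream example values are assumptions made for illustration only.

```python
import numpy as np


def block_diag(blocks):
    """Assemble a block-diagonal matrix from a list of 2-D arrays."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out


def stacked_gains(P_pred, Hs, Rs, weights):
    """Stacked transposed gains [K_1^T; ...; K_M^T] via the binomial inverse form."""
    M = len(Hs)
    R_inv = np.linalg.inv(block_diag(Rs))   # constant, can be precomputed once
    U = block_diag(Hs)
    L = np.tile(np.asarray(weights, dtype=float), (M, 1))  # each row: lambda_1..lambda_M
    W = np.kron(L, P_pred)
    B = np.vstack(Hs)
    S = U.T @ R_inv @ U
    Gamma = W @ np.linalg.inv(np.eye(W.shape[0]) + S @ W)
    return (R_inv - R_inv @ U @ Gamma @ U.T @ R_inv) @ B @ P_pred


# Hypothetical two-stream example with a two-dimensional state.
P_pred = np.array([[2.0, 0.5], [0.5, 1.0]])
Hs = [np.array([[1.0, 0.0]]), np.array([[1.0, 0.2]])]
Rs = [np.array([[4.0]]), np.array([[1.0]])]
weights = [0.3, 0.7]

K = stacked_gains(P_pred, Hs, Rs, weights)

# Reference: solve [R + U W U^T] K = B P_pred directly.
U = block_diag(Hs)
W = np.kron(np.tile(weights, (2, 1)), P_pred)
K_direct = np.linalg.solve(block_diag(Rs) + U @ W @ U.T, np.vstack(Hs) @ P_pred)
```

Both routes give the same gains in this example. The closed form never inverts $\boldsymbol{W}_{k}$ itself, which matters because the Kronecker product with the rank-one matrix $\boldsymbol{L}_{k}$ is singular here.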

The corresponding state update recursion is obtained by inserting Eq. (15) into Eq. (12), exploiting the relationship in Eq. (14) and solving for the estimated state posterior mean

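$$\hat{\boldsymbol{x}}_{k}=\hat{\boldsymbol{x}}_{k|k-1}+\sum_{m=1}^{M}\lambda_{m,k}\boldsymbol{K}_{m,k}\Big(\boldsymbol{y}_{m,k}-h_{m}(\hat{\boldsymbol{x}}_{k|k-1})\Big). \tag{20}$$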

The resulting prediction and update steps of the presented Gaussian filtering algorithm are summarized in Alg. 1. An interactive Python implementation of the proposed algorithm is available online at https://github.com/rub-ksv/avtrack.
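The update step described above can be sketched as follows. This is a minimal illustration written for this text, not the code from the repository: the function name `dsw_ekf_update`, its argument layout and the example values are assumptions, and the Jacobians are taken to be evaluated at the predicted state mean. The gains are obtained by solving the block linear system directly, and the mean and covariance are then updated with the stream-weighted sums.

```python
import numpy as np


def dsw_ekf_update(x_pred, P_pred, ys, hs, Hs, Rs, weights):
    """One stream-weighted update step.

    x_pred, P_pred : predicted state mean and covariance
    ys, hs, Hs, Rs : per-stream observations, observation functions,
                     Jacobians (evaluated at x_pred here) and noise covariances
    weights        : stream weights lambda_m, assumed to sum to one
    """
    M = len(ys)
    offs = np.concatenate(([0], np.cumsum([R.shape[0] for R in Rs])))
    n_obs = int(offs[-1])

    # Block (i, j) of the system: delta_ij * R_i + lambda_j * H_i P H_j^T.
    A = np.zeros((n_obs, n_obs))
    for i in range(M):
        for j in range(M):
            block = weights[j] * Hs[i] @ P_pred @ Hs[j].T
            if i == j:
                block = block + Rs[i]
            A[offs[i]:offs[i + 1], offs[j]:offs[j + 1]] = block

    # Rows of the solution hold the transposed gains K_m^T.
    K_stacked = np.linalg.solve(A, np.vstack(Hs) @ P_pred)

    x_new = np.array(x_pred, dtype=float)
    KH = np.zeros_like(P_pred)
    for m in range(M):
        K_m = K_stacked[offs[m]:offs[m + 1]].T
        x_new = x_new + weights[m] * K_m @ (ys[m] - hs[m](x_pred))  # weighted innovation
        KH = KH + weights[m] * K_m @ Hs[m]
    P_new = (np.eye(P_pred.shape[0]) - KH) @ P_pred
    return x_new, P_new


# Hypothetical example: state [azimuth, angular velocity], two azimuth sensors.
x_pred = np.array([0.52, 0.20])
P_pred = np.array([[2.0, 0.5], [0.5, 1.0]])
hs = [lambda x: x[:1], lambda x: x[:1]]
Hs = [np.array([[1.0, 0.0]]), np.array([[1.0, 0.0]])]
Rs = [np.array([[4.0]]), np.array([[1.0]])]
ys = [np.array([0.90]), np.array([0.55])]
weights = [0.3, 0.7]

x_new, P_new = dsw_ekf_update(x_pred, P_pred, ys, hs, Hs, Rs, weights)
```

A useful sanity check is that the inverse of the returned covariance equals the predicted information matrix plus the stream-weighted terms $\lambda_{m,k}\boldsymbol{H}_{m,k}^{\mathrm{T}}\boldsymbol{R}_{m}^{-1}\boldsymbol{H}_{m,k}$, which is the curvature expression from the derivation.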

II-C Comparison with the extended Kalman filter

The state estimation framework presented in this work is a generalization of the standard EKF, which is covered as a special case. This can be easily verified by evaluating Eqs. (14), (16) and (20) for $M=1$ and $\lambda_{1,k}=1~{}\forall\,k$ , which yields the conventional EKF update step. Both methods rely on a first-order Taylor expansion of the nonlinear state transition and observation functions. However, the standard EKF is not capable of incorporating DSWs, which is a unique property of the algorithm proposed here.
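The reduction can be made explicit with a short worked derivation that uses only Eqs. (14), (16) and (20). For $M=1$ and $\lambda_{1,k}=1$, the linear system in Eq. (16) collapses to a single block, and the symmetry of $\hat{\boldsymbol{\Sigma}}_{k|k-1}$ and $\boldsymbol{R}_{1}$ gives:

$$\begin{aligned}\Big(\boldsymbol{R}_{1}+\boldsymbol{H}_{1,k}\hat{\boldsymbol{\Sigma}}_{k|k-1}\boldsymbol{H}_{1,k}^{\mathrm{T}}\Big)\boldsymbol{K}_{1,k}^{\mathrm{T}}&=\boldsymbol{H}_{1,k}\hat{\boldsymbol{\Sigma}}_{k|k-1}\\\Rightarrow\quad\boldsymbol{K}_{1,k}&=\hat{\boldsymbol{\Sigma}}_{k|k-1}\boldsymbol{H}_{1,k}^{\mathrm{T}}\Big(\boldsymbol{H}_{1,k}\hat{\boldsymbol{\Sigma}}_{k|k-1}\boldsymbol{H}_{1,k}^{\mathrm{T}}+\boldsymbol{R}_{1}\Big)^{-1},\\\hat{\boldsymbol{x}}_{k}&=\hat{\boldsymbol{x}}_{k|k-1}+\boldsymbol{K}_{1,k}\Big(\boldsymbol{y}_{1,k}-h_{1}(\hat{\boldsymbol{x}}_{k|k-1})\Big),\\\hat{\boldsymbol{\Sigma}}_{k}&=\Big(\boldsymbol{I}-\boldsymbol{K}_{1,k}\boldsymbol{H}_{1,k}\Big)\hat{\boldsymbol{\Sigma}}_{k|k-1}.\end{aligned}$$

These are the familiar gain, mean and covariance updates of the EKF.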

III Oracle dynamic stream weights

ODSWs have already been thoroughly investigated in the context of ASR, where they were utilized in HMM-based recognizers with audiovisual input [36, 20, 21, 10, 37]. A prominent application of ODSWs is the generation of training targets for supervised learning of DSW estimators. This has been done extensively for audiovisual ASR, but is generally application-independent. Hence, a means to obtain ODSWs based on the nonlinear DS model discussed in the previous section is presented in the following.

III-A Maximum likelihood estimation

The likelihood function introduced in Eq. (3) can be exploited to obtain ODSWs if the DS in Eqs. (1)–(2) is fully observed [19]. For this purpose, a prior distribution other than the uniform prior has to be imposed on the DSWs. If a uniform prior is assumed, the objective function is linear in the DSWs. This results in a problem already reported in the context of audiovisual ASR, where all ODSWs are restricted to take boundary values $\lambda_{m,k}^{\star}\in\{0,\,1\}$, preventing a smooth weighting of the individual modalities [20, 38]. Therefore, given a sequence of observed states $\mathcal{X}_{1:K}=\{\boldsymbol{x}_{1},\,\ldots,\,\boldsymbol{x}_{K}\}$ and $M$ observations $\mathcal{Y}_{m,1:K}=\{\boldsymbol{y}_{m,1},\,\ldots,\,\boldsymbol{y}_{m,K}\}$ with $m=1,\,\ldots,\,M$, a modified joint likelihood function for the fully observed model can be expressed as

[TABLE]

where $\mathcal{L}_{k}=\{\boldsymbol{\lambda}_{1},\,\ldots,\,\boldsymbol{\lambda}_{k}\}$ with $\boldsymbol{\lambda}_{k}=\begin{bmatrix}\lambda_{1,k}&\cdots&\lambda_{M,k}\end{bmatrix}^{\mathrm{T}}$ is a sequence of DSWs, which are i.i.d. and obey the constraint $\sum_{m=1}^{M}\lambda_{m,k}=1\;\forall\,k$.
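The boundary-value problem under a uniform prior can be illustrated with a small Python sketch. With a uniform prior, the per-time-step objective is the linear function $\sum_{m}\lambda_{m,k}\log p(\boldsymbol{y}_{m,k}\,|\,\boldsymbol{x}_{k})$, and a linear function over the probability simplex attains its maximum at a vertex. The function names and the three log-likelihood values below are hypothetical and serve only to show the effect.

```python
import random


def objective(weights, stream_logliks):
    """Linear objective sum_m lambda_m * log p(y_m | x) under a uniform prior."""
    return sum(w * ll for w, ll in zip(weights, stream_logliks))


def oracle_weights_uniform_prior(stream_logliks):
    """Maximizer over the simplex: all weight on the most likely stream."""
    best = max(range(len(stream_logliks)), key=lambda m: stream_logliks[m])
    return [1.0 if m == best else 0.0 for m in range(len(stream_logliks))]


def random_simplex_point(rng, M):
    """Draw a random non-negative weight vector that sums to one."""
    raw = [rng.random() for _ in range(M)]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical per-stream log-likelihoods at one time step.
stream_logliks = [-2.3, -0.7, -1.1]
vertex = oracle_weights_uniform_prior(stream_logliks)

# No randomly drawn interior point should beat the vertex solution.
rng = random.Random(0)
best_random = max(objective(random_simplex_point(rng, 3), stream_logliks)
                  for _ in range(1000))
```

The resulting weights switch abruptly between streams from one time step to the next, which is the lack of smooth weighting that motivates a non-uniform prior.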

III-B Gaussian prior for the special case with two observations

A method to obtain a maximum likelihood (ML) estimate of the ODSWs with a Gaussian prior was proposed in [19] for the special case of $M=2$ , which requires a scalar ODSW $\lambda_{k}^{\star}=\lambda_{1,k}^{\star}=1-\lambda_{2,k}^{\star}$ per time step. An analytic solution

[TABLE]

for linear dynamical systems (LDSs) was derived, where $\mu_{\lambda}$ denotes the mean and $\sigma_{\lambda}^{2}$ represents the variance of the Gaussian prior. This solution is closely related to the Gaussian ODSW estimator for coupled HMMs introduced in [20]. The ODSWs were clipped to fit into the range $[0,\,1]$ . If the mean and variance parameters are appropriately chosen, the resulting distribution could still be approximately assumed as Gaussian within this interval. Nonetheless, a straightforward extension to DSs with $M>2$ is problematic, as a Gaussian PDF is not able to handle the constraint that multiple ODSWs have to sum to one. Clipping with subsequent renormalization of the ODSW values could be utilized in this case. However, an accessible interpretation of the mean and variance parameters of the Gaussian prior remains unclear, as clipping and renormalization impose a nonlinear transform on the obtained ODSWs. The Gaussian prior requires optimization of two hyperparameters on a dedicated validation set, which is usually conducted via a computationally expensive grid search, cf. [21].

III-C Dirichlet prior for an arbitrary number of observations

To cope with an arbitrary number of observations in a theoretically sound and interpretable probabilistic framework, a symmetric Dirichlet prior

[TABLE]

with concentration parameter $\alpha>1$ is utilized in this work. This single hyperparameter still has to be tuned by e.g. a grid search. Inserting Eq. (23) into Eq. (21) and taking into account the i.i.d. property of subsequent trajectory points yields

[TABLE]

for the $k$ -th time step, which can be transformed into the log-domain and hence serve as an objective function for ML estimation according to

[TABLE]

Obtaining an ML estimate of the ODSWs therefore requires solving the optimization problem

[TABLE]

for each time step. It is shown in Appendix B that the objective function given in Eq. (25) is strictly concave for $\alpha>1$ . As the maximization of a concave function is a convex optimization problem, efficient algorithms to solve the problem stated in Eq. (26) can be utilized, cf. [39, Chap. 3]. This also guarantees a unique solution, which corresponds to a global optimum [40, 41].
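The per-time-step problem can be illustrated with a minimal sketch. It assumes that the log-domain objective has the form $J(\boldsymbol{\lambda}_{k})=\sum_{m}\lambda_{m,k}\,\ell_{m,k}+(\alpha-1)\sum_{m}\log\lambda_{m,k}$ with $\ell_{m,k}=\log p(\boldsymbol{y}_{m,k}\,|\,\boldsymbol{x}_{k})$. This form is inferred from the surrounding text (linear in the DSWs under a uniform prior, plus the logarithm of a symmetric Dirichlet prior) and is not a transcription of Eq. (25). The function name `oracle_stream_weights` and the bisection on the Lagrange multiplier are illustrative choices; any convex solver could be used instead.

```python
def oracle_stream_weights(log_liks, alpha=2.0, tol=1e-12, max_iter=200):
    """Maximize sum_m lam_m * l_m + (alpha - 1) * sum_m log(lam_m)
    subject to lam_m > 0 and sum_m lam_m = 1 (assumed objective form)."""
    if alpha <= 1.0:
        raise ValueError("alpha must be > 1 for a strictly concave objective")
    c = alpha - 1.0
    num_streams = len(log_liks)
    l_max = max(log_liks)

    # KKT stationarity: l_m + c / lam_m = nu  =>  lam_m = c / (nu - l_m).
    def weight_sum(nu):
        return sum(c / (nu - l) for l in log_liks)

    # weight_sum decreases in nu; it is >= 1 at lo and <= 1 at hi.
    lo, hi = l_max + c, l_max + num_streams * c
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if weight_sum(mid) > 1.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    nu = 0.5 * (lo + hi)
    weights = [c / (nu - l) for l in log_liks]
    total = sum(weights)
    return [w / total for w in weights]  # remove residual bisection error


# Hypothetical log-likelihoods: the first stream explains the state better.
print(oracle_stream_weights([-1.0, -5.0], alpha=2.0))
```

Under this assumed objective, values of $\alpha$ close to one push the weights towards the boundary values, whereas large values of $\alpha$ pull them towards the uniform weighting $1/M$.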

IV Dynamic Stream Weight Prediction Models

To deploy the proposed state estimation framework in actual application scenarios, a remaining issue has to be solved: ODSWs can only be obtained for fully-observed models. Hence, DSWs must be estimated from available instantaneous sensor reliability measures. This procedure has already been established for DSW-based models in audiovisual ASR, where, for instance, the instantaneous estimated acoustic signal-to-noise ratio (SNR) was used as such a measure [20].

Let $\hat{\boldsymbol{\lambda}}_{k}=g(\boldsymbol{z}_{k},\,\boldsymbol{\theta})$ denote the general structure of a prediction model with parameters $\boldsymbol{\theta}$ , where the predicted DSWs at time step $k$ are denoted as $\hat{\boldsymbol{\lambda}}_{k}=\begin{bmatrix}\hat{\lambda}_{1,k}&\cdots&\hat{\lambda}_{M,k}\end{bmatrix}^{\mathrm{T}}$ and $\boldsymbol{z}_{k}$ is a vector of reliability measures or, more generally, features that describe the instantaneous measurement uncertainty associated with the corresponding sensors. The prediction model can be any nonlinear function, with the constraint that it is differentiable w.r.t. its parameters. Additionally, the individual function outputs must sum to one, cf. Sec. II-A. Prominent models that match these requirements are e.g. logistic functions or a neural network with softmax output layer.

Supervised training of DSW prediction models requires the availability of a training dataset, where ground-truth information about the state $\mathcal{X}_{\mathrm{train}}$ , the corresponding observations $\mathcal{Y}_{\mathrm{train}}$ and the associated reliability measures $\mathcal{Z}_{\mathrm{train}}$ are available. Although the Gaussian filtering paradigm utilizes time series data, individual data points used during training can be assumed i.i.d. as ODSWs can be estimated independently for each time step, cf. Sec. III-C.

The training phase is a two-stage process, which is illustrated in Fig. 2. First, ODSWs $\mathcal{L}^{\star}_{\mathrm{train}}$ are estimated for the available training data using the method described in Sec. III-C. Subsequently, supervised training of the model parameters is conducted using reliability measures as inputs and ODSWs as targets. Due to the constraint that DSWs must sum to one and the imposed symmetric Dirichlet prior, the ODSWs can be assumed to stem from a categorical distribution. Therefore, appropriate loss functions that can be exploited here are e.g. the Kullback-Leibler divergence (KLD) or the cross-entropy loss [42]. This allows a gradient-based optimizer to be used for the second stage of the training phase.
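The relation between the two loss functions can be made concrete with a short sketch. The ODSW target and the model output below are hypothetical values chosen for illustration only. The sketch shows that, for soft targets, the cross-entropy and the KLD differ only by the entropy of the target, which is independent of the model parameters.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum_m target_m * log(predicted_m)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); terms with target_m = 0 contribute zero."""
    return sum(
        t * (math.log(t) - math.log(max(p, eps)))
        for t, p in zip(target, predicted)
        if t > 0.0
    )


oracle = [0.8, 0.2]     # hypothetical ODSW target for one time step
predicted = [0.6, 0.4]  # hypothetical prediction model output

entropy = cross_entropy(oracle, oracle)
# Cross-entropy minus KLD equals the target entropy, a constant w.r.t. the
# model parameters, so both losses lead to the same gradients.
gap = cross_entropy(oracle, predicted) - kl_divergence(oracle, predicted)
print(abs(gap - entropy) < 1e-12)
```

Either loss can therefore be minimized with the same gradient-based optimizer; the choice only shifts the reported loss value by a constant.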

V Evaluation

The experimental evaluation in this work focuses on three scenarios: the evaluation of the proposed ODSW estimation technique, the comparison of the proposed framework with related state-of-the-art audiovisual speaker tracking methods and an empirical assessment of the computational complexity compared to the standard EKF.

V-A Audiovisual datasets

Three audiovisual datasets are used to conduct the experiments in this study. These datasets comprise a variety of recording conditions and were acquired using different acoustic and visual sensors. The evaluation data was purposefully selected to reflect a wide range of application scenarios with different dynamics and sensor disturbances. To give a first impression of the dataset variability, some exemplary still images from the video files of the evaluation corpora are depicted in Fig. 3.

The first audiovisual corpus was specifically recorded for this study and will be referred to as the Kinect Audiovisual Localization Corpus (KAVLoC). This dataset contains audiovisual recordings of seven (four male and three female) speakers, acquired in an office room with an average reverberation time of approximately $200\,\mathrm{ms}$. A Microsoft Kinect™ sensor was positioned on a table at a height of $0.9\,\mathrm{m}$. The participants were sitting on a chair facing the sensor at a distance of approximately $1.5\,\mathrm{m}$. Besides being advised to stay seated, they were allowed to move freely during the recordings. Throughout each recording session, the speakers were asked to read out sentences randomly selected from the CSTR VCTK corpus [43], which is composed of over 400 sentences taken from English newspapers. Ten audiovisual sequences of $30\,\mathrm{s}$ duration at two different positions were recorded for each speaker. Acoustic signals were acquired at a sampling rate of $16\,\mathrm{kHz}$ using the four-channel microphone array of the Kinect™ sensor. The corresponding video sequences were recorded with a resolution of $640\times 480$ pixels at a rate of $15$ frames per second (FPS). To obtain the ground-truth speaker locations, the positions of the speakers' faces were manually annotated in the recorded video signals. The total duration of all acquired audiovisual sequences is $35\,\mathrm{min}$.

The Nao Audiovisual Localization Corpus (NAVLoC) is used as a second dataset. It was already used in a previous work on DSWs for LDSs [19]. The audiovisual recordings in this dataset were obtained using the humanoid robot NAO in a laboratory environment with an average reverberation time of approximately $450\,\mathrm{ms}$ . A computer screen and a loudspeaker were positioned at a distance of $2\,\mathrm{m}$ from the robot. The screen was placed at the same height as the robot’s head. Audiovisual sequences from two male and two female speakers were selected randomly from the GRID corpus [44] and played back over the screen and the loudspeaker. For each speaker, 100 sequences with a duration of $2.5\,\mathrm{s}$ each were recorded using the four-channel microphone array and the upper camera of the NAO robot. For one half of these sequences, the robot was directly facing the screen, whereas for the other half, the robot’s head was turned $21^{\circ}$ to the right to enforce a different relative azimuth to the speaker. A sampling rate of $48\,\mathrm{kHz}$ was used for the acoustic recordings. The video signals were acquired with a resolution of $320\times 240$ pixels at $10$ FPS. As the ground-truth azimuth is directly related to the heading direction of the robot’s head, a manual annotation of the collected audiovisual data was not required. The total duration of the recorded sequences in the NAVLoC dataset is approximately $17\,\mathrm{min}$ . As the captured microphone signals are corrupted by fan noise of the NAO robot, this dataset is especially challenging regarding the acoustic localization performance.

The dataset for multimodal voice activity detection (MVAD) introduced in [45] serves as the third evaluation corpus in this study. It provides audiovisual sequences of single and multiple speakers in an office environment. The recordings were acquired using a Kinect™ sensor for capturing the video signals, whereas the audio was captured with an eight-channel linear microphone array. The audio sampling rate is $44.1\,\mathrm{kHz}$ and the video resolution is $640\times 480$ pixels at $10$ FPS. The duration of the individual recordings ranges from $40\,\mathrm{s}$ to $60\,\mathrm{s}$ , with silent periods of $4\,\mathrm{s}$ to $8\,\mathrm{s}$ in between speech segments. Throughout the recordings, the speakers always face the camera and their position changes only slightly. Out of the 31 audiovisual sequences provided in total, six recordings, where only a single speaker was present, were utilized for the experimental evaluation in this study.

V-B Evaluation metrics and significance tests

To assess the speaker localization and tracking performance, the circular root mean square error (RMSE)

[TABLE]

was employed as an evaluation metric [46], where $\hat{\phi}_{k}$ is the estimated azimuth at time-step $k$ , $\phi_{k}$ is the corresponding ground-truth azimuth angle, $K$ is the total number of time-steps in one test sequence and $k_{0}$ corresponds to the number of frames in the grace period. This metric was calculated individually for each audiovisual test sequence. A grace period with $10\,\%$ of the total sequence length was excluded at the beginning of each sequence to allow the Bayesian filtering frameworks to converge. A one-way analysis of variance (ANOVA) with Bonferroni correction [47] was used to assess statistical significance in all conducted experiments.
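A minimal sketch of such a metric is given below. Since the exact expression of the equation above is not reproduced here, the sketch assumes that angular differences are wrapped into $[-180^{\circ},\,180^{\circ})$ before squaring and that the grace period covers the first $k_{0}=\lfloor 0.1\,K\rfloor$ frames. The function name `circular_rmse` is illustrative.

```python
import math


def circular_rmse(estimated_deg, reference_deg, grace_fraction=0.1):
    """Circular RMSE in degrees, skipping an initial grace period."""
    if len(estimated_deg) != len(reference_deg):
        raise ValueError("sequences must have equal length")
    k0 = int(grace_fraction * len(estimated_deg))  # frames in the grace period
    squared = []
    for est, ref in zip(estimated_deg[k0:], reference_deg[k0:]):
        # Wrap the difference into [-180, 180) so that 359 vs. 1 counts as 2.
        diff = (est - ref + 180.0) % 360.0 - 180.0
        squared.append(diff * diff)
    if not squared:
        raise ValueError("no frames left after the grace period")
    return math.sqrt(sum(squared) / len(squared))


# The first frame lies in the grace period and is ignored.
estimates = [100.0] + [359.0] * 9
references = [0.0] + [1.0] * 9
print(circular_rmse(estimates, references))
```

The wrapping step is what distinguishes the circular from the ordinary RMSE: an estimate of $359^{\circ}$ for a true azimuth of $1^{\circ}$ contributes an error of $2^{\circ}$ rather than $358^{\circ}$.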

V-C Experimental setup

Throughout all experiments, acoustic signals were processed using an acoustic front-end, which includes an initial voice activity detection (VAD), followed by an instantaneous estimation of the SNR and the actual speaker localization. All processing steps were conducted frame-wise at time intervals matching the corresponding video frame rate.

The VAD [48] operates on the first channel of the available microphone signals to distinguish between speech and silence frames. Acoustic localization was performed during speech segments only and skipped otherwise. To obtain the instantaneous SNR, the unbiased minimum mean squared error (MMSE) estimator proposed in [49] was used to estimate the noise power at each time-frequency point. The noise power estimate was also used to enhance the noisy speech signals via Wiener filtering. The gross SNR $\xi_{k}\in\mathbb{R}$ averaged over all channels and frequencies was computed as a reliability measure, corresponding to the acoustic sensor uncertainty. Subsequent acoustic localization was performed on the enhanced speech segments using the steered response power phase transform (SRP-PHAT) algorithm [50]. Visual locations of the speaker’s face were extracted from the recorded video using the Viola-Jones algorithm [51] and converted to azimuth angles based on the calibrated camera images. A visual reliability measure, indicating a potential rotation of the speaker’s head, was derived from the detected face region by horizontally mirroring the image and computing the correlation coefficient $\rho_{k}\in[-1,\,1]$ between the original and the mirrored image [52]. Therefore, the individual vectors of reliability measures used for training the DSW prediction model can be expressed as $\boldsymbol{z}_{k}=\begin{bmatrix}\xi_{k}&\rho_{k}\end{bmatrix}^{\mathrm{T}}$ .
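The visual reliability measure can be sketched as follows, assuming the detected face region is available as a grayscale patch stored as a list of pixel rows. The function name `mirror_correlation` and the handling of constant patches are assumptions made for this illustration.

```python
def mirror_correlation(patch):
    """Pearson correlation between a grayscale patch (list of rows) and its
    horizontally mirrored copy; values near 1 indicate a symmetric patch."""
    original = [v for row in patch for v in row]
    mirrored = [v for row in patch for v in reversed(row)]
    mean = sum(original) / len(original)  # mirroring keeps the mean unchanged
    cov = sum((a - mean) * (b - mean) for a, b in zip(original, mirrored))
    var = sum((a - mean) ** 2 for a in original)
    if var == 0.0:
        return 1.0  # assumption: a constant patch equals its mirror image
    # Both images share the same variance, so the correlation is cov / var.
    return cov / var


frontal = [[1.0, 2.0, 1.0], [3.0, 5.0, 3.0]]  # left-right symmetric toy patch
rotated = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]  # strongly asymmetric toy patch
print(mirror_correlation(frontal), mirror_correlation(rotated))
```

A frontal face is approximately left-right symmetric and yields a coefficient close to one, whereas a rotated head reduces the symmetry and hence $\rho_{k}$.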

A constant velocity model [53] was used to model the system dynamics

[TABLE]

with

[TABLE]

where $T$ denotes the time between two consecutive discrete time-steps, $\sigma_{v}^{2}=0.3$ is a constant factor and the system state $\boldsymbol{x}_{k}=\begin{bmatrix}\phi_{k}&\dot{\phi}_{k}\end{bmatrix}^{\mathrm{T}}$ is encoded as the azimuthal speaker position $\phi_{k}$ and velocity $\dot{\phi}_{k}$ , respectively. As both acoustic and visual sensors directly observe angular values, a rotating vector model (RVM) [54] represents the circular nature of observed azimuth angles as

[TABLE]

where $\sigma_{w,m}^{2}=0.01$ denotes the observation noise variance of the $m$-th sensor and $m\in\{1,\,2\}$. It should be noted that the system dynamics are based on a linear model. Hence, the standard KF prediction step can be exploited here. However, the nonlinear observation models must be handled using the corresponding Jacobians to perform EKF-based updates.
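The interplay of linear dynamics and nonlinear observations can be illustrated with a short sketch. It assumes the common forms of both models, namely a transition that advances the azimuth by $T\dot{\phi}_{k}$ and an RVM observation that maps the azimuth to the unit vector $[\cos\phi_{k},\ \sin\phi_{k}]^{\mathrm{T}}$. These forms and all function names are assumptions for illustration. The process noise covariance is omitted because its exact structure is not reproduced here.

```python
import math


def predict_state(state, dt):
    """Constant-velocity mean prediction: azimuth advances by velocity * dt."""
    phi, phi_dot = state
    return [phi + dt * phi_dot, phi_dot]


def rvm_observation(state):
    """Map the azimuth (in radians) to a point on the unit circle."""
    phi = state[0]
    return [math.cos(phi), math.sin(phi)]


def rvm_jacobian(state):
    """Jacobian of rvm_observation w.r.t. [phi, phi_dot]; the velocity does
    not enter the observation, so the second column is zero."""
    phi = state[0]
    return [[-math.sin(phi), 0.0], [math.cos(phi), 0.0]]


# Compare the analytic Jacobian with a central finite difference.
x = [0.7, 0.1]
h = 1e-6
upper = rvm_observation([x[0] + h, x[1]])
lower = rvm_observation([x[0] - h, x[1]])
numeric = [(a - b) / (2.0 * h) for a, b in zip(upper, lower)]
print(numeric, [row[0] for row in rvm_jacobian(x)])
```

Representing the angle as a unit vector avoids the discontinuity at $\pm 180^{\circ}$ in the innovation, at the cost of a nonlinear observation function that requires the Jacobian above.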

The logistic function is utilized as a DSW prediction model

[TABLE]

where $\boldsymbol{w}\in\mathbb{R}^{2}$ is the weight vector and $b\in\mathbb{R}$ is a bias term. The DSW prediction model parameters can thus be expressed as $\boldsymbol{\theta}=\{\boldsymbol{w},\,b\}$. As the number of independent observations is fixed as $M=2$ throughout all experiments, it is sufficient to only predict the first (acoustic) DSW using Eq. (30), as the second (visual) DSW is defined as $\lambda_{2,k}=1-\lambda_{1,k}$. The model is trained by minimizing the cross-entropy loss using standard stochastic gradient descent (SGD).
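The training of such a model can be sketched as follows, assuming the standard logistic form $\hat{\lambda}_{1,k}=1/(1+\exp(-(\boldsymbol{w}^{\mathrm{T}}\boldsymbol{z}_{k}+b)))$. The learning rate, the number of epochs, the toy reliability measures and the toy ODSW targets are hypothetical values chosen for illustration and do not stem from the experiments.

```python
import math
import random


def predict_acoustic_weight(z, w, b):
    """Logistic model for the acoustic DSW; the visual DSW is 1 minus this."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    a = max(-60.0, min(60.0, a))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-a))


def train_sgd(features, targets, lr=0.1, epochs=200, seed=0):
    """Fit w and b by SGD on the cross-entropy with soft ODSW targets."""
    rng = random.Random(seed)
    w, b = [0.0] * len(features[0]), 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            # Gradient of the cross-entropy w.r.t. the pre-activation.
            err = predict_acoustic_weight(features[i], w, b) - targets[i]
            w = [wi - lr * err * zi for wi, zi in zip(w, features[i])]
            b -= lr * err
    return w, b


# Toy data: z = [normalized SNR, mirror correlation], target = acoustic ODSW.
features = [[-1.0, 0.9], [0.0, 0.5], [1.0, 0.2]]
targets = [0.2, 0.5, 0.8]
w, b = train_sgd(features, targets)
print([predict_acoustic_weight(z, w, b) for z in features])
```

With soft targets, the gradient of the cross-entropy with respect to the pre-activation reduces to the difference between the predicted and the oracle weight, which keeps the update rule identical to that of ordinary logistic regression.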

VI Results and Discussion

This section describes the results obtained for the three evaluation scenarios investigated in this study. It should be noted that all conducted experiments focus on single-speaker scenarios only. This restriction was chosen on purpose, as it allows an exclusive focus on the localization and tracking performance, without taking into account additional external factors like data association ambiguities, estimating the number of speakers and track-to-track fusion. Multi-speaker tracking is an important issue that must be taken into account for many potential applications. However, as the proposed framework is based on the conventional EKF paradigm, it can be easily extended using existing probabilistic data association techniques, cf. [55]. This is outside the scope of this study and will be investigated in future work.

VI-A Oracle dynamic stream weight performance

The first evaluation scenario focuses on the ODSW estimation technique proposed in this study. To analyze the tracking performance under different sensor reliability conditions, the audiovisual signals from all three datasets are augmented with systematic disturbances. Following the approach introduced in [52], the acoustic signals are perturbed with diffuse white noise at different SNR levels ( $0\,\mathrm{dB}$ , $15\,\mathrm{dB}$ and $30\,\mathrm{dB}$ ), calculated over each sequence. Image rotations of $10^{\circ}$ , $15^{\circ}$ and $20^{\circ}$ are used to simulate disturbances of the visual modality. The average circular RMSE is evaluated for each audiovisual sequence in each condition. The standard EKF with audiovisual observations serves as the baseline. Results for the single-modality EKF with either audio-only or video-only observations are also analyzed for comparison. The proposed ODSW estimation framework is assessed using both the Gaussian prior as proposed in [19], as well as the Dirichlet prior from this work. All experiments were performed following a leave-one-out cross-validation scheme. Tab. I summarizes the results achieved in this evaluation scenario.
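The acoustic disturbance can be illustrated with a simplified sketch that scales white Gaussian noise such that a target SNR, calculated over the whole sequence, is met. The sketch handles a single channel only and does not model a diffuse multi-channel noise field. The function name and the test signal are assumptions for illustration.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise so that the SNR over the whole sequence
    equals snr_db (single channel; no diffuse-field modelling)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale the noise to reach the requested signal-to-noise power ratio.
    gain = math.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(signal, noise)]


# Hypothetical test signal: a 440 Hz tone sampled at 16 kHz.
clean = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(1600)]
noisy = add_white_noise(clean, snr_db=15.0)
residual = [n - c for n, c in zip(noisy, clean)]
measured = 10.0 * math.log10(
    sum(c * c for c in clean) / sum(r * r for r in residual)
)
print(measured)
```

Because the gain is derived from the realized noise power rather than its expected value, the requested SNR is met for each individual sequence and not only on average.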

The results obtained for the single-modality EKF baselines on the KAVLoC dataset indicate that acoustic and visual localization achieve similar performance for this corpus. Audiovisual fusion using the standard EKF slightly improves localization accuracy over the individual performances, which suggests that fusing both modalities is beneficial on this dataset. The ODSW-EKF further improves performance compared to the audiovisual EKF. However, this improvement is not statistically significant in the undistorted case and there is only a slight difference between ODSWs obtained with a Gaussian prior and the proposed Dirichlet-prior based ODSW-EKF. Statistically significant improvements were obtained in situations with large disturbances, e.g. $0\,\mathrm{dB}$ SNR and $20^{\circ}$ image rotation. This observation supports the hypothesis that, without proper adaptation, the standard EKF is unable to handle large sensor disturbances effectively. This effect can be observed in all evaluated conditions for this dataset: the performance improvement of the ODSW-EKF over the EKF baseline increases with increasing difference between the single-modality cases.

The particular challenge of the NAVLoC dataset is that both acoustic and visual sensors are already affected by significant disturbances, even in the undistorted case. This is primarily caused by fan noise and reverberation for the audio and low image resolution and bright lighting conditions for the video signal. Hence, the systematic disturbances added to the raw audiovisual signals have only little effect, which is reflected by the results obtained for the single-modality EKFs. Both ODSW-EKFs yield statistically significant improvements over the audiovisual EKF baseline in all cases except for an image rotation of $20^{\circ}$. However, the audiovisual EKF is also outperformed by the video-only EKF in some conditions, which even achieves localization performance similar to the ODSW-EKFs. This leads to the conclusion that in cases where all available sensors suffer from large disturbances, the standard EKF again fails to perform efficient sensor fusion without adaptation. The ODSW-EKF, in contrast, is able to cope with this situation, but is limited by the performance of the best-performing modality.

An improved performance of the ODSW-EKF in terms of the mean azimuth error can also be observed for the MVAD corpus, but due to the small sample size, it is not possible to show statistical significance. A comparison of the single-modality results for this dataset indicates that the visual modality has a largely improved reliability over the acoustic sensors. This again leads to a slightly degraded performance of the audiovisual EKF. A reduced localization error compared to the EKF baseline and the single-modality EKFs is achieved by both ODSW-EKFs in conditions with acoustic disturbance. A disturbance of the visual modality leads to similar performance for all evaluated methods.

For all evaluated datasets, only marginal performance differences between the Gaussian prior-based ODSW-EKF and the proposed Dirichlet prior are present. This indicates that both methods are capable of producing reliable ODSW estimates. However, as discussed in Sec. III, the proposed Dirichlet prior has a plausible probabilistic interpretation and only requires the tuning of a single hyperparameter. Furthermore, empirical observations during the experiments indicated that the method is insensitive to the choice of the concentration parameter to a certain degree. However, this has not been systematically evaluated.

VI-B Audiovisual tracking performance analysis

A comparison of the Bayesian filtering framework proposed in this study with state-of-the-art audiovisual speaker tracking methods is the primary focus of the second evaluation scenario. Four different frameworks were selected as baseline methods: the standard EKF with audiovisual observations, the audiovisual fusion technique based on an iterated EKF as proposed by Gehring et al. [30], the PF-based approach with adaptive particle weighting introduced by Gerlach et al. [31] and the recently proposed framework by Qian et al. [32], which explicitly incorporates sensor reliability measures into the weighting stage of the PF. These methods are compared with the ODSW-EKF with Dirichlet prior and a DSW-EKF with corresponding prediction model based on the logistic function, as introduced in Sec. IV. The model utilizes the acoustic and visual reliability measures described in Sec. V-C. A leave-one-out cross-validation procedure identical to the first evaluation scenario is utilized here. The audiovisual sequences of one speaker served as a test set and the sequences from all other speakers were used for training and validation. This procedure was repeated for all speakers in each dataset. More sophisticated realizations of the DSW prediction model, e.g. a neural network, can also be exploited for this task. However, initial experiments performed throughout the course of this study have shown that models with increased complexity do not yield any significant benefit over the logistic function utilized here, given the limited set of provided reliability measures. A thorough analysis of specific reliability measures or even the end-to-end training of DSW prediction models are beyond the scope of this work. The results for all evaluated methods are shown in Tab. II, where the achieved circular RMSEs are averaged over all systematic disturbance conditions and cross-validation folds.
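The speaker-wise cross-validation procedure can be sketched as follows. The data layout, a list of `(speaker_id, sequence)` pairs, and the function name are assumptions for illustration. The further division of the remaining sequences into training and validation parts is not shown.

```python
def leave_one_speaker_out(sequences):
    """Yield (held_out_speaker, train, test) splits, one per speaker.

    `sequences` is a list of (speaker_id, sequence) pairs."""
    speakers = sorted({speaker for speaker, _ in sequences})
    for held_out in speakers:
        train = [seq for speaker, seq in sequences if speaker != held_out]
        test = [seq for speaker, seq in sequences if speaker == held_out]
        yield held_out, train, test


# Hypothetical corpus with three speakers and two sequences each.
corpus = [(s, f"{s}_seq{i}") for s in ("spk1", "spk2", "spk3") for i in range(2)]
for speaker, train, test in leave_one_speaker_out(corpus):
    print(speaker, len(train), len(test))
```

Splitting by speaker rather than by sequence ensures that no recordings of the test speaker are seen while fitting the DSW prediction model.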

For the KAVLoC dataset, the proposed ODSW-EKF with Dirichlet prior outperforms all baseline methods with a statistically significant performance benefit. As this result is achieved using a fully observed model, it can be considered an upper bound on performance using this method, which cannot be exceeded by the corresponding DSW prediction model. However, even with the rather limited number of reliability measures utilized in this study, the DSW-EKF yields a tracking performance that is similar to all baseline methods on this dataset. More advanced prediction models and an improved selection of reliability measures might help to shift the DSW-EKF tracking performance closer to the ODSW limit.

Similar results were obtained for the NAVLoC dataset. The DSW-EKF shows a statistically significant performance improvement compared to the PF-based method proposed in [31]. There is only a marginal difference between the DSW-EKF and the ODSW-EKF, and the DSW-EKF performs better on average than the remaining baseline methods. It should be noted that this dataset is challenging especially for PF-based methods, as the speaker position is fixed throughout all audiovisual sequences. This corresponds to a small process noise, which cannot be handled efficiently using PFs. However, the PF baseline method from [32] shows a performance comparable to the EKF-based methods, which indicates that the MDF approach used in this algorithm is efficient on this dataset.

Lastly, the proposed ODSW-EKF and DSW-EKF frameworks yield similar performance to the PF-based methods on the MVAD dataset. The average achieved azimuth errors indicate that the method from [31] outperforms the algorithms proposed in this study. However, due to the limited sample size, it is difficult to reliably show significant differences on this corpus. The fact that both DSW-EKF and ODSW-EKF algorithms have an identical average performance further indicates that the exploited reliability measures provide a suitable means to perform DSW prediction on this dataset.

VI-C Empirical analysis of computational complexity

The third evaluation scenario aims at empirically analyzing the computational performance of the proposed DSW-EKF compared to the standard EKF. This analysis is conducted using synthetic data, generated from DSs with varying state and observation dimensions. Figs. 4 and 5 depict the results of this analysis, which were obtained from Monte Carlo experiments with $25$ runs per condition using randomly generated observation sequences with $100$ time steps each. All model parameters were set to identity matrices if applicable, yielding linear models without the requirement of explicitly computing state and observation Jacobians. The experiments were conducted on a single desktop computer with an Intel® Core™ i5 processor and $16\,\mathrm{GB}$ RAM running Ubuntu $16.04$ .

The results indicate that the standard EKF is up to four times faster for simple models with low state and observation dimensionality. As the state dimension increases, this advantage diminishes, and both filters reach similar computational performance for large state spaces with $D_{x}>100$. A similar effect is present for the observation dimensionality, where the EKF outperforms the DSW-EKF for $D_{y_{m}}\leq 5$. However, with increasing observation dimensionality and number of independent observations, both EKF and DSW-EKF show similar performance.

VII Conclusion and Outlook

In this study, a framework was introduced that extends the classical notion of dynamical systems with dynamic stream weights. A recursive state estimation scheme based on the Gaussian filtering paradigm was proposed. Additionally, a convex optimization approach to estimate oracle dynamic stream weights in fully observed dynamical systems utilizing a Dirichlet prior was derived. It was evaluated against a previously proposed method based on a Gaussian prior and the standard extended Kalman filter, showing similar performance with a reduced number of tunable hyperparameters. A generic parameter learning framework for dynamic stream weight estimators was derived on the basis of previously computed oracle dynamic stream weights. A study using three different audiovisual speaker tracking datasets confirmed improved localization performance of the dynamic stream weight-based estimation framework over state-of-the-art methods.

Future research directions will focus on improving dynamic stream weight prediction models via suitable reliability measures. The measures utilized in this study only serve as a starting point and need to be investigated in depth and possibly be adapted to different applications. A thorough analysis of these measures by means of feature selection may yield interesting theoretical insights into the reliability of audiovisual sensors. Additionally, the extension to multi-speaker scenarios by incorporating probabilistic data association techniques into the tracking framework will make the proposed system suitable for a wider range of technical applications. A particular challenge in such scenarios will be the investigation of speaker-dependent reliability measures. Making the proposed system trainable end-to-end using deep neural networks might be a promising approach to tackle this particular challenge.

Appendix A Solvability Analysis of the System of Linear Matrix Equations in Eq. (17)

The left-hand side of Eq. (17) contains the matrix expression $\boldsymbol{R}+\boldsymbol{U}_{k}\boldsymbol{W}_{k}\boldsymbol{U}_{k}^{\mathrm{T}}$ , which needs to be inverted to obtain a unique solution for the individual Kalman gains. As introduced in Sec. II-B, $\boldsymbol{W}_{k}\in\mathbb{R}^{MD_{x}\times MD_{x}}$ can be expressed as the Kronecker product $\boldsymbol{L}_{k}\otimes\hat{\boldsymbol{\Sigma}}_{k|k-1}$ , where $\boldsymbol{L}_{k}\in\mathbb{R}^{M\times M}$ is given in Eq. (18). Since all rows in $\boldsymbol{L}_{k}$ are linearly dependent, $\mathrm{rank}(\boldsymbol{L}_{k})=1$ . Additionally, $\hat{\boldsymbol{\Sigma}}_{k|k-1}$ is a $D_{x}$ -dimensional covariance matrix with $\mathrm{rank}(\hat{\boldsymbol{\Sigma}}_{k|k-1})=D_{x}$ . Hence, the rank equality of the Kronecker product [56] can be exploited here, which yields $\mathrm{rank}(\boldsymbol{W}_{k})=\mathrm{rank}(\boldsymbol{L}_{k})\cdot~{}\mathrm{rank}(\hat{\boldsymbol{\Sigma}}_{k|k-1})=D_{x}$ . Therefore, $\boldsymbol{W}_{k}$ is singular for $M>1$ , which requires the appropriate form of the binomial inverse theorem [35] to obtain a unique solution for Eq. (17).

Appendix B Proof of Concavity of the Fully-Observed Likelihood function with Dirichlet prior

The first and second derivatives of the log-likelihood function defined in Eq. (25) are

[TABLE]

and

[TABLE]

The second derivative is negative with the constraint $\alpha>1$ for $0<\lambda_{m,k}<1~{}\forall\,m$ . Hence, the first derivative is a strictly monotonically decreasing function in this parameter range, which implies that the log-likelihood function is strictly concave, cf. [39, Chap. 3].

Bibliography


[1] T. N. N. Hossein, S. Mita, and H. Long, “Multi-sensor data fusion for autonomous vehicle navigation through adaptive particle filter,” in IEEE Intelligent Vehicles Symposium, June 2010, pp. 752–759.
[2] M. Ravanbakhsh, M. Baydoun, D. Campo, P. Marin, D. Martin, L. Marcenaro, and C. S. Regazzoni, “Learning multi-modal self-awareness models for autonomous vehicles from human driving,” in 2018 21st International Conference on Information Fusion (FUSION), July 2018, pp. 1866–1873.
[3] K. D. Katyal, M. S. Johannes, T. G. McGee, A. J. Harris, R. S. Armiger, A. H. Firpi, D. McMullen, G. Hotson, M. S. Fifer, N. E. Crone, R. J. Vogelstein, and B. A. Wester, “Harmonie: A multimodal control framework for human assistive robotics,” in 2013 6th International IEEE/EMBS Conference on Neural Engineering (NER), Nov 2013, pp. 1274–1278.
[4] E. Ivorra, M. Ortega, M. Alcañiz, and N. Garcia-Aracil, “Multimodal computer vision framework for human assistive robotics,” in 2018 Workshop on Metrology for Industry 4.0 and IoT, April 2018, pp. 1–5.
[5] A. Fleury, M. Vacher, and N. Noury, “SVM-based multimodal classification of activities of daily living in health smart homes: Sensors, algorithms, and first experimental results,” IEEE Transactions on Information Technology in Biomedicine, vol. 14, no. 2, pp. 274–283, March 2010.
[6] H. Medjahed, D. Istrate, J. Boudy, J. Baldinger, and B. Dorizzi, “A pervasive multi-sensor data fusion for smart home healthcare monitoring,” in IEEE International Conference on Fuzzy Systems, June 2011, pp. 1466–1473.
[7] M. S. Hossain, “Patient status monitoring for smart home healthcare,” in 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), July 2016, pp. 1–6.
[8] G. Potamianos, “Audio-visual automatic speech recognition and related bimodal speech technologies: A review of the state-of-the-art and open problems,” in 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, Nov 2009, pp. 22–22.