Robust Binaural Localization of a Target Sound Source by Combining Spectral Source Models and Deep Neural Networks

Ning Ma; Jose A. Gonzalez; Guy J. Brown

arXiv:1904.03006·eess.AS·April 8, 2019

TL;DR

This paper introduces a novel binaural sound localization framework that combines spectral source models with deep neural networks, enhancing accuracy in noisy and reverberant environments through model adaptation and joint explanation of mixed signals.

Contribution

It proposes a new hybrid approach integrating model-based spectral information and DNNs for improved sound source localization, including on-the-fly model adaptation during testing.

Findings

01. Significant performance improvement in noisy, reverberant conditions

02. Effective joint explanation of mixed sound sources

03. Robust localization with on-the-fly model adaptation

Tables

TABLE I: Room characteristics of the Surrey BRIR database [29].

Room	A	B	C	D
$T_{60}$ (s)	0.32	0.47	0.68	0.89
DRR (dB)	6.09	5.31	8.82	6.12

TABLE II: Descriptions of masker sounds used in Noise Set A.

Masker	Description
alarm	car alarm sound, rhythmic moderate-narrow-band signal between 600 Hz and 3 kHz.
drums	drumming sound, strong onsets synchronised across frequency, and with significant energy distributed between 100–300 Hz.
car engine	highly modulated at 15 Hz with most energy distributed below 300 Hz.
piano	fast playing solo piano sound, with most energy distributed below 2 kHz.
baby crying	crying baby sound, high pitch, less rhythmic with harmonics often lasting longer than 1 second.
16-talker babble	created from 16 talkers randomly selected from the TIMIT corpus [31], mostly overlapping, speech frequency range.

TABLE III: Descriptions of masker sounds used in Noise Set B.

Masker	Description
telephone ring	telephone ring, periodic and narrow-band sound with significant energy at around 1 kHz and above 3 kHz.
32-talker babble	created from 32 talkers randomly selected from the TIMIT corpus, mostly overlapping, speech frequency range.

TABLE IV: The number of Gaussian mixture components used for each source model.

Group	Target	Noise Set A
Source	speech	alarm	drum	car	piano	baby	8-talker	UBM
Components	16	2	2	2	3	3	4	8

Equations

(1)  P(\phi \mid o_{t}) = \frac{\prod_{f} P(\phi \mid o_{tf})^{\omega_{tf}}}{P(o_{t})}

(2)  P(o_{t}) = \sum_{\phi} \prod_{f} P(\phi \mid o_{tf})^{\omega_{tf}}

(3)  P(\phi \mid o_{1 \dots T}) = \frac{1}{T} \sum_{t}^{t+T-1} P(\phi \mid o_{t})

(4)  \hat{\phi} = \arg\max_{\phi} P(\phi \mid o_{1 \dots T})

(5)  y_{f} \approx \max(x_{f}, n_{f})

(6)  \omega_{f} \triangleq P(x_{f} \geq n_{f} \mid y) \equiv P(x_{f} = y_{f}, n_{f} \leq y_{f} \mid y)

(7)  p(y \mid \lambda_{x}) = \sum_{k=1}^{K_{x}} P(k \mid \lambda_{x}) \mathcal{N}(y; \mu_{x}^{(k)}, \Sigma_{x}^{(k)})

(8)  \omega_{f} = \sum_{k_{x}} \sum_{k_{n}} \gamma^{(k_{x}, k_{n})} P(x_{f} = y_{f}, n_{f} \leq y_{f} \mid k_{x}, k_{n}, y)

(9)  \gamma^{(k_{x}, k_{n})} \triangleq P(k_{x}, k_{n} \mid y) = \frac{p(y \mid k_{x}, k_{n}) P(k_{x}) P(k_{n})}{\sum_{k_{x}', k_{n}'} p(y \mid k_{x}', k_{n}') P(k_{x}') P(k_{n}')}

(10)  p(y \mid k_{x}, k_{n}) = \prod_{f=1}^{D} p(y_{f} \mid k_{x}, k_{n})

(11)  p(y_{f} \mid k_{x}, k_{n}) = \iint p(y_{f} \mid x_{f}, n_{f}) p(x_{f} \mid k_{x}) p(n_{f} \mid k_{n}) \, dx_{f} \, dn_{f}

(12)  p(y_{f} \mid x_{f}, n_{f}) = \delta(y_{f} - x_{f}) \mathds{1}_{n_{f} \leq x_{f}} + \delta(y_{f} - n_{f}) \mathds{1}_{x_{f} < n_{f}}

(13)  p(y_{f} \mid k_{x}, k_{n}) = p_{x}(y_{f} \mid k_{x}) C_{n}(y_{f} \mid k_{n}) + p_{n}(y_{f} \mid k_{n}) C_{x}(y_{f} \mid k_{x})

(14)  P(x_{f} = y_{f}, n_{f} \leq y_{f} \mid k_{x}, k_{n}, y) = \frac{p(x_{f} = y_{f}, n_{f} \leq y_{f} \mid k_{x}, k_{n})}{p(x_{f} = y_{f}, n_{f} \leq y_{f} \mid k_{x}, k_{n}) + p(n_{f} = y_{f}, x_{f} < y_{f} \mid k_{x}, k_{n})} = \frac{p_{x}(y_{f} \mid k_{x}) C_{n}(y_{f} \mid k_{n})}{p(y_{f} \mid k_{x}, k_{n})}

(15)  \omega_{f} = \sum_{k_{x}, k_{n}} \frac{\gamma^{(k_{x}, k_{n})} p_{x}(y_{f} \mid k_{x}) C_{n}(y_{f} \mid k_{n})}{p_{x}(y_{f} \mid k_{x}) C_{n}(y_{f} \mid k_{n}) + p_{n}(y_{f} \mid k_{n}) C_{x}(y_{f} \mid k_{x})}

(16)  p(y \mid \lambda_{n}) = \sum_{k} P(k \mid \lambda_{n}) \mathcal{N}(y; \mu_{n}^{(k)} + \beta, \Sigma_{n}^{(k)})

(17)  Q(\beta, \hat{\beta}) = \sum_{t} \sum_{k_{x}} \sum_{k_{n}} \gamma_{t}^{(k_{x}, k_{n})} \sum_{f} \log p(y_{tf} \mid k_{x}, k_{n}, \beta_{f})

(18)  \beta_{f} = \frac{1}{T} \sum_{t} \sum_{k_{x}} \sum_{k_{n}} \gamma_{t}^{(k_{x}, k_{n})} (\overline{n}_{tf}^{(k_{x}, k_{n})} - \mu_{nf}^{(k_{n})})

(19)  \overline{n}_{tf}^{(k_{x}, k_{n})}

(20)  \tilde{n}_{tf}^{(k_{n})} = \int_{-\infty}^{y_{tf}} n_{tf} \, p(n_{tf} \mid k_{n}, \hat{\beta}_{f}) \, dn_{tf}

Full text

Ning Ma, Jose A. Gonzalez and Guy J. Brown

This work was supported by the EC FP7 project Two!Ears under grant agreement No. 618075. N. Ma and G. J. Brown are with the Department of Computer Science, University of Sheffield, Sheffield, UK, S1 4DP (email: {n.ma, g.j.brown}@sheffield.ac.uk). J. A. Gonzalez is with the Department of Languages and Computer Sciences, University of Malaga, Campus de Teatinos, Malaga 29071, Spain (email: [email protected]). This work was carried out while Gonzalez was working at the University of Sheffield.

Abstract

Despite there being clear evidence for top-down (e.g., attentional) effects in biological spatial hearing, relatively few machine hearing systems exploit top-down model-based knowledge in sound localisation. This paper addresses this issue by proposing a novel framework for binaural sound localisation that combines model-based information about the spectral characteristics of sound sources and deep neural networks (DNNs). A target source model and a background source model are first estimated during a training phase using spectral features extracted from sound signals in isolation. When the identity of the background source is not available, a universal background model can be used. During testing, the source models are used jointly to explain the mixed observations and improve the localisation process by selectively weighting source azimuth posteriors output by a DNN-based localisation system. To address the possible mismatch between training and testing, a model adaptation process is further employed on-the-fly during testing, which adapts the background model parameters directly from the noisy observations in an iterative manner. The proposed system therefore combines model-based and data-driven information flow within a single computational framework. The evaluation task involved localisation of a target speech source in the presence of an interfering source and room reverberation. Our experiments show that by exploiting model-based information in this way, sound localisation performance can be improved substantially under various noisy and reverberant conditions.

Index Terms:

binaural source localisation, machine hearing, reverberation, sound source combination, masking

I Introduction

It has been long established that human listeners use both bottom-up (data-driven) and top-down (model-based) information in order to understand an acoustic scene (e.g., [1]). More specifically, both kinds of information are needed in order to answer ‘what’ and ‘where’ questions about the acoustic scene, i.e. what the identities of the sound sources are, and where they are in space. Many machine hearing studies have proposed computational approaches for answering these questions, by combining techniques for source separation, classification and sound localisation [2]. However, in such machine systems, data-driven and model-based mechanisms are typically much less well-integrated than they appear to be in biological hearing. The current paper addresses this issue, by proposing a novel machine hearing system that tightly integrates knowledge about source spectral characteristics with a mechanism for binaural localisation. The resulting system improves binaural sound localisation under challenging acoustic conditions in which multiple sound sources and room reverberation are present.

Psychophysical studies of human hearing have found evidence for top-down effects in sound localisation. For example, listeners take less time to localise a target sound when it is preceded by a cue on the same side of the head, suggesting that sound localisation is enhanced by covert orienting [3]. More recently, based on a review of psychophysical data, Bronkhorst [4] proposed a conceptual model of hearing in which top-down models inform the selection of binaural cues needed for localisation of specific sounds. Physiological studies in animals have also shown that sound localisation is modulated by top-down mechanisms. Studies of the barn owl have shown that selective attention influences sound localisation, including orienting behaviour such as body and head movements. More specifically, neural responses in the midbrain of the owl are enhanced if they are associated with the location of behaviourally relevant stimuli, such as a source of food [5]. Likewise, neural circuitry for gaze control can exert a top-down effect on the responses of auditory neurons that are tuned to particular spatial locations [6]. In summary, these psychophysical and physiological findings suggest that mechanisms of spatial hearing are tightly integrated with top-down, cross-modal processing in the brain.

In contrast, relatively few systems for machine hearing exploit top-down model-based information in source localisation. In [7], Ma et al. proposed a robust binaural localisation system based on deep neural networks (DNNs) for localisation of multiple sources. However, no source models were used and the system reported azimuth estimates for all directional sources. In [8], Christensen et al. exploit a pitch cue to identify local spectro-temporal fragments that are considered to be dominated by a single source, before integrating binaural cues over such fragments. The integrated binaural cues were more robust for localising multiple speakers, but no explicit model-based information was exploited. Mandel et al. [9] have proposed a probabilistic model for joint sound source localisation and separation based on interaural spatial cues and binary masking. Given the number of sources, the system iteratively updates Gaussian mixture models (GMMs) of spatial cues using the expectation-maximization (EM) algorithm. However, the benefit of the system was demonstrated in terms of source separation rather than localisation. Two systems for binaural localisation of multiple sources recently proposed in [10, 11] use statistical frameworks to jointly perform localisation and pitch-based segregation. However, they do not use information about source characteristics other than through statistical models of pitch dynamics and binaural cues. In [12, 13], model-based information is explicitly employed together with binaural cues, but the task in those studies was automatic speech recognition rather than sound localisation. Closest to the approach proposed here is the attention-driven model of sound localisation proposed in [14], in which top-down connections from a cortical model are used to potentiate responses to attended locations. However, attentional control in their model is driven by a simple neural circuit that fixates on sounds arriving from the same spatial location, and their approach is currently unable to localise multiple sound sources.

The current paper proposes a framework for sound localisation in which model-based information about the spectral characteristics of sound sources in the acoustic scene is used to selectively weight binaural cues. Models for the target and the background sources are first estimated in a training stage, using spectral features computed from source signals in isolation. When the identity of the background source is not available, a universal background model can be used. The source models are then used during the testing stage to jointly explain the mixed observations in terms of which spectral features belong to each source (target or masker), and thereby improve the localisation process by selectively weighting binaural cues in each time-frequency bin. In our previous work [15], model-based information was incorporated in a GMM-based localisation system, whereas in this study it is used in a state-of-the-art DNN-based system [7].

The proposed system also addresses the possible mismatch due to power level differences between training and testing. An iterative adaptation process is employed on-the-fly to estimate a frequency-dependent scaling factor for the background model parameters directly from the mixed signals. The proposed system therefore combines data-driven and model-based information flow within a single computational framework. We show that by exploiting source models in this way, sound localisation performance can be improved under conditions in which multiple sources and room reverberation are present.

The rest of the paper is organised as follows. In the next section we give a system overview and describe features used for source localisation and for modelling source spectral characteristics. A source localisation framework is then presented in Section III which allows model-based information of source spectral characteristics to be incorporated. Evaluation and experiments are described in Section IV. Section V discusses the results in detail. Finally, Section VI concludes the paper and makes suggestions for future work.

II System overview

Figure 1 shows a schematic diagram of the proposed binaural sound localisation system. A target source and an interfering source (masker) are spatialised in a virtual environment, which allows sounds to be placed in the full $360^{\circ}$ azimuth range around a virtual listener. In this study, prior knowledge of the target source is assumed to be available, so that the system can “attend” to the target for localisation. No prior knowledge of the masker is assumed when a universal background model (UBM) is used, but knowledge of the masker can be exploited by the framework when available (see Section III-B for more details).

The binaural ear signals are analysed by a binaural auditory model, which consists of a bank of 32 overlapping Gammatone filters with centre frequencies uniformly spaced on the equivalent rectangular bandwidth (ERB) scale between 80 Hz and 8 kHz [2]. Inner hair cell function was approximated by half-wave rectification. Subsequently, a cross-correlation between the right and left ears was computed independently for each of the 32 frequency bands using overlapping frames of 20 ms with a shift of 10 ms. The cross-correlation function (CCF) was further normalised by the autocorrelation value at lag zero as described in [16] and evaluated for time lags in the range of $\pm$ 1.1 ms.

Two binaural features, interaural time differences (ITDs) and interaural level differences (ILDs), are typically used in binaural localisation systems [17]. ITD is estimated as the lag corresponding to the maximum in the CCF. However, it has been shown that the CCF outperforms the ITDs in state-of-the-art DNN-based localisation systems [18, 7]. Thus, in this work we use the entire CCF as localisation features. For signals sampled at a rate of 16 kHz, the CCF with a lag range of $\pm 1$ ms (16 samples either side of lag zero) produced a 33-dimensional binaural feature space for each frequency band. This was supplemented by the ILD, which corresponds to the energy ratio between the left and right ears within each analysis window, expressed in dB. The final 34-dimensional (34D) feature vectors were passed to a DNN-based localisation system which estimates posterior probabilities for each azimuth in the full $360^{\circ}$ azimuth range. DNNs were trained using sound mixtures consisting of the target and masker sources rendered in a virtual acoustic environment.
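The 34D feature vector for one frequency band and one frame can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `ccf_ild_features`, the normalisation by the geometric mean of the two zero-lag autocorrelations, and the small constant guarding against division by zero are all assumptions, and the inputs are assumed to be one 20 ms frame (320 samples at 16 kHz) of the half-wave rectified Gammatone output for each ear.

```python
import numpy as np

def ccf_ild_features(left, right, max_lag=16):
    """Return 2 * max_lag + 1 normalised CCF values followed by the ILD in dB."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    n = len(left)
    # Full cross-correlation; index n - 1 corresponds to lag zero.
    full = np.correlate(left, right, mode="full")
    ccf = full[n - 1 - max_lag:n + max_lag]
    # Assumed normalisation: geometric mean of the two zero-lag autocorrelations.
    energy_l = np.sum(left ** 2) + 1e-12
    energy_r = np.sum(right ** 2) + 1e-12
    ccf = ccf / np.sqrt(energy_l * energy_r)
    # ILD: energy ratio between the left and right ears, in dB.
    ild_db = 10.0 * np.log10(energy_l / energy_r)
    return np.concatenate([ccf, [ild_db]])

rng = np.random.default_rng(0)
frame = rng.standard_normal(320)            # stand-in for one 20 ms frame
features = ccf_ild_features(frame, np.roll(frame, 3))
print(features.shape)                       # (34,)
```

With `max_lag=16` the first 33 entries cover lags of $\pm 1$ ms at 16 kHz and the last entry is the ILD, which matches the 33 + 1 layout described above.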

Ratemap features [19] were used to model source spectral characteristics. A ratemap is a spectro-temporal representation of auditory nerve firing rate, extracted from the inner hair cell output of each frequency channel by leaky integration and downsampling (see Fig. 2 for examples). For the binaural signals used here, the ratemap features were computed for each ear channel separately and then averaged across the two ears. They were finally log-compressed to form 32D feature vectors. The source ratemap features were exploited together with a target source model and a background source model to estimate a set of localisation weights, in order to bias the system towards localising the target source. In the next section we describe the localisation framework in detail.
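As a rough illustration of the ratemap computation described above, the sketch below applies leaky integration to the inner hair cell output of each channel, downsamples to one value per frame, averages the two ears and log-compresses. The 8 ms time constant, the 10 ms hop, the flooring constant and the function name `log_ratemap` are assumptions made for this example; the source does not state these values.

```python
import numpy as np

def log_ratemap(ihc_left, ihc_right, fs=16000, tau=0.008, hop=0.010):
    """ihc_*: (samples, channels) half-wave rectified filterbank outputs.

    Returns a (frames, channels) log-compressed ratemap averaged over both ears.
    """
    decay = np.exp(-1.0 / (fs * tau))        # leaky integrator coefficient (assumed tau)
    step = int(round(fs * hop))              # samples per output frame (assumed hop)

    def one_ear(ihc):
        ihc = np.asarray(ihc, dtype=float)
        smoothed = np.empty_like(ihc)
        state = np.zeros(ihc.shape[1])
        for i, row in enumerate(ihc):        # first-order leaky integration
            state = decay * state + (1.0 - decay) * row
            smoothed[i] = state
        frames = ihc.shape[0] // step
        # Downsample by averaging the smoothed envelope within each frame.
        return smoothed[:frames * step].reshape(frames, step, -1).mean(axis=1)

    averaged = 0.5 * (one_ear(ihc_left) + one_ear(ihc_right))
    return np.log(averaged + 1e-10)          # log compression with a small floor
```

For 32 Gammatone channels this yields the 32D log-ratemap vectors per frame that the source models operate on.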

III Localisation with top-down source models

In the following we use $P$ for probability, $p$ for density functions and $C$ for cumulative distribution functions.

III-A Localisation model

DNNs were used to model statistical relationships between the binaural features and corresponding azimuth angles [18, 7]. Unlike many other binaural sound localisation systems (e.g. [20, 21]), this study does not assume that sound sources are located only in the frontal hemifield. Instead it considers 72 azimuth angles $\phi$ in the full $360^{\circ}$ azimuth range ($5^{\circ}$ steps).

A separate DNN was trained for each of the 32 frequency bands, largely because this study is concerned with localisation of multiple sources. Although simultaneous sources overlap in time, within a local time frame each frequency band is mostly dominated by a single source (see Bregman’s notion of ‘exclusive allocation’ [1]; note that in reverberant environments this is only a crude approximation). By adopting a separate DNN for each frequency band, it is possible to train the DNNs using single-source data without having to construct multi-source training data.

The DNN model was largely based on a state-of-the-art binaural localisation system described in [7]. In brief, each DNN consists of an input layer, two hidden layers, and an output layer. The input layer contained 34 nodes and each node was assumed to be a Gaussian random variable with zero mean and unit variance. The hidden layers had sigmoid activation functions with 128 hidden nodes. The output layer contained 72 nodes corresponding to the 72 azimuth angles in the full $360^{\circ}$ azimuth range, with a $5^{\circ}$ step. A ‘softmax’ activation function was applied at the output layer.
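The layer sizes given above (34 inputs, two sigmoid hidden layers of 128 nodes, 72 softmax outputs) can be made concrete with a small forward-pass sketch. The weights here are random placeholders, the helper names `init_dnn` and `azimuth_posteriors` are hypothetical, and the input features are assumed to have already been standardised to zero mean and unit variance; training is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_dnn(rng, sizes=(34, 128, 128, 72)):
    """Random placeholder weights; a real system would learn these."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def azimuth_posteriors(features, params):
    """features: (frames, 34) standardised inputs -> (frames, 72) posteriors."""
    h = features
    for w, b in params[:-1]:
        h = sigmoid(h @ w + b)               # two sigmoid hidden layers
    w, b = params[-1]
    return softmax(h @ w + b)                # one posterior per azimuth

rng = np.random.default_rng(0)
params = init_dnn(rng)
post = azimuth_posteriors(rng.standard_normal((5, 34)), params)
print(post.shape)                            # (5, 72)
```

In the system described here one such network would exist per frequency band, giving 32 sets of parameters.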

Given the observed localisation feature vector $\bm{{o}}_{tf}$ (the 34D localisation features described in Section II) at time frame $t$ and frequency band $f$ , the posterior probability of azimuth angle $P(\phi|\bm{{o}}_{tf})$ is estimated using the DNN trained for each frequency band $f$ . The posteriors are then integrated across frequency to produce the probability of azimuth $\phi$ given features $\bm{{o}}_{t}=[\bm{{o}}_{t1}^{\top},\dots,\bm{{o}}_{t32}^{\top}]^{\top}$ of the entire frequency range at time $t$ ,

$$P(\phi|\bm{o}_{t})=\frac{\prod_{f}P(\phi|\bm{o}_{tf})^{\omega_{tf}}}{P(\bm{o}_{t})} \qquad (1)$$

where

$$P(\bm{o}_{t})=\sum_{\phi}\prod_{f}P(\phi|\bm{o}_{tf})^{\omega_{tf}} \qquad (2)$$

Here $\omega_{tf}\in[0,1]$ is introduced for selectively weighting the contribution of binaural cues from each time-frequency bin in order to localise the attended target source in the presence of competing sources. When $\omega_{tf}$ is $0$ the time-frequency bin is excluded from localisation of the target source, because its posterior is raised to the power zero and contributes a factor of one to the product. The next section will discuss in detail how model-based knowledge about sound sources can be used to jointly estimate the weighting factors.

Assuming that the target sound source is stationary, the frame posteriors are further averaged across time to produce a posterior distribution $P(\phi)$ of sound source activity given a segment of signal consisting of $T$ time frames

$$P(\phi|\bm{o}_{1\dots T})=\frac{1}{T}\sum_{t}^{t+T-1}P(\phi|\bm{o}_{t}) \qquad (3)$$

The target location is considered to be the azimuth $\hat{\phi}$ that maximises Eq. (3)

$$\hat{\phi}=\arg\max_{\phi}P(\phi|\bm{o}_{1\dots T}) \qquad (4)$$
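The integration steps of this subsection (weighted product across frequency, normalisation over azimuth, averaging across frames, and selecting the most probable azimuth) can be sketched in a few lines. The function name `localise`, the array layout, the log-domain evaluation and the azimuth grid `np.arange(0, 360, 5)` are assumptions made for illustration.

```python
import numpy as np

def localise(posteriors, weights, azimuths):
    """posteriors: (T, F, A) per-band DNN outputs P(phi | o_tf).
    weights:    (T, F) localisation weights omega_tf in [0, 1].
    azimuths:   (A,) azimuth angle for each output class.
    """
    log_post = np.log(np.clip(posteriors, 1e-12, None))
    # Weighted product over frequency, evaluated as a weighted sum of logs.
    log_frame = np.einsum("tf,tfa->ta", weights, log_post)
    # Normalise over azimuth so each frame posterior sums to one.
    log_frame = log_frame - log_frame.max(axis=1, keepdims=True)
    frame_post = np.exp(log_frame)
    frame_post /= frame_post.sum(axis=1, keepdims=True)
    # Average the frame posteriors across the T frames of the segment.
    segment_post = frame_post.mean(axis=0)
    # The estimate is the azimuth with the largest segment posterior.
    return azimuths[int(np.argmax(segment_post))], segment_post
```

A weight of zero removes a band from the sum of logs entirely, which is how time-frequency bins dominated by the masker stop influencing the estimate.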

III-B Exploiting model-based information

In the presence of multiple sources, the binaural cues computed from spectro-temporal regions dominated by the masking sources will bias the localisation decision towards the location of the maskers, thus leading to localisation errors. To address this issue, we propose an ‘attentional’ mechanism that exploits prior information about the spectral characteristics of the acoustic sources in order to weight the binaural cues towards localising the target source. First, the prior source models are used to estimate the probability of each time-frequency (T-F) bin of the observed signal being dominated by the energy of the target source. These probabilities are then employed during sound localisation to selectively weight the binaural cues: cues extracted from T-F regions dominated by the masking sources are penalised during localisation.

Let $\bm{{y}}_{t}=[y_{t1},\dots,y_{tD}]$ denote the ratemap feature vector (see Section II) extracted from the observed signals at frame $t$ , where $D$ is the feature dimension. For simplification we will omit the dependence on the time index $t$ for the remainder of this section. Similarly, let $\bm{{x}}$ and $\bm{{n}}$ denote the ratemap feature vectors of the underlying target and masker, respectively. In the log-ratemap domain, the relationship between these variables can be approximated as

$$y_{f}\approx\max(x_{f},n_{f}) \qquad (5)$$

where $f$ is the frequency band within $[1\dots D]$ . This is known as the log-max model [22, 23], which is an approximation of the exact interaction model between two sound sources when they are expressed in a log-compressed spectral domain. According to this model, the effect of the masker sound on the target can be modelled as a kind of spectral masking.
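A quick numerical check gives a feel for how tight the log-max approximation is. The sketch below assumes, for illustration only, that the energies of the two sources add in the linear domain, so that the exact log-domain mixture is $\log(e^{x_f}+e^{n_f})$; the function name `logmax_error` and the example values are hypothetical.

```python
import numpy as np

def logmax_error(x, n):
    """Difference between log(exp(x) + exp(n)) and max(x, n), per band."""
    exact = np.logaddexp(x, n)               # log of the summed linear energies
    approx = np.maximum(x, n)                # log-max approximation
    return exact - approx

x = np.array([2.0, 5.0, 1.0, 4.0])           # hypothetical target log-energies
n = np.array([4.0, 1.0, 1.0, 3.5])           # hypothetical masker log-energies
print(logmax_error(x, n))
```

Under this additive assumption the error equals $\log(1+e^{-|x_f-n_f|})$, so it is never negative, never exceeds $\log 2$, and shrinks quickly as one source dominates the band.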

Let $\bm{{\omega}}$ denote the localisation weights to be estimated, and each of its elements $\omega_{f}\in[0,1]$ indicates whether $y_{f}$ is dominated by the energy of the target source $x_{f}$ ( $\omega_{f}\approx 1$ ) or the masker source $n_{f}$ ( $\omega_{f}\approx 0$ ). From a probabilistic point of view, and under the restrictions imposed by the log-max model, $\omega_{f}$ corresponds to the following a posteriori probability

$$\omega_{f}\triangleq P(x_{f}\geq n_{f}|\bm{y})\equiv P(x_{f}=y_{f},n_{f}\leq y_{f}|\bm{y}) \qquad (6)$$

To estimate this probability, prior models for the spectral characteristics of the sound sources are used. Each sound source $s=1,\ldots,\mathcal{S}$ is modelled using a GMM $\lambda_{s}$ with $K_{s}$ mixtures. Because the identity of the target source is known, the probability of the observation given the target source GMM $\lambda_{x}$ is computed simply as

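$$p(\bm{y}|\lambda_{x})=\sum_{k=1}^{K_{x}}P(k|\lambda_{x})\,\mathcal{N}(\bm{y};\bm{\mu}_{x}^{(k)},\bm{\Sigma}_{x}^{(k)}) \qquad (7)$$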

where $\bm{{\mu}}_{x}^{(k)}$ and $\bm{{\Sigma}}_{x}^{(k)}$ are the mean and covariance of the $k^{th}$ component for the target source GMM.

For the masker source we distinguish two cases. First, if there is only one masker source in the acoustic scene and its identity is known a priori, $p(\bm{{y}}|\lambda_{n})$ can be computed as in (7) but using the GMM for the known masker, $\lambda_{n}$ . Second, if the identity of the masker is unknown, we estimate a UBM $\lambda_{ubm}$ using signals from various sources (see Section IV-D for more details about the estimation of the UBM).

Using $p(\bm{{y}}|\lambda_{x})$ and $p(\bm{{y}}|\lambda_{n})$ , Eq. (6) for computing the localisation weights $\omega_{f}$ can be rewritten as follows (see [24, 25, 26]):

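$$\omega_{f}=\sum_{k_{x}}\sum_{k_{n}}\gamma^{(k_{x},k_{n})}P(x_{f}=y_{f},n_{f}\leq y_{f}|k_{x},k_{n},\bm{y}) \qquad (8)$$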

Here, $k_{x}$ and $k_{n}$ denote the index for the mixture components in $\lambda_{x}$ and $\lambda_{n}$ , $P(x_{f}=y_{f},n_{f}\leq y_{f}|k_{x},k_{n},\bm{{y}})$ is the target presence probability (TPP) and $\gamma^{(k_{x},k_{n})}$ is defined as the posterior probability

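$$\gamma^{(k_{x},k_{n})}\triangleq P(k_{x},k_{n}|\bm{y})=\frac{p(\bm{y}|k_{x},k_{n})P(k_{x})P(k_{n})}{\sum_{k_{x}^{\prime},k_{n}^{\prime}}p(\bm{y}|k_{x}^{\prime},k_{n}^{\prime})P(k_{x}^{\prime})P(k_{n}^{\prime})} \qquad (9)$$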

Here we have omitted the dependence on the models $\lambda_{x}$ and $\lambda_{n}$ in order to simplify the notation.

Assuming the frequency bands are conditionally independent given the mixture components, $p(\bm{{y}}|k_{x},k_{n})$ can be expressed as

$$p(\bm{y}|k_{x},k_{n})=\prod_{f=1}^{D}p(y_{f}|k_{x},k_{n}) \qquad (10)$$

with $p(y_{f}|k_{x},k_{n})$ being the following marginal distribution

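$$p(y_{f}|k_{x},k_{n})=\iint p(y_{f}|x_{f},n_{f})\,p(x_{f}|k_{x})\,p(n_{f}|k_{n})\,dx_{f}\,dn_{f} \qquad (11)$$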

The terms $p(x_{f}|k_{x})$ and $p(n_{f}|k_{n})$ of the equation can be directly calculated using the GMMs $\lambda_{x}$ and $\lambda_{n}$ . Using the log-max model, $p(y_{f}|x_{f},n_{f})$ can be expressed as

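$$p(y_{f}|x_{f},n_{f})=\delta(y_{f}-x_{f})\,\mathds{1}_{n_{f}\leq x_{f}}+\delta(y_{f}-n_{f})\,\mathds{1}_{x_{f}<n_{f}} \qquad (12)$$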

where $\delta(\cdot)$ is the Dirac delta function and $\mathds{1}_{\mathcal{C}}$ is an indicator function which equals 1 when the condition $\mathcal{C}$ is true and 0 otherwise. As shown in [26], (11) then becomes

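$$p(y_{f}|k_{x},k_{n})=p_{x}(y_{f}|k_{x})\,C_{n}(y_{f}|k_{n})+p_{n}(y_{f}|k_{n})\,C_{x}(y_{f}|k_{x}) \qquad (13)$$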

where $p_{x}$ and $p_{n}$ are the Gaussian probability functions and $C_{x}$ and $C_{n}$ are the corresponding cumulative distribution functions.

Using Bayes’ rule, the TPP in (8) can be written as

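$$P(x_{f}=y_{f},n_{f}\leq y_{f}|k_{x},k_{n},\bm{y})=\frac{p(x_{f}=y_{f},n_{f}\leq y_{f}|k_{x},k_{n})}{p(x_{f}=y_{f},n_{f}\leq y_{f}|k_{x},k_{n})+p(n_{f}=y_{f},x_{f}<y_{f}|k_{x},k_{n})}=\frac{p_{x}(y_{f}|k_{x})\,C_{n}(y_{f}|k_{n})}{p(y_{f}|k_{x},k_{n})} \qquad (14)$$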

Finally, using (9) and (14), the localisation weight (8) becomes a weighted average of the TPPs in (14) for all possible combinations of $(k_{x},k_{n})$

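$$\omega_{f}=\sum_{k_{x},k_{n}}\frac{\gamma^{(k_{x},k_{n})}\,p_{x}(y_{f}|k_{x})\,C_{n}(y_{f}|k_{n})}{p_{x}(y_{f}|k_{x})\,C_{n}(y_{f}|k_{n})+p_{n}(y_{f}|k_{n})\,C_{x}(y_{f}|k_{x})} \qquad (15)$$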
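The weight computation derived in this subsection can be sketched for diagonal-covariance GMMs as follows. This is an illustrative reading of the derivation, not the authors' code: the function name `localisation_weights`, the `(priors, means, stds)` tuple layout for each GMM, and the tiny constant that guards the logarithm are all assumptions.

```python
import numpy as np
from math import erf

_erf = np.vectorize(erf)

def _pdf(y, mu, sd):
    return np.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def _cdf(y, mu, sd):
    return 0.5 * (1.0 + _erf((y - mu) / (sd * np.sqrt(2.0))))

def localisation_weights(y, gmm_x, gmm_n):
    """y: (D,) log-ratemap frame; gmm_*: (priors (K,), means (K, D), stds (K, D))."""
    pi_x, mu_x, sd_x = gmm_x
    pi_n, mu_n, sd_n = gmm_n
    px, cx = _pdf(y, mu_x, sd_x), _cdf(y, mu_x, sd_x)    # (Kx, D)
    pn, cn = _pdf(y, mu_n, sd_n), _cdf(y, mu_n, sd_n)    # (Kn, D)
    target = px[:, None, :] * cn[None, :, :]             # target dominates the band
    masker = pn[None, :, :] * cx[:, None, :]             # masker dominates the band
    p_yf = target + masker + 1e-300                      # per-band likelihood, as in (13)
    # Posterior over component pairs, as in (9), using band independence.
    log_gamma = (np.log(pi_x)[:, None] + np.log(pi_n)[None, :]
                 + np.log(p_yf).sum(axis=2))
    gamma = np.exp(log_gamma - log_gamma.max())
    gamma /= gamma.sum()
    tpp = target / p_yf                                  # target presence probability (14)
    return np.einsum("xn,xnd->d", gamma, tpp)            # weighted average over pairs
```

As a sanity check, a band whose observed energy matches the target model but is far above the masker model should receive a weight close to one, and the reverse case a weight close to zero.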

III-C Adaptation to power level differences

The mismatch in the power level of acoustic sources between training and testing conditions may cause the source GMM to inaccurately represent the spectral characteristics of the sources in the testing condition. To alleviate this problem, we introduce an algorithm that adapts the source GMMs directly from the noisy signals on-the-fly, before estimating the localisation weights from the same signal. To simplify the discussion, here we only consider adaptation of the GMM for the interfering sources (i.e. we assume that the power level of the target source does not change). However, the extension of the procedure to also adapt the target source GMM is straightforward.

We assume that all the mixture components in the GMM are adapted by the same level offset, as follows:

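$$p(\bm{y}|\lambda_{n})=\sum_{k}P(k|\lambda_{n})\,\mathcal{N}(\bm{y};\bm{\mu}_{n}^{(k)}+\bm{\beta},\bm{\Sigma}_{n}^{(k)}) \qquad (16)$$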

where $\bm{{\beta}}=[\beta_{1},\dots,\beta_{D}]$ is the vector that accounts for the power level differences in each frequency channel. Note that only the means are adapted and variances are kept the same. To determine $\bm{{\beta}}$ directly from the noisy observations, we resort to an iterative approach based on the EM algorithm [27]. Let $\hat{\bm{{\beta}}}$ denote the current estimate of $\bm{{\beta}}$ . The function to be optimised is

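$$Q(\bm{\beta},\hat{\bm{\beta}})=\sum_{t}\sum_{k_{x}}\sum_{k_{n}}\gamma_{t}^{(k_{x},k_{n})}\sum_{f}\log p(y_{tf}|k_{x},k_{n},\beta_{f}) \qquad (17)$$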

where $\gamma_{t}^{(k_{x},k_{n})}$ is the posterior probability in (9), computed using the current estimate $\hat{\bm{{\beta}}}$ . The term $p(y_{tf}|k_{x},k_{n},\beta_{f})$ is given by (13).

By setting the derivative $\partial Q(\bm{{\beta}},\hat{\bm{{\beta}}})/\partial\beta_{f}$ equal to zero, we obtain the following updating equation for $\beta_{f}$ :

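$$\beta_{f}=\frac{1}{T}\sum_{t}\sum_{k_{x}}\sum_{k_{n}}\gamma_{t}^{(k_{x},k_{n})}\left(\overline{n}_{tf}^{(k_{x},k_{n})}-\mu_{nf}^{(k_{n})}\right) \qquad (18)$$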

where $\overline{n}_{tf}^{(k_{x},k_{n})}$ is the estimate of $n_{tf}$ ,

[TABLE]

In the last equation we use the short-hand notation $\alpha_{tf}^{(k_{x},k_{n})}\triangleq P(x_{f}=y_{f},n_{f}\leq y_{f}|k_{x},k_{n},\hat{\beta}_{f},\bm{{y}})$ for the TPP in (14). $\tilde{n}_{tf}^{(k_{n})}$ represents the estimate for $n_{tf}$ under the assumption that $n_{tf}$ is masked by the target source energy $x_{tf}$ . In this case, $n_{tf}$ is upper-bounded by the observation $y_{tf}$ . Thus, $\tilde{n}_{tf}^{(k_{n})}$ corresponds to the following expected value,

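$$\tilde{n}_{tf}^{(k_{n})}=\int_{-\infty}^{y_{tf}}n_{tf}\,p(n_{tf}|k_{n},\hat{\beta}_{f})\,dn_{tf} \qquad (20)$$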

From (18), it can be seen that the updated value for $\beta_{f}$ is a weighted average of the power level differences between the estimates $\overline{n}_{tf}^{(k_{x},k_{n})}$ and the GMM means $\mu_{nf}^{(k_{n})}$ . This equation is iteratively applied until $\beta_{f}$ converges.
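Two pieces of this update lend themselves to a short sketch. For a Gaussian masker component with mean $m$ and standard deviation $\sigma$, the integral defining $\tilde{n}_{tf}^{(k_{n})}$ has the closed form $m\,\Phi(z)-\sigma\,\varphi(z)$ with $z=(y_{tf}-m)/\sigma$, where $\Phi$ and $\varphi$ are the standard normal distribution and density functions; dividing by $\Phi(z)$ would give the mean of the Gaussian truncated at $y_{tf}$. The code below evaluates that closed form and the averaging step in (18). The function names and array shapes are assumptions, `mean` is assumed to already include the current offset $\hat{\beta}_{f}$, and the estimates $\overline{n}_{tf}^{(k_{x},k_{n})}$ are taken as a given input rather than recomputed here.

```python
import numpy as np
from math import erf

_erf = np.vectorize(erf)

def partial_expectation(y, mean, std):
    """Closed form of the integral of n * N(n; mean, std^2) from -inf to y."""
    z = (np.asarray(y, dtype=float) - mean) / std
    cdf = 0.5 * (1.0 + _erf(z / np.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return mean * cdf - std * pdf

def update_beta(gamma, n_bar, mu_n):
    """One application of the update in (18).

    gamma: (T, Kx, Kn) component-pair posteriors for each frame.
    n_bar: (T, Kx, Kn, D) masker estimates, supplied by the caller.
    mu_n:  (Kn, D) unadapted masker GMM means.
    """
    diff = n_bar - mu_n[None, None, :, :]
    # Posterior-weighted average of the level differences over frames and pairs.
    return np.einsum("txn,txnd->d", gamma, diff) / gamma.shape[0]
```

In an iterative scheme these two steps would alternate with recomputing the posteriors under the new offset until the change in $\beta_{f}$ falls below a chosen tolerance.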

IV Evaluation

To evaluate the proposed binaural localisation system, a virtual acoustic environment was created which contained a target speech source and a masking source selected from a number of different noise types.

IV-A Binaural simulations

Binaural audio signals were created by convolving monaural signals with head related impulse responses (HRIRs) or binaural room impulse responses (BRIRs). An HRIR catalog based on the Knowles Electronic Manikin for Acoustic Research (KEMAR) dummy head [28] was used for simulating the anechoic training signals. The evaluation stage used the Surrey BRIR database [29] to simulate reverberant room conditions. The Surrey database was captured using a Cortex head and torso simulator (HATS) and includes four room conditions with various amounts of reverberation (see Table I for the reverberation time (T60) and the direct-to-reverberant ratio (DRR) of each room). Binaural mixtures of two simultaneous sources were created by convolving each source signal with BRIRs separately before adding them together in each of the two binaural channels.
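The mixing procedure described above (convolve each source with its own two-channel impulse response, then add the results per ear) can be sketched as follows. The function names and the `(length, 2)` layout of the impulse responses are assumptions; loading the actual HRIR or BRIR data is outside the scope of this example.

```python
import numpy as np

def spatialise(mono, ir):
    """Convolve a mono signal (N,) with a two-channel impulse response (L, 2)."""
    return np.stack([np.convolve(mono, ir[:, ch]) for ch in range(2)], axis=1)

def mix_binaural(target, target_ir, masker, masker_ir):
    """Spatialise each source separately, then add them in each ear channel."""
    t = spatialise(np.asarray(target, dtype=float), np.asarray(target_ir, dtype=float))
    m = spatialise(np.asarray(masker, dtype=float), np.asarray(masker_ir, dtype=float))
    n = min(len(t), len(m))                  # truncate to the shorter signal
    return t[:n] + m[:n]
```

Choosing impulse responses measured at different azimuths for the target and the masker places the two sources at different positions around the virtual listener.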

IV-B Target and masker signals

The target source signals were drawn from the GRID speech corpus [30]. Each GRID sentence is approximately 2 s long and has a six-word form (e.g., “lay red at G 9 now”). A male talker and a female talker were selected as the target speech sources. All target signals were sampled at 16 kHz.

In our previous study [15], all noise types were used to estimate the UBM. In order to investigate how well the system is able to deal with unseen noise types, two sets of noise signals were used as the masker source. Noise Set A consists of six masker types with various degrees of spectro-temporal complexity, and was used to estimate the UBM. Noise Set B consists of two additional masker types that form the “unseen noise” set, i.e. they were excluded from the training of the UBM. The details of both noise sets are summarised in Tables II and III. In all cases, noise signals were 90 seconds long and were sampled at 16 kHz. Fig. 2 shows example ratemap representations of these noise signals.

IV-C Localisation DNN training

As shown in previous studies [10, 21, 32], multi-conditional training (MCT) can increase the robustness of localisation systems in reverberant multi-source conditions. To train the DNNs in this study, an MCT dataset was created by mixing the training signals at a specified azimuth with diffuse noise as described in [32]. The diffuse noise consisted of 72 uncorrelated, white Gaussian noise sources that were placed across the full $360^{\circ}$ azimuth range in steps of $5^{\circ}$. Both the target signals and the diffuse noise were created using the same anechoic HRIR catalog recorded using a KEMAR dummy head [28].

Following [7], the localisation training data consisted of speech sentences from the TIMIT database [31]. A set of 30 speech sentences was randomly selected for each of the 72 azimuth locations, equally distributed in the $360^{\circ}$ range. For each spatialised training sentence, the anechoic signal was corrupted with diffuse noise at three signal-to-noise ratios (SNRs): 20, 10 and 0 dB. The corresponding binaural features (ITDs, CCF, and ILDs) were then extracted. Only those features for which the a priori SNR between the target and the diffuse noise exceeded $-5$ dB were used for training. This negative SNR criterion ensured that the multi-modal clusters in the binaural feature space at higher frequencies, which are caused by periodic ambiguities in the cross-correlation function, were properly captured. The localisation DNNs were not retrained for the target source drawn from the GRID speech corpus.
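
The a priori SNR criterion can be illustrated with a short sketch. It assumes that the selection is made per time-frequency unit from the energies of the separately spatialised target and diffuse noise; the granularity, the function names, and the toy energy values are assumptions made for illustration only.

```python
import math


def a_priori_snr_db(target_energy, noise_energy, floor=1e-12):
    """Local SNR in dB between pre-mixed target and noise energies."""
    return 10.0 * math.log10(max(target_energy, floor) / max(noise_energy, floor))


def select_training_units(target_energy, noise_energy, threshold_db=-5.0):
    """Return (frame, channel) indices whose a priori SNR exceeds the threshold.

    Both inputs are lists of frames, each frame a list of per-channel energies.
    """
    selected = []
    for t, (tgt_frame, noise_frame) in enumerate(zip(target_energy, noise_energy)):
        for f, (e_t, e_n) in enumerate(zip(tgt_frame, noise_frame)):
            if a_priori_snr_db(e_t, e_n) > threshold_db:
                selected.append((t, f))
    return selected


# Toy energies for 2 frames x 3 frequency channels (hypothetical values).
target_energy = [[1.0, 0.1, 4.0], [0.5, 2.0, 0.01]]
noise_energy = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
kept = select_training_units(target_energy, noise_energy)
```

With a threshold below 0 dB, units in which the noise is slightly stronger than the target are still retained, so the training data include features that are partly corrupted by noise.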

IV-D Source model training

This study is concerned with binaural sound localisation. In order to learn a source spectral model in a binaural setting, binaural source signals were created by convolving monaural signals with the Room A BRIR from the Surrey database [29] for azimuths in the range $[-90^{\circ}, 90^{\circ}]$ in $5^{\circ}$ steps. Ratemap features were first extracted from each of the binaural channels and then averaged across the two channels. The ratemap features for all the azimuths were used to train source models using the EM algorithm as described in Section III-B.
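
The feature preparation for source-model training can be sketched as below. Only the channel averaging and the pooling across azimuths are shown; ratemap extraction and EM training are omitted, and the data layout (a dictionary mapping azimuth to a left/right ratemap pair) is a hypothetical choice.

```python
def average_channels(ratemap_left, ratemap_right):
    """Average two ratemaps (frames x frequency channels) element-wise."""
    return [
        [(l + r) / 2.0 for l, r in zip(frame_l, frame_r)]
        for frame_l, frame_r in zip(ratemap_left, ratemap_right)
    ]


def pool_training_frames(ratemaps_by_azimuth):
    """Pool channel-averaged frames from every azimuth into one training set."""
    pooled = []
    for azimuth in sorted(ratemaps_by_azimuth):
        left, right = ratemaps_by_azimuth[azimuth]
        pooled.extend(average_channels(left, right))
    return pooled


# Azimuth grid for source-model training: -90 to 90 degrees in 5-degree steps.
azimuths = list(range(-90, 91, 5))

# Hypothetical ratemaps: one frame with two frequency channels per azimuth.
ratemaps_by_azimuth = {az: ([[1.0, 2.0]], [[3.0, 4.0]]) for az in azimuths}
frames = pool_training_frames(ratemaps_by_azimuth)
```

Pooling frames from all 37 azimuths means the resulting spectral model is not tied to any single source direction.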

The identity of the attended target source is assumed known a priori in this study. For the target speech source, 90 seconds of training data were used to train a 16-mixture target GMM.

For each noise source in Noise Set A, 4/5 of the 90-second signal was used as the training data and the remaining 1/5 was used as the test data. When the identity of the masker source is unavailable, the system uses a universal background model and performs adaptation on-the-fly during testing in order to better match the spectral profile of the masker source in a mixed test signal. All the training data from Noise Set A were used to train a 16-mixture UBM. Noises in Set B were excluded in this process.

If the identity of the masker source is also available a priori, the system can directly exploit the corresponding masker model. The number of mixtures for each GMM was selected based on its spectro-temporal complexity and is listed in Table IV.

IV-E Experimental setup

The evaluation set contained 50 GRID sentences from the target speech source which were not included in training. For each target signal, a masker signal that matched the length of the target was randomly selected from the test set. The target source was then mixed with the masker signal in a binaural setting. As shown in Fig. 3, the target source varied in azimuth within the range of $[-90^{\circ}, 90^{\circ}]$ in $10^{\circ}$ steps. However, this knowledge was not available to the systems and the full $360^{\circ}$ azimuth range could be reported (and hence front-back errors could occur, as shown in Fig. 3).

The azimuth of the masker was randomly selected each time from the same azimuth range, while ensuring an angular distance of at least $10^{\circ}$ between the two competing sources. Source locations were limited to this azimuth range because the Surrey database only includes azimuths in the frontal hemifield. However, our localisation system was not provided in any case with information that the azimuth of the source lay within this range; it was free to report the azimuth within the full $360^{\circ}$ range. Both target and masker signals were normalised by their root mean square (RMS) amplitudes prior to spatialisation at two target-to-masker ratios (TMRs): 0 dB and -6 dB.
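
A minimal sketch of the level normalisation and masker placement is given below. Two details are assumptions rather than statements from the text: the TMR is realised by scaling the masker relative to a unit-RMS target, and the masker azimuth is drawn from the same $10^{\circ}$ grid as the target. All function names are hypothetical.

```python
import math
import random


def rms(signal):
    """Root mean square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def normalise_rms(signal):
    """Scale a signal to unit RMS amplitude."""
    level = rms(signal)
    return [x / level for x in signal]


def scale_for_tmr(target, masker, tmr_db):
    """Return a unit-RMS target and a masker scaled to give the requested TMR."""
    target = normalise_rms(target)
    masker = normalise_rms(masker)
    gain = 10.0 ** (-tmr_db / 20.0)  # masker gain relative to the target
    return target, [gain * x for x in masker]


def pick_masker_azimuth(target_azimuth, rng, min_separation=10):
    """Draw a masker azimuth at least min_separation degrees from the target."""
    candidates = [
        az for az in range(-90, 91, 10)
        if abs(az - target_azimuth) >= min_separation
    ]
    return rng.choice(candidates)


target, masker = scale_for_tmr([0.5, -0.5, 0.5, -0.5], [2.0, 0.0, -2.0, 0.0], -6.0)
achieved_tmr_db = 20.0 * math.log10(rms(target) / rms(masker))
masker_azimuth = pick_masker_azimuth(30, random.Random(0))
```

Because the levels are set before spatialisation, the TMR refers to the monaural source signals; the level at each ear after convolution with the BRIRs will generally differ.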

The baseline system was a state-of-the-art binaural localisation system using DNNs [7], with no model-based knowledge about the sources. The proposed localisation system exploiting model-based information employed the same localisation DNNs, but was given prior knowledge of the target source so that it could be “attended to” for more accurate localisation. The system was evaluated in three scenarios:

IV-E1 The identity of the masker source was assumed to be available a priori

The corresponding masker source GMM was used as the background model in order to estimate the set of localisation weights $\omega_{tf}$ in Eq. (1). Noise Set A was used for evaluating this scenario.

IV-E2 The masker source identity was unavailable but the source was used for training the UBM

In the second scenario, the UBM estimated from all the noise types in Noise Set A was used as the background model.

IV-E3 The masker source identity was unavailable and the source was not used for training the UBM

As in scenario 2, the UBM estimated from all the noise types in Noise Set A was used as the background model. However, to test how well the system could generalise to unseen noise types, Noise Set B, which was not used for estimating the UBM, was used for evaluation.

In all scenarios, the background model (either the correct masker model or the UBM) was employed with or without adaptation.

For all the evaluated systems, the number of competing sources (two in this study) was assumed to be known a priori. Therefore each system reported two estimated source azimuths. The localisation performance was measured for the target source only by comparing true target source azimuths with the estimated azimuths. The target localisation error rate was measured by counting the number of sentences for which one of the azimuth estimates was outside a predefined grace boundary of $\pm 5^{\circ}$ with respect to the true target azimuths.
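
The scoring can be sketched as follows. The description above leaves the exact rule open, so this sketch adopts one plausible reading as an assumption: a sentence counts as an error when neither of the two reported azimuths lies within the grace boundary of the true target azimuth. The function names and the example estimates are hypothetical.

```python
def angular_distance(a, b):
    """Smallest absolute difference between two azimuths in degrees."""
    diff = abs(a - b) % 360
    return min(diff, 360 - diff)


def target_localised(true_azimuth, estimates, grace=5):
    """True if any reported azimuth lies within the grace boundary of the target."""
    return any(angular_distance(true_azimuth, est) <= grace for est in estimates)


def target_error_rate(true_azimuths, estimates_per_sentence, grace=5):
    """Fraction of sentences for which no estimate falls within the grace boundary."""
    errors = sum(
        not target_localised(true, est, grace)
        for true, est in zip(true_azimuths, estimates_per_sentence)
    )
    return errors / len(true_azimuths)


# Hypothetical results for four sentences, two reported azimuths each.
true_azimuths = [30, -60, 0, 90]
estimates = [(30, -20), (-50, 10), (5, 70), (100, 270)]
rate = target_error_rate(true_azimuths, estimates)
```

The circular distance matters because estimates may fall anywhere in the full $360^{\circ}$ range, for example in the rear hemifield when a front-back confusion occurs.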

V Results and Discussion

The target localisation error rates of various binaural localisation systems are shown in Figs. 4 and 5 for the 0 dB and the -6 dB TMR cases, respectively. The results are averaged over the male and female speech source cases.

For each of the known masker types (top 6 maskers), the results using both the UBM and the respective masker model are shown, while for the unknown masker types in Set B only the results using the UBM are shown. Furthermore, since in this study the system did not assume that the sound sources were located in the frontal hemifield, the front-back error rates are shown as white bars in each figure.

The performance of the baseline system, which uses no model-based information for localisation, varied greatly across different masker conditions. The results show that the baseline system performed worse in masker conditions such as ‘phone’, ‘alarm’, and N-talker babble. While it is straightforward to understand why localisation of the target speech is less reliable in N-talker babble, as the masker spectrum overlaps that of the target speech, the poor performance in the ‘phone’ and ‘alarm’ conditions requires further explanation. This is likely due to two reasons. Firstly, these sounds are narrowband as shown in Fig. 2, and the target speech was completely masked at the frequencies where the masker’s energy dominated. Hence, in these frequency bands only small glimpses were available to localise the target source and the baseline system tended to report high probabilities for the masker azimuths. Secondly, most of the energy for the narrowband sounds is located at high frequencies. The baseline system employed cross-correlation features and ILDs as localisation features, and these features are known to be more reliable at high frequencies [7]. In particular, the ILDs are more pronounced at high frequencies due to the size of the head compared to the wavelength of incoming sounds [17]. When these frequency bands were dominated by the masker, the system was more likely to report the location of the masker, especially when the global TMR was negative. Detailed analysis of the results shows that most localisation errors were due to incorrectly reporting the masker locations.

In contrast, the target localisation error rates were lower when the masker source dominated low frequencies. For example, in the ‘piano’ and ‘drums’ conditions, the target localisation error rates were below 10% at the TMR of 0 dB, and even at the TMR of -6 dB, the localisation error rates were still below 5% in the ‘piano’ condition. This is largely because the maskers dominated frequency regions that were less reliable for localisation with the DNN system, and thus the performance remained robust in the presence of the masker.

When model-based source knowledge was employed (in Figs. 4 and 5, UBM or Masker), the localisation system was able to identify the spectro-temporal regions dominated by the target speech. Therefore the system was more likely to report the location of the target source by weighting those regions more heavily. Figs. 4 and 5 show that the results improved significantly when model-based information was used for localisation, particularly for narrowband sounds such as ‘alarm’.
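
The effect of the weighting can be illustrated with a toy example. Eq. (1) is not reproduced in this section, so the combination rule used here (a weighted sum of log posteriors over time-frequency units, followed by an argmax) is an assumed form chosen for illustration, and the posterior and weight values are hypothetical.

```python
import math


def weighted_azimuth_scores(posteriors, weights, floor=1e-12):
    """Accumulate weighted log posteriors over all time-frequency units.

    posteriors[t][f] is a list of azimuth probabilities for frame t, channel f;
    weights[t][f] is the localisation weight for the same unit.
    """
    n_azimuths = len(posteriors[0][0])
    scores = [0.0] * n_azimuths
    for frame_post, frame_w in zip(posteriors, weights):
        for unit_post, w in zip(frame_post, frame_w):
            for k, p in enumerate(unit_post):
                scores[k] += w * math.log(max(p, floor))
    return scores


def best_azimuth(azimuths, scores):
    """Azimuth with the highest accumulated score."""
    return max(zip(scores, azimuths))[1]


azimuths = [-30, 0, 30]  # hypothetical coarse grid: masker at -30, target at 30
# One frame, two channels: channel 0 is masker-dominated, channel 1 target-dominated.
posteriors = [[[0.9, 0.05, 0.05], [0.2, 0.1, 0.7]]]
uniform = best_azimuth(azimuths, weighted_azimuth_scores(posteriors, [[1.0, 1.0]]))
weighted = best_azimuth(azimuths, weighted_azimuth_scores(posteriors, [[0.1, 0.9]]))
```

With equal weights the confident masker-dominated channel wins and the masker azimuth is reported; down-weighting that channel shifts the decision to the target azimuth, which mirrors the behaviour described for narrowband maskers.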

V-A Effect of using the masker model vs. the UBM

For each masker source in Noise Set A, a corresponding masker model was created. If the masker type was assumed to be known, the correct masker model was used as the background model; otherwise the UBM was used. Across different masker conditions, the use of a background model greatly improved the target localisation accuracy. Comparing the results in Fig. 4 for the UBM and the correct masker model without adaptation, one can see that at 0 dB TMR the system using the UBM performed comparably to the system using the correct masker model. This is largely because the UBM was trained using signals of the masker types from Noise Set A. The GMM was effective in capturing the spectral profiles of all the maskers, and this helped localise the target source using the proposed framework.

As can be seen in Fig. 5, at -6 dB TMR, the localisation error reduction using the correct masker model becomes larger. This is expected, as a more detailed masker model will produce more reliable localisation weights than the general UBM. However, the use of the UBM minimises the assumptions made about the active masker sources. Such a system is potentially more suitable for an attention-driven model of sound localisation, in which the attended target source may be switched and the localisation weights can be dynamically recomputed in order to localise the newly attended source.

V-B Effect of unseen masker types

In Noise Set A, when the identity of the masker was assumed unknown, the UBM was used as the background model. However, the parameters of the UBM were estimated using all the masker types from Noise Set A. To evaluate how well the system could generalise to unseen masker types, Noise Set B was excluded from training the UBM. Comparing the UBM results for Noise Sets A and B, it is clear that the UBM did not improve the target localisation accuracy for Noise Set B as much as for Noise Set A. This is especially the case for narrowband maskers. In the unseen ‘phone’ condition, the average target localisation error rate was reduced from 41% to 31% at the TMR of 0 dB, and from 55% to 42% at the TMR of -6 dB. In contrast, in the ‘alarm’ condition, the localisation error rate was reduced from 14% to 2% at 0 dB TMR and from 30% to 12% at -6 dB TMR. The error reduction was more similar between the 16-talker and 32-talker conditions, probably because the two speech babbles are more similar in spectral shape. This suggests that the UBM worked less effectively for masker conditions that were not used for training the UBM, but that it is still beneficial to use the UBM in unseen masker conditions.
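
Expressed as relative reductions, the figures quoted above correspond to roughly 24% for ‘phone’ at both TMRs, compared with about 86% and 60% for ‘alarm’ at 0 dB and -6 dB respectively. The short calculation below reproduces this arithmetic; the dictionary layout is only a convenience.

```python
def relative_reduction(before, after):
    """Relative error reduction as a fraction of the initial error rate."""
    return (before - after) / before


# Error rates (%) quoted in the text: (before, after) for each condition and TMR.
reported = {
    ("phone", "0 dB"): (41, 31),
    ("phone", "-6 dB"): (55, 42),
    ("alarm", "0 dB"): (14, 2),
    ("alarm", "-6 dB"): (30, 12),
}
reductions = {key: relative_reduction(*rates) for key, rates in reported.items()}
```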

V-C Effect of background-model adaptation

As demonstrated by the results, background-model adaptation contributed substantially to reducing the target localisation error rates across various conditions. At the TMR of 0 dB the energy level mismatch between the trained background models and the test signals was minimal, but the adaptation process was still beneficial for Noise Set A as it can accommodate the level differences of individual test signals. The improvement was larger when the background model was the correct masker model (as opposed to the UBM), particularly for masker types that are difficult to model, such as ‘engine’ (many unpredictable events) and ‘16-talker babble’ (less stationary). For Noise Set B, the benefit of adaptation is more apparent, especially for the narrowband ‘phone’ condition. This suggests that the adaptation process was also able to adapt the model parameters to better match the spectral profile of an unseen masker.

Background-model adaptation benefitted the systems more at the TMR of -6 dB, where there is a mismatch between the model level and the test signal level. As in the 0 dB TMR case, the improvement appeared larger when the correct masker model was used as the background model. This is likely because with the correct masker model only the energy level needs to be adapted, whereas with the UBM the spectral profile also needs to be adapted. As the model parameters were adapted on-the-fly using a short test signal (less than 2 s long) that was mixed with a masker sound, it is more difficult to re-estimate the model parameters correctly.

To illustrate the effect of various stages in the proposed system, Fig. 6 shows an example of the estimated localisation weights. Here the target speech was mixed with an alarm sound at a TMR of -6 dB. The UBM was used when estimating the localisation weights. The ‘oracle’ mask shows the spectro-temporal regions dominated by the target speech using a priori information of the pre-mixed signals. The localisation weights estimated without adaptation bear some resemblance to the oracle mask, but incorrectly give more weight to the high frequency regions above 2 kHz that are dominated by the masker. With adaptation these masker-dominated regions are given less weight, and the estimated mask is closer to the oracle mask.
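
An oracle mask of this kind can be computed from the pre-mixed signals as sketched below. The criterion used here (target energy greater than masker energy in each time-frequency unit) and the agreement measure are assumptions for illustration, and all ratemap and weight values are hypothetical rather than taken from Fig. 6.

```python
def oracle_mask(target_ratemap, masker_ratemap):
    """Binary mask marking time-frequency units where the target dominates.

    Inputs are pre-mixed ratemaps (frames x frequency channels) of energy values.
    """
    return [
        [1 if e_t > e_m else 0 for e_t, e_m in zip(frame_t, frame_m)]
        for frame_t, frame_m in zip(target_ratemap, masker_ratemap)
    ]


def mask_agreement(estimated_weights, mask, threshold=0.5):
    """Fraction of units where thresholded weights match the oracle mask."""
    matches = total = 0
    for frame_w, frame_m in zip(estimated_weights, mask):
        for w, m in zip(frame_w, frame_m):
            matches += int((w > threshold) == bool(m))
            total += 1
    return matches / total


# Hypothetical 2 frames x 3 channels; channel 2 is dominated by the masker.
target = [[4.0, 3.0, 0.2], [5.0, 0.5, 0.1]]
masker = [[1.0, 1.0, 6.0], [1.0, 1.0, 7.0]]
mask = oracle_mask(target, masker)
without_adaptation = [[0.9, 0.8, 0.7], [0.9, 0.3, 0.6]]
with_adaptation = [[0.9, 0.8, 0.2], [0.9, 0.3, 0.1]]
```

In this toy case the unadapted weights wrongly favour the masker-dominated channel, so they agree with the oracle mask in only four of six units, whereas the adapted weights agree in all six.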

VI Conclusions

This paper has presented a novel computational framework for binaural sound localisation that combines data-driven and model-based information flow. By jointly exploiting model-based knowledge about the source spectral characteristics in the acoustic scene, the system is able to selectively weight binaural cues in order to more reliably localise the attended source. To address the mismatch between the training and testing conditions, a model adaptation process is employed on-the-fly which re-estimates the background model parameters directly from the noisy observations in an iterative manner. Evaluation using masker sources with varying spectro-temporal complexity, including masker types that are not seen during the training stage, showed that by exploiting source models in this way, sound localisation performance can be improved substantially under conditions where multiple sources and room reverberation are present.

One of the advantages of the proposed system is that information fusion across frequency bands is done after the DNNs estimate the azimuth posteriors. The late fusion of information allows use of single-source data for training, as otherwise the amount of required training data would increase exponentially with the number of sources. In this way, the framework is also able to generalise to localisation of other sounds and is not tailored to the type of sound used for training.

The current evaluation involved only two sound sources (a target source and an interfering source). In general, the proposed approach is able to generalise to the scenario where there is more than one interfering source. In this case, one would treat all interfering sources as a single masker and use the universal background model to model the combination of all interfering sources. The complexity of the framework stays the same in this case.

The proposed sound localisation framework could be combined with source identification in order to estimate the identity of the target source that the system ‘attends’ to. Such an attention-driven model could be used to localise an attended source whose identity is not available a priori, e.g. a talker that speaks a keyword in an acoustic mixture. Source localisation and source identification could then interact in an ongoing iterative process. In addition, the framework described here could be integrated with an approach that uses source models to ‘perceptually restore’ parts of the target sound that have been masked [33]. Another future direction is the extension to cross-modal control. The proposed system is a general framework through which other modalities could also be incorporated, such as a vision system on a mobile robot.

Bibliography

[1] A. Bregman, Auditory Scene Analysis. Cambridge, MA: MIT Press, 1990.
[2] D. L. Wang and G. J. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms and Applications. Wiley/IEEE Press, 2006.
[3] G. C. Spence and J. Driver, “Covert spatial orienting in audition: exogenous and endogenous mechanisms,” Journal of Experimental Psychology, vol. 20, pp. 555–574, 1994.
[4] A. Bronkhorst, “The cocktail-party problem revisited: early processing and selection of multi-talker speech,” Attention, Perception, and Psychophysics, vol. 77, no. 5, pp. 1465–1487, 2015.
[5] B. H. Gaese and H. Wagner, “Precognitive and cognitive elements in sound localization,” Zoology, vol. 105, pp. 329–339, 2002.
[6] D. E. Winkowski and E. I. Knudsen, “Top-down gain control of the auditory space map by gaze control circuitry in the barn owl,” Nature, vol. 439, no. 7074, pp. 336–339, 2006.
[7] N. Ma, T. May, and G. J. Brown, “Exploiting deep neural networks and head movements for robust binaural localisation of multiple sources in reverberant environments,” IEEE Trans. Audio, Speech, Lang. Process., vol. 25, no. 12, pp. 2444–2453, 2017.
[8] H. Christensen, N. Ma, S. Wrigley, and J. Barker, “A speech fragment approach to localising multiple speakers in reverberant environments,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Taipei, 2009, pp. 4593–4596.