Time Domain Audio Visual Speech Separation

Jian Wu; Yong Xu; Shi-Xiong Zhang; Lian-Wu Chen; Meng Yu; Lei Xie; Dong Yu

arXiv:1904.03760·eess.AS·September 24, 2019

TL;DR

This paper presents a novel time-domain audio-visual speech separation architecture that extends previous models to improve target speaker extraction by integrating lip video data, achieving significant Si-SNR improvements.

Contribution

It introduces a new multi-modal time-domain architecture for speech separation that combines audio and visual cues, extending classical methods from frequency to time domain.

Findings

01 Achieves over 3dB Si-SNR improvement in two-speaker cases.

02 Achieves over 4dB Si-SNR improvement in three-speaker cases.

03 Outperforms audio-only and frequency-domain audio-visual models.

Abstract

Audio-visual multi-modal modeling has been demonstrated to be effective in many speech-related tasks, such as speech recognition and speech enhancement. This paper introduces a new time-domain audio-visual architecture for target speaker extraction from monaural mixtures. The architecture generalizes the previous TasNet (time-domain speech separation network) to enable multi-modal learning and, at the same time, extends classical audio-visual speech separation from the frequency domain to the time domain. The main components of the proposed architecture include an audio encoder, a video encoder that extracts lip embeddings from video streams, a multi-modal separation network and an audio decoder. Experiments on simulated mixtures based on the recently released LRS2 dataset show that our method can bring 3dB+ and 4dB+ Si-SNR improvements on two- and three-speaker cases respectively, compared to…

Tables

Table 1: Number of utterances used to simulate each dataset for 2- and 3-speaker mixtures

Dataset	#Utterances	Source set	#Simulated mixtures
training	16445	training	40k
validation	3000	training	5k
test	740	validation+test	3k

Table 2: Results of oracle TF-mask and audio-only methods on the 2- and 3-speaker test sets

Method	Configuration	#Param	Si-SNR (dB), 2spkr	Si-SNR (dB), 3spkr
Mixed	-	-	0.01	-3.33
Oracle PSM	512/256/hann	-	14.43	11.54
Oracle IRM	512/256/hann	-	11.33	8.18
uPIT-BLSTM (https://github.com/funcwj/setk/tree/master/egs/upit)	3 $\times$ 600	$\sim$22M	7.13	1.20
Conv-TasNet (https://github.com/funcwj/conv-tas-net)	gLN	$\sim$13M	10.58	5.67

Table 3: Results of the model with different lip embeddings

Dataset	#Param	Embeddings	Si-SNR (dB), BN	Si-SNR (dB), gLN
2spkr	10.09M	word	12.53	13.01
2spkr	10.09M	CI-phone	12.76	13.04
3spkr	10.09M	word	7.52	8.25
3spkr	10.09M	CI-phone	8.36	9.41

Table 4: Results of tuning the number of fusion blocks

Dataset	$N_{f}$	Embeddings	Si-SNR (dB)
2spkr	2	CI-phone	13.04
2spkr	3	CI-phone	13.33
3spkr	2	CI-phone	9.41
3spkr	3	CI-phone	9.50
3spkr	4	CI-phone	9.30
3spkr	2	CD-phone	9.23
3spkr	3	CD-phone	9.47
3spkr	4	CD-phone	9.06

Table 5: Results of frequency-domain audio-visual models

Dataset	Embeddings	#Param	Phase	Si-SNR (dB)
2spkr	CI-phone	10.03M	mix	10.36
2spkr	CI-phone	10.03M	oracle	13.87
3spkr	CI-phone	10.03M	mix	6.16
3spkr	CI-phone	10.03M	oracle	9.65

Table 6: Results of models with multi-speaker training

Training data	$N_{f}$	Norm	Si-SNR (dB), 2spkr	Si-SNR (dB), 3spkr
2spkr	2	BN	12.76	-
3spkr	2	BN	11.52	8.36
{2,3}spkr	2	BN	13.00	8.38
2spkr	2	gLN	13.04	-
3spkr	2	gLN	12.74	9.41
{2,3}spkr	2	gLN	13.65	9.40
2spkr	3	gLN	13.33	-
3spkr	3	gLN	12.86	9.50
{2,3}spkr	3	gLN	14.02	9.92

Equations

$$\text{Si-SNR}=20\log_{10}\frac{\|\alpha\cdot\mathbf{s}_{t}\|}{\|\mathbf{s}_{e}-\alpha\cdot\mathbf{s}_{t}\|}.$$

$$\alpha=\mathbf{s}_{e}^{T}\mathbf{s}_{t}/\mathbf{s}_{t}^{T}\mathbf{s}_{t}.$$

$$\mathbf{s}_{i}=\text{Decoder}(\mathbf{w}_{x}\odot\mathbf{m}_{i}),$$

$\mathbf{w}_{x}$

$[\mathbf{m}_{0},\dots,\mathbf{m}_{N-1}]$

$\mathbf{v}_{e}$

$\mathbf{m}_{t}$

$$\mathbf{s}_{t}=\text{Decoder}(\mathbf{w}_{x}\odot\mathbf{m}_{t}).$$

$\text{Encoder}^{a}(\mathbf{x})$

$\text{Decoder}(\mathbf{x})$

$$\mathbf{a}_{e}=\underbrace{\text{Conv}_{s}(\dots\text{Conv}_{s}(\mathbf{w}_{x}))}_{N_{a}}$$

$$\mathbf{f}=\mathcal{P}([\mathbf{a}_{e};\text{Upsample}(\mathbf{v}_{e})])$$

$$\mathbf{m}_{t}=\sigma(\underbrace{\text{Conv}_{s}(\dots\text{Conv}_{s}(\mathbf{f}))}_{N_{f}}).$$

$$\mathcal{L}=\|\mathbf{s}_{t}\odot\max\{\cos(\angle\mathbf{s}_{m}-\angle\mathbf{s}_{t}),0\}-\mathbf{s}_{m}\odot\mathbf{m}_{t}\|_{2}^{2},$$

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Adaptive Filtering Techniques

Full text

Abstract

Audio-visual multi-modal modeling has been demonstrated to be effective in many speech-related tasks, such as speech recognition and speech enhancement. This paper introduces a new time-domain audio-visual architecture for target speaker extraction from monaural mixtures. The architecture generalizes the previous TasNet (time-domain speech separation network) to enable multi-modal learning and, at the same time, extends classical audio-visual speech separation from the frequency domain to the time domain. The main components of the proposed architecture include an audio encoder, a video encoder that extracts lip embeddings from video streams, a multi-modal separation network and an audio decoder. Experiments on simulated mixtures based on the recently released LRS2 dataset show that our method can bring 3dB+ and 4dB+ Si-SNR improvements on two- and three-speaker cases respectively, compared to audio-only TasNet and frequency-domain audio-visual networks.

Index Terms: audio-visual speech separation, speech enhancement, TasNet, multi-modal learning

1 Introduction

The goal of speech separation is to separate each source speaker from the mixture signal. Although it has been studied for many years, speech separation remains a difficult problem, especially in noisy and reverberant environments. Several audio-only speech separation methods were recently proposed, such as uPIT [1], DPCL [2, 3], DANet [4] and TasNet [5]. However, in these approaches, the number of target speakers has to be known a priori and is assumed to be the same during training and testing. In addition, the separated outputs of these systems cannot be associated with specific speakers, which greatly limits their application scenarios.

If we can extract some target-speaker-dependent features, the task of speech separation becomes a target speaker extraction problem. This has several clear advantages over blind separation approaches. First, as the model extracts only one target speaker at a time from the mixture, prior knowledge of the number of speakers is no longer needed. Second, as the target speaker features are given to the model, the label permutation problem is avoided. These properties make target speaker extraction more practical than blind separation.

Several target speaker separation approaches have been explored in the past [6, 7, 8]. In [8], the authors proposed a system named VoiceFilter that used d-vectors [9] as the target speaker embedding for the separation network. Similarly, [7] used a short anchor utterance as auxiliary input for target speaker separation. However, speaker- or utterance-based features are not robust enough, which can heavily degrade system performance, especially when noise is present or same-gender speakers are mixed.

In contrast, visual information is insensitive to acoustic noise and highly correlated with the speech content. Combining audio and visual information has previously been investigated for automatic speech recognition (ASR) [10, 11, 12], speech enhancement [13, 14] and speech separation [15, 16]. The results have shown the great potential of visual features as a complementary information source. For speech separation, [13] proposed an encoder-decoder model architecture to separate the voice of a visible speaker from background noise. The noisy spectrogram and center-cropped video frames were used as the inputs of the audio and visual encoders, respectively. [15] built a deeper network by stacking depth-wise separable convolution blocks, which includes a magnitude network to estimate Time-Frequency (TF) masks [17] and a phase network to predict the clean phase from the mixture phase. The magnitude network used pre-trained lip embeddings [18] as visual features. Similarly, [16] built a model based on dilated convolutions and LSTMs using face embeddings as visual features and yielded good results on a large-scale audio-visual dataset.

Most previous audio-visual separation systems handle the audio stream in the TF domain, and thus the accuracy of the estimated TF-mask is the key to the success of those systems. The phase of the signals, on the other hand, receives less attention, although it can significantly affect separation quality [19, 20, 21, 22]. To incorporate phase information, [15] used a phase subnet to refine the original noisy phase, and [16] adopted the newly proposed complex masks [20] instead of traditional real masks (e.g., IRM [1], PSM [23]). Recently, [5, 24] proposed a new encoder-decoder framework named TasNet, which separates speech directly in the time domain with impressive results on the public WSJ0-2mix dataset. In this paper, we propose a new separation network that generalizes TasNet to enable multi-modal fusion of auditory and visual signals. Given a raw waveform of mixture speech and the corresponding video stream of the target speaker, our audio-visual speech separation model can extract the audio of the target speaker directly.

The contributions of this work include: 1) A new structure for multi-modal speech separation is proposed. To illustrate the effectiveness of the structure, a comprehensive comparison is performed with three typical separation models: uPIT [1] (frequency-domain audio-only), Conv-TasNet [5] (time-domain audio-only) and Conv-FavsNet (frequency-domain audio-visual). Audio samples of the compared systems are available at https://funcwj.github.io/online-demo/page/tavs. 2) To the best of our knowledge, this is the first work that performs audio-visual separation directly in the time domain. Experiments on recently released in-the-wild videos [11] show that the proposed structure brings significant improvements over all the baseline models. 3) Previous visual features are not well designed for speech separation. In this work, the visual (lip) embeddings are specifically trained to represent phonetic information. Different modeling units for the embedding network, such as words, phonemes (CI-phone) and context-dependent phones (CD-phone), are also investigated and compared.

2 Proposed system

This section introduces the architecture of the proposed time-domain audio-visual speech separation network. The network is fed with chunks of raw waveform and the corresponding video frames, and directly predicts the speech of the target speaker, as depicted in Figure 1. Scale-invariant source-to-noise ratio (Si-SNR) is used as the training objective function [5], which is defined as

$$\text{Si-SNR}=20\log_{10}\frac{\|\alpha\cdot\mathbf{s}_{t}\|}{\|\mathbf{s}_{e}-\alpha\cdot\mathbf{s}_{t}\|},$$

where $\mathbf{s}_{e}$ and $\mathbf{s}_{t}$ are the estimated signal and the target source, respectively, both normalized to zero mean, and $\alpha$ is an optimal scaling factor computed via

$$\alpha=\mathbf{s}_{e}^{T}\mathbf{s}_{t}/\mathbf{s}_{t}^{T}\mathbf{s}_{t}.$$

We also use Si-SNR as the evaluation metric, which is thought to be a more robust measure of separation quality compared to the original SDR [25].
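The definition above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation; the function name `si_snr` and the small stabilizer `eps` are assumptions introduced here.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Si-SNR in dB between two 1-D signals (eps is an added stabilizer)."""
    s_e = estimate - np.mean(estimate)      # zero-mean estimated signal
    s_t = target - np.mean(target)          # zero-mean target source
    alpha = np.dot(s_e, s_t) / (np.dot(s_t, s_t) + eps)   # optimal scaling factor
    projection = alpha * s_t                # scaled target component
    residual = s_e - projection             # everything that is not target
    return 20.0 * np.log10(np.linalg.norm(projection) / (np.linalg.norm(residual) + eps))

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
estimate = target + 0.1 * rng.standard_normal(16000)

score = si_snr(estimate, target)
score_scaled = si_snr(3.0 * estimate, target)   # rescaling the estimate
print(score, score_scaled)
```

Because $\alpha$ absorbs any gain applied to the estimate, the two printed values agree, which is the scale invariance the metric is named for.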

2.1 Overview

The proposed structure is mainly inspired by the TasNet structure proposed in [24], which contains three parts: an audio encoder, an audio decoder and a separation network. Given an audio mixture chunk $\mathbf{x}=\{x_{0},x_{1},\cdots,x_{C}\}$ (where $C$ is the chunk size), the audio encoder in TasNet encodes the mixture samples as a sequence of non-negative vectors $\mathbf{w}_{x}$, while the separation network estimates a mask for each source in that space. The audio decoder then reconstructs each masked result back into the time domain. The whole framework can be described as

$$\mathbf{s}_{i}=\text{Decoder}(\mathbf{w}_{x}\odot\mathbf{m}_{i}),$$

where

$$\mathbf{w}_{x}=\text{Encoder}(\mathbf{x}),\qquad[\mathbf{m}_{0},\dots,\mathbf{m}_{N-1}]=\text{Separator}(\mathbf{w}_{x}).$$

$\odot$ denotes the Hadamard product. $N$ is the number of the speaker sources in the mixture. $\mathbf{m}_{i}$ and $\mathbf{s}_{i}$ represent the estimated masks and separation results of the $i$ -th speaker, respectively.

When the video stream is available, our motivation is to bias the separation network by feeding it both the audio representation $\mathbf{w}_{x}$ and the encoded video features $\mathbf{v}_{e}$. The separator then generates only the mask $\mathbf{m}_{t}$ of the target speaker, which can be written as

$$\mathbf{w}_{x}=\text{Encoder}^{a}(\mathbf{x}),\qquad\mathbf{v}_{e}=\text{Encoder}^{v}(\mathbf{v}_{t}),\qquad\mathbf{m}_{t}=\text{Separator}(\mathbf{w}_{x},\mathbf{v}_{e}),$$

where $\mathbf{v}_{t}$ denotes the image frame sequence of the target speaker, and $\text{Encoder}^{a/v}$ denotes the audio/video encoder. Finally, the separated result $\mathbf{s}_{t}$ is obtained through

$$\mathbf{s}_{t}=\text{Decoder}(\mathbf{w}_{x}\odot\mathbf{m}_{t}).$$

2.2 Video encoder

The video encoder is designed to extract the visual features $\mathbf{v}_{e}$ from the input image frame sequences. In our experiments, we use only the lip region because it encodes the contextual and phonetic information we need. The video encoder contains a lip embedding extractor, followed by several temporal convolutional blocks, as depicted in Figure 1.

The lip embedding extractor consists of a 3D convolution layer and an 18-layer ResNet [26], similar to the work in [15]. The extractor outputs a fixed-dimensional feature vector $\mathbf{l}_{e}$ for each video frame, and $\mathbf{l}_{e}$ is passed through the subsequent temporal convolutional blocks. Each block consists of a temporal convolution, preceded by ReLU activation and batch normalization [27]. Residual connections are also included, although they do not have a significant impact according to our results. To reduce the number of model parameters, we use depth-wise separable convolutions [28] instead of standard convolutions. In our setup, we use 256-dimensional lip embeddings, and a kernel size of 3, a stride of 1 and 512 channels for all the blocks.
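To make the parameter saving concrete, the following sketch counts the weights of one temporal convolution with the stated kernel size of 3 and 512 channels. It assumes that both the input and the output of the block have 512 channels and ignores bias terms and normalization parameters; the helper names are hypothetical.

```python
def conv1d_params(in_ch, out_ch, kernel):
    """Weights of a standard 1-D convolution (bias ignored)."""
    return in_ch * out_ch * kernel

def separable_conv1d_params(in_ch, out_ch, kernel):
    """Weights of a depth-wise convolution followed by a 1x1 point-wise convolution."""
    depthwise = in_ch * kernel      # one length-`kernel` filter per input channel
    pointwise = in_ch * out_ch      # 1x1 convolution that mixes the channels
    return depthwise + pointwise

standard = conv1d_params(512, 512, 3)
separable = separable_conv1d_params(512, 512, 3)
print(standard, separable, round(standard / separable, 2))  # 786432 263680 2.98
```

Under these assumptions the separable form needs roughly a third of the weights of a standard convolution per block.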

The lip embedding extractor is pre-trained separately with a specific back-end, following the steps introduced in [18]. After that, the extractor is kept frozen. Apart from the word-level classification target used in [15], we also try phoneme (CI-phone) and context-dependent phone (CD-phone) targets, borrowed from ASR, to improve the quality of the lip embeddings.

2.3 Audio encoder/decoder

The audio encoder and decoder perform 1D convolution and deconvolution on the mixed audio signal $\mathbf{x}$ and the masked encoded sequences, respectively. They can be represented as:

$$\text{Encoder}^{a}(\mathbf{x})=\text{ReLU}(\text{Conv1D}(\mathbf{x},K,S)),\qquad\text{Decoder}(\mathbf{x})=\text{ConvTrans1D}(\mathbf{x},K,S),$$

where $K$ and $S$ denote the kernel size and stride of the 1D convolution, respectively. In this paper, we use $K=40$ and $S=20$ by default.
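As a rough illustration of what $K=40$ and $S=20$ imply for a 2s chunk at 16kHz, the NumPy sketch below implements a strided 1D convolution and its transposed counterpart with random weights. It assumes a ReLU after the encoder convolution (consistent with the non-negative representation described in Section 2.1) and uses an arbitrary number of filters `N`, since the filter count is not given in the text.

```python
import numpy as np

K, S = 40, 20   # kernel size and stride from the text
N = 8           # number of encoder filters: arbitrary illustrative value

def encode(x, weight):
    """Strided 1-D convolution plus ReLU. x: (T,), weight: (N, K) -> (frames, N)."""
    num_frames = (len(x) - K) // S + 1
    frames = np.stack([x[i * S:i * S + K] for i in range(num_frames)])
    return np.maximum(frames @ weight.T, 0.0)

def decode(w, weight):
    """Transposed 1-D convolution via overlap-add. w: (frames, N), weight: (N, K)."""
    num_frames = w.shape[0]
    out = np.zeros((num_frames - 1) * S + K)
    segments = w @ weight                    # one length-K segment per frame
    for i in range(num_frames):
        out[i * S:i * S + K] += segments[i]  # overlapping halves are summed
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(2 * 16000)           # 2 s chunk at 16 kHz
enc_w = rng.standard_normal((N, K))
dec_w = rng.standard_normal((N, K))

w_x = encode(x, enc_w)
y = decode(w_x, dec_w)
print(w_x.shape, y.shape)                    # (1599, 8) (32000,)
```

With these settings each encoder frame covers 2.5ms of audio and advances by 1.25ms, so a 2s chunk yields 1599 frames, and the decoder returns a waveform of the original length.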

2.4 Separation network

The separation network estimates the mask $\mathbf{m}_{t}$ of the target speaker, conditioned on the encoded audio and visual features. It is a stack of several temporal dilated convolutional blocks and provides a simple mechanism to handle feature fusion and synchronization. The structure of the 1D dilated convolutional block in the separation network is similar to the one used in Conv-TasNet. In this paper, we denote it as $\text{Conv}_{s}$. Each $\text{Conv}_{s}$ has $D$ sub-blocks with exponentially growing dilation factors $2^{d}$, where $d\in\{0,\dots,D-1\}$, as shown in Figure 1. In our setup, we use $D=8$.
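The effect of the exponential dilation schedule on temporal context can be estimated with a short calculation. The sketch assumes that every sub-block contains a single dilated convolution with kernel size 3; the kernel size of the separation network is not stated in this section, so that value is an assumption.

```python
D = 8        # sub-blocks per Conv_s, from the text
KERNEL = 3   # assumed kernel size of each dilated convolution

dilations = [2 ** d for d in range(D)]

def receptive_field(num_blocks, kernel=KERNEL):
    """Receptive field, in encoder frames, of num_blocks stacked Conv_s blocks."""
    return 1 + num_blocks * (kernel - 1) * sum(dilations)

print(dilations)            # [1, 2, 4, 8, 16, 32, 64, 128]
print(receptive_field(1))   # 511
print(receptive_field(4))   # 2041
```

At a frame shift of 1.25ms ($S=20$ samples at 16kHz), 511 frames correspond to roughly 0.64s of audio for a single block under these assumptions.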

The output of the audio encoder $\mathbf{w}_{x}$ is first passed through $N_{a}$ convolutional blocks

$$\mathbf{a}_{e}=\underbrace{\text{Conv}_{s}(\dots\text{Conv}_{s}(\mathbf{w}_{x}))}_{N_{a}}$$

and then fused with the visual features $\mathbf{v}_{e}$. The fusion is performed through a simple concatenation over the convolution channel dimension, followed by a position-wise projection $\mathcal{P}$ to reduce the feature dimension. To synchronize the time resolution of the audio and video features, the video stream is up-sampled before concatenation if necessary. The description above can be written as:

$$\mathbf{f}=\mathcal{P}([\mathbf{a}_{e};\text{Upsample}(\mathbf{v}_{e})])$$
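The fusion step can be sketched as follows. The up-sampling method is not specified in the text, so nearest-neighbour repetition is assumed here; the feature sizes and the random projection matrix are placeholders chosen only to keep the example small.

```python
import numpy as np

def upsample_nearest(v, num_frames):
    """Repeat video frames so that v (T_v, C_v) is stretched to num_frames rows."""
    idx = np.arange(num_frames) * v.shape[0] // num_frames
    return v[idx]

def fuse(a_e, v_e, proj):
    """Concatenate along channels, then apply a position-wise linear projection."""
    v_up = upsample_nearest(v_e, a_e.shape[0])
    concat = np.concatenate([a_e, v_up], axis=1)   # (T_a, C_a + C_v)
    return concat @ proj                           # same projection at every frame

rng = np.random.default_rng(0)
a_e = rng.standard_normal((1599, 16))   # audio features for a 2 s chunk
v_e = rng.standard_normal((50, 8))      # video features: 2 s at 25 fps
proj = rng.standard_normal((16 + 8, 16))

f = fuse(a_e, v_e, proj)
v_up = upsample_nearest(v_e, a_e.shape[0])
print(f.shape)                          # (1599, 16)
```

Each video frame is repeated about 32 times in this setting, since the audio encoder produces 1599 frames for the 50 video frames of a 2s chunk.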

Finally, the fused features $\mathbf{f}$ are fed through $N_{f}$ convolutional blocks and the target mask $\mathbf{m}_{t}$ is estimated:

$$\mathbf{m}_{t}=\sigma(\underbrace{\text{Conv}_{s}(\dots\text{Conv}_{s}(\mathbf{f}))}_{N_{f}}).$$

Here $\sigma$ denotes an arbitrary non-linear function. We use ReLU in our experiments, without loss of generality.

We mainly use two normalization techniques in the separation network: batch normalization (BN) and global layer normalization (gLN). gLN normalizes the features over both the time and channel dimensions. Although modules that apply gLN cannot form a causal system, gLN gives better performance in practice in Conv-TasNet.
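A minimal NumPy sketch of gLN for a single utterance is given below. The per-channel gain `gamma` and bias `beta` follow the usual Conv-TasNet formulation and are an assumption here, as is the stabilizer `eps`.

```python
import numpy as np

def global_layer_norm(x, gamma, beta, eps=1e-8):
    """x: (channels, time). Mean and variance are taken over both dimensions."""
    mean = x.mean()
    var = x.var()
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = 3.0 + 2.0 * rng.standard_normal((4, 100))
y = global_layer_norm(x, np.ones((4, 1)), np.zeros((4, 1)))
print(y.mean(), y.var())
```

The statistics depend on every time step of the input, including future frames, which is why a module using gLN cannot be causal.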

3 Experiments and results

3.1 Dataset

In the experiments, we created two-speaker and three-speaker mixtures using utterances from the Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset [11], which consists of thousands of spoken sentences from BBC television with their corresponding transcriptions. The training, validation and test sets are generated according to the broadcast date, and thus do not overlap.

Short utterances (less than 2s) are dropped and 21075 utterances in total are used for data generation, with 19445 for training and the rest for validation and testing. The details of the utterances used for simulation are summarized in Table 1. Two- and three-speaker mixtures are generated by randomly selecting different utterances and mixing them at various signal-to-noise ratios (SNR) between -5 dB and 5 dB. The sampling rate is 16kHz. To ensure that the video of each source is available throughout a mixture, longer sources are truncated to align with the shortest one. The source segments are synchronized with the video stream at 25 fps. Finally, we simulated 40k (25h+ in total), 5k and 3k utterances for the training, validation and test sets, respectively.
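The two-speaker mixing step can be sketched as below. The function `mix_at_snr` is hypothetical and covers only the two-source case; how the SNR values are drawn within the -5 dB to 5 dB range and how three sources are scaled are not specified in the text.

```python
import numpy as np

def mix_at_snr(target, interferer, snr_db):
    """Scale `interferer` so the target-to-interferer power ratio equals snr_db."""
    n = min(len(target), len(interferer))      # truncate to the shortest source
    target, interferer = target[:n], interferer[:n]
    p_t = np.mean(target ** 2)
    p_i = np.mean(interferer ** 2)
    gain = np.sqrt(p_t / (p_i * 10.0 ** (snr_db / 10.0)))
    scaled = gain * interferer
    return target + scaled, target, scaled

rng = np.random.default_rng(0)
src_a = rng.standard_normal(32000)             # 2.0 s at 16 kHz
src_b = 0.3 * rng.standard_normal(40000)       # 2.5 s, truncated during mixing

mixture, target_used, scaled = mix_at_snr(src_a, src_b, snr_db=-5.0)
measured = 10.0 * np.log10(np.mean(target_used ** 2) / np.mean(scaled ** 2))
print(len(mixture), measured)
```

The measured ratio reproduces the requested SNR, and the mixture has the length of the shorter source.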

3.2 Training details

Similar to [15], we first train the lip embedding extractor on the LRW dataset. This is a word-level classification task and we achieve 76.02% classification accuracy on the test set [18]. However, the LRW dataset does not provide utterance-level audio transcripts, which makes it infeasible to replace the word-level training targets with smaller units, e.g., the CD/CI-phones mentioned above. Instead, we use LRS2's pre-train set to train the phone-level lip embedding extractor.

The alignments are derived from GMM acoustic models following the Kaldi [29] recipes and are sub-sampled to the video frame rate before training. We choose 44 units from the CMU dictionary for CI-phones and obtain 3048 units from the alignment set for CD-phones. The video frames are converted to grayscale and normalized with respect to the global mean and variance. The training procedure is similar to that of the word-level task, except that a frame-level cross-entropy loss is used instead of a sequence-level one.

The audio-visual network is trained on 2s audio/video chunks using the Adam [30] optimizer for 80 epochs, with early stopping when the validation loss does not improve for 6 epochs. The initial learning rate is set to $10^{-3}$ and is halved during training whenever the validation loss does not improve for 3 epochs.
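The schedule described above can be simulated on a list of validation losses. This is one plausible reading, not the authors' training script: the text does not say whether the patience counter restarts after a halving, and the sketch assumes it keeps counting from the last improvement. The function name and the toy loss values are made up for illustration.

```python
def run_schedule(val_losses, lr=1e-3, halve_patience=3, stop_patience=6, max_epochs=80):
    """Return the learning rate in effect after each epoch that was run."""
    best = float("inf")
    since_best = 0
    history = []
    for loss in val_losses[:max_epochs]:
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best % halve_patience == 0:
                lr *= 0.5                      # halve after 3 stale epochs
        history.append(lr)
        if since_best >= stop_patience:        # early stopping after 6 stale epochs
            break
    return history

# Toy validation curve: two improving epochs, then a plateau.
history = run_schedule([1.0, 0.9] + [0.95] * 10)
print(len(history), history[-1])               # 8 0.00025
```

In this toy run the rate is halved twice and training stops at the eighth epoch, six epochs after the last improvement.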

3.3 Results and comparisons

On both datasets, we first report the results of two typical oracle masks as well as two conventional audio-only methods: uPIT-BLSTM and Conv-TasNet [5], which operate in the frequency and time domain, respectively. The oracle mask results are the upper bounds of the corresponding mask-based methods. uPIT-BLSTM adopts a 3-layer Bi-LSTM structure and uses the PSM (phase-sensitive mask [23]) as the training target. For Conv-TasNet, we choose $K=40,S=20$ in the audio encoder/decoder and a larger bottleneck size (384) in the separation network for better performance. Results are shown in Table 2. It is worth mentioning that the audio streams in LRS2 are not as clean as those in WSJ0, so the results of these audio-only models are worse than previously reported results on WSJ0-2mix, especially as the number of speakers increases.

3.3.1 Results with different lip embeddings

In Table 3, we evaluate the proposed audio-visual models with the word-level lip embeddings trained on the LRW dataset, which already lead to significant improvements over the audio-only methods in Table 2. As the resolution of the video frames in the LRS2 dataset does not match that of LRW, we crop the center 70 $\times$ 70 pixel region of the images and then re-sample it to 112 $\times$ 112. Replacing word-level targets with CI-phones leads to better results, particularly on the more difficult three-speaker test set, as shown in Table 3.

Normalization techniques are also critical in our experiments. In Table 3, replacing batch normalization (BN) with global layer normalization (gLN) improves results on both the two- and three-speaker mixture datasets. In our preliminary experiments, the residual connection in the video block has little effect on the final results. Increasing the number of blocks in the video encoder also brings no significant improvement, possibly because the visual features are already well trained.

3.3.2 Impact of separation networks

The final results largely depend on the target speaker masks produced by the separation network. In addition to the normalization techniques mentioned in Section 3.3.1, we also tuned the values of $N_{a}$ and $N_{f}$ in the separation network. With the total number of blocks fixed (4 in this work), increasing $N_{f}$ gives the fused features more context. Table 4 shows that $N_{a}=1$ and $N_{f}=3$ achieve the best performance.

Based on the above discussion, we further compared CD-phones and CI-phones on the three-speaker dataset with different pairs of $N_{a}$ and $N_{f}$. The results show that using CD-phones as training targets gives slightly worse results, although they are more suitable for acoustic modeling tasks. We therefore choose CI-phones in the following experiments.

3.3.3 Comparison with the frequency-domain networks

To further investigate the advantages of the time-domain audio-visual approach, we trained a frequency-domain audio-visual separation model with a similar parameter size on the same dataset for comparison. We call it Conv-FavsNet in this paper. Conv-FavsNet removes the audio encoder from the proposed architecture and replaces the decoder with a linear layer, which transforms the output of the convolutional blocks to TF-masks. We use a linear spectrogram computed with a 40ms Hann window and a 10ms shift as the input audio features and the PSM as the training target. The loss function for phase-sensitive spectrum approximation (PSA) is defined as:

$$\mathcal{L}=\|\mathbf{s}_{t}\odot\max\{\cos(\angle\mathbf{s}_{m}-\angle\mathbf{s}_{t}),0\}-\mathbf{s}_{m}\odot\mathbf{m}_{t}\|_{2}^{2},$$

where $\mathbf{m}_{t}$ denotes the estimated target speaker mask, and $\mathbf{s}_{t},\mathbf{s}_{m}$ denote the magnitudes of the target speaker and mixture signals, respectively. Results are shown in Table 5. The noisy phase used in the iSTFT clearly limits the quality of the reconstructed waveform: using the oracle phase instead brings an additional Si-SNR improvement of about 3.5dB (from 10.36dB to 13.87dB on two speakers and from 6.16dB to 9.65dB on three), which brings the frequency-domain model to roughly the level of our time-domain audio-visual results.
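The PSA loss can be written out directly on complex STFT arrays. The sketch below takes the complex spectra as inputs and derives magnitudes and phases from them; the function name and the small random arrays are assumptions used only to check the formula.

```python
import numpy as np

def psa_loss(mask, mix_stft, target_stft):
    """Phase-sensitive spectrum approximation loss on complex STFT arrays."""
    s_m = np.abs(mix_stft)                        # mixture magnitude
    s_t = np.abs(target_stft)                     # target magnitude
    phase_diff = np.angle(mix_stft) - np.angle(target_stft)
    truncated_target = s_t * np.maximum(np.cos(phase_diff), 0.0)
    return np.sum((truncated_target - s_m * mask) ** 2)

rng = np.random.default_rng(0)
shape = (5, 4)                                    # (frames, frequency bins), toy size
target = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mix = target + rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# The truncated phase-sensitive mask is the minimizer of this loss.
oracle_psm = (np.abs(target) * np.maximum(np.cos(np.angle(mix) - np.angle(target)), 0.0)
              / np.abs(mix))
loss_oracle = psa_loss(oracle_psm, mix, target)
loss_ones = psa_loss(np.ones(shape), mix, target)
print(loss_oracle, loss_ones)
```

The loss vanishes for the oracle mask and is positive for an all-ones mask, which passes the mixture through unchanged.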

3.3.4 Multi-speaker training

For the biased/informed separation system, we find that models trained on hard tasks can work well on easy tasks. For example, in our experiments, the model trained on the three-speaker dataset also performs well on two-speaker mixtures. This motivates us to train models with a blend of two- and three-speaker mixtures, which is called multi-speaker training in [3].

In fact, the architecture of the target isolating network is independent of the number of speakers. The label of each training sample is determined by the auxiliary features, e.g., the lip embeddings in this work. This suits real-world scenarios, where the number of speakers is usually difficult to detect. The performance of the models trained on two-speaker mixtures only, on three-speaker mixtures only, and with multi-speaker training is shown in Table 6. The model trained on three-speaker mixtures generalizes to two-speaker scenarios, although it performs slightly worse than the two-speaker model. Multi-speaker training improves results on the two-speaker test set across all the setups in Table 6. The best configuration with multi-speaker training achieves 14.02dB and 9.92dB Si-SNR on the two- and three-speaker test sets, respectively.

4 Conclusions

In this paper, we have proposed a time-domain audio-visual target speech separation architecture that takes the raw waveform and the target speaker's video stream and predicts the target speech waveform directly. A lip embedding extractor is pre-trained to extract movement information from the video streams. We find that both word-level and phoneme-level lip embeddings benefit the separation network. Compared with the audio-only methods and the frequency-domain audio-visual method, the proposed approach improves Si-SNR by more than 3dB and 4dB on the two- and three-speaker test sets, respectively.

Bibliography

[1] Morten Kolbæk, Dong Yu, Zheng-Hua Tan, and Jesper Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 10, pp. 1901–1913, 2017.
[2] John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 31–35.
[3] Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, and John R Hershey, “Single-channel multi-speaker separation using deep clustering,” arXiv preprint arXiv:1607.02173, 2016.
[4] Zhuo Chen, Yi Luo, and Nima Mesgarani, “Deep attractor network for single-microphone speaker separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 246–250.
[5] Yi Luo and Nima Mesgarani, “Tasnet: time-domain audio separation network for real-time, single-channel speech separation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 696–700.
[6] Katerina Zmolikova, Marc Delcroix, Keisuke Kinoshita, Takuya Higuchi, Atsunori Ogawa, and Tomohiro Nakatani, “Speaker-aware neural network based beamformer for speaker extraction in speech mixtures,” in Interspeech, 2017, pp. 2655–2659.
[7] Jun Wang, Jie Chen, Dan Su, Lianwu Chen, Meng Yu, Yanmin Qian, and Dong Yu, “Deep extractor network for target speaker recovery from single channel speech mixtures,” arXiv preprint arXiv:1807.08974, 2018.
[8] Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin Wu, John Hershey, Rif A Saurous, Ron J Weiss, Ye Jia, and Ignacio Lopez Moreno, “Voicefilter: Targeted voice separation by speaker-conditioned spectrogram masking,” arXiv preprint arXiv:1810.04826, 2018.