Locale Encoding For Scalable Multilingual Keyword Spotting Models

Pai Zhu; Hyun Jin Park; Alex Park; Angelo Scorza Scarpati; Ignacio Lopez Moreno

arXiv:2302.12961·cs.CL·February 28, 2023

TL;DR

This paper introduces locale-conditioned universal models for multilingual keyword spotting, significantly improving accuracy and reducing false rejection rates across multiple languages and noise conditions.

Contribution

It proposes two locale-conditioning methods, locale feature concatenation (Concat) and feature-wise linear modulation (FiLM), to enhance multilingual KWS performance over traditional monolingual and universal models.

Findings

01. FiLM achieved a 61% relative reduction in average false reject rate compared to monolingual models of similar size.

02. Locale-conditioned models outperform baseline methods across all tested languages.

03. The improvements hold in both regular and challenging (noisy) acoustic conditions.

Abstract

A Multilingual Keyword Spotting (KWS) system detects spoken keywords over multiple locales. Conventional monolingual KWS approaches do not scale well to multilingual scenarios because of high development/maintenance costs and lack of resource sharing. To overcome this limit, we propose two locale-conditioned universal models with locale feature concatenation and feature-wise linear modulation (FiLM). We compare these models with two baseline methods: locale-specific monolingual KWS, and a single universal model trained over all data. Experiments over 10 localized language datasets show that locale-conditioned models substantially improve accuracy over baseline methods across all locales in different noise conditions. FiLM performed the best, improving on average FRR by 61% (relative) compared to monolingual KWS models of similar sizes.

Tables

Table 1: FRRs (%) of different models on 10 locale datasets under regular (Eval-reg) and challenging (Eval-chall) acoustic conditions, for the 330K-parameter model $M_{R}$. The thresholds are chosen to give the same target FAh (0.17) on the negative audio set.

Condition	Model	DA_DK	DE_DE	ES_ES	FR_FR	IT_IT	KO_KR	NL_NL	PT_BR	SV_SE	TH_TH	AVERAGE
Eval-reg	Locale Specific Models	22.61	2.49	5.67	8.47	6.09	11.57	5.81	6.92	19.35	3.10	9.21
	Universal Model	6.15	4.66	10.89	10.66	6.34	10.05	4.69	7.58	14.89	6.64	8.26
	Locale Concat Model	1.92	2.97	11.16	14.07	3.95	5.29	1.18	2.53	4.79	3.13	5.10
	Locale FiLM Model	1.62	2.16	4.65	7.26	3.25	6.29	2.58	2.38	4.20	1.66	3.60
Eval-chall	Locale Specific Models	51.19	24.65	28.84	38.45	29.75	55.52	19.57	31.99	52.60	10.09	34.26
	Universal Model	44.05	45.17	48.73	50.12	31.38	46.33	20.37	35.92	48.64	34.20	40.49
	Locale Concat Model	16.39	28.47	40.81	43.75	20.41	31.55	6.21	13.41	24.83	16.25	24.21
	Locale FiLM Model	12.33	17.17	21.82	20.65	16.37	30.47	10.52	13.51	20.72	8.53	17.21

Table 2: FRRs (%) of different models on 10 locale datasets under regular (Eval-reg) and challenging (Eval-chall) acoustic conditions, for the 1.4M-parameter model $M_{L}$. The thresholds are chosen to give the same target FAh (0.17) on the negative audio set.

Condition	Model	DA_DK	DE_DE	ES_ES	FR_FR	IT_IT	KO_KR	NL_NL	PT_BR	SV_SE	TH_TH	AVERAGE
Eval-reg	Locale Specific Models	9.74	2.58	4.63	7.09	4.72	7.97	8.59	6.66	7.80	6.70	6.65
	Universal Model	3.59	3.29	7.47	8.13	4.27	6.23	2.19	4.31	8.44	4.87	5.28
	Locale Concat Model	1.64	2.03	5.41	6.31	2.78	7.32	0.85	2.25	4.51	1.76	3.49
	Locale FiLM Model	1.94	1.51	5.81	4.93	1.93	4.50	1.02	1.79	3.37	1.53	2.83
Eval-chall	Locale Specific Models	28.84	25.76	25.94	34.68	24.80	46.91	22.53	31.38	25.84	22.31	28.90
	Universal Model	28.54	30.79	37.61	32.11	22.90	32.20	11.29	21.26	32.96	24.45	27.41
	Locale Concat Model	14.65	16.27	22.61	20.62	13.72	28.94	3.80	11.09	19.40	9.35	16.04
	Locale FiLM Model	14.05	14.05	23.68	17.96	10.92	23.09	3.83	9.48	18.15	7.24	14.24

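Equations

$$\theta_{l}=\arg\min_{\theta_{l}} E_{(x,y)}\left[\mathrm{Loss}(f(x;\theta_{l}),\,y)\right],\quad\text{where }(x,y)\in(X_{l},Y_{l}),\;l\in\{1..N\}$$

$$\theta_{\tt univ}=\arg\min_{\theta_{\tt univ}} E_{(x,y)}\left[\mathrm{Loss}(f(x;\theta_{\tt univ}),\,y)\right],\quad\text{where }(x,y)\in\bigcup_{l=1..N}(X_{l},Y_{l})$$

$$\theta_{\tt cond}=\arg\min_{\theta_{\tt cond}} E_{(x,y)}\left[\mathrm{Loss}(f(x,l;\theta_{\tt cond}),\,y)\right],\quad\text{where }(x,y)\in(X_{l},Y_{l}),\;l\in\{1..N\}$$

$$P^{mod}={\tt FiLM}(P\mid\gamma,\beta)=\gamma\cdot P+\beta$$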

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Natural Language Processing Techniques

Full text

Locale Encoding for scalable multilingual keyword spotting models

Abstract

A Multilingual Keyword Spotting (KWS) system detects spoken keywords over multiple locales. Conventional monolingual KWS approaches do not scale well to multilingual scenarios because of high development/maintenance costs and lack of resource sharing. To overcome this limit, we propose two locale-conditioned universal models with locale feature concatenation and feature-wise linear modulation ( ${\tt FiLM}$ ). We compare these models with two baseline methods: locale-specific monolingual KWS, and a single universal model trained over all data. Experiments over 10 localized language datasets show that locale-conditioned models substantially improve accuracy over baseline methods across all locales in different noise conditions. ${\tt FiLM}$ performed the best, improving on average FRR by 61% (relative) compared to monolingual KWS models of similar sizes.

Index Terms— Multilingual Keyword detection, Keyword spotting, Locale Encoding, Locale Conditioning.

1 Introduction

Production-grade keyword-spotting (KWS) systems are trained to recognize keywords from a continuous stream of speech. They operate in a resource-constrained, noisy environment.

Previously, research on this problem has focused on issues like noise robustness, reducing dependency on data volume and label quality, minimizing computing cost, and improving detection accuracy [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. Most of the cited research addresses keyword spotting in a single specific language (locale). For production-grade systems, however, it is important to scale to numerous international languages. Similar to efforts in multilingual speech recognition [13] and multilingual speaker recognition [14], a universal multilingual KWS model would not only drastically reduce the cost of training, but also greatly simplify model deployment and reduce maintenance cost.

In this paper, we discuss and explore a scalable approach to creating KWS models that cover numerous international languages while minimizing the cost of development at reasonable quality.

For the multilingual keyword spotting problem, we want to develop localized model(s) which can detect a desired keyword in various languages. A naïve approach is to just develop a monolingual KWS model, and repeat the same process for other languages, simply switching out the training data and localized keyword. This process will yield a set of $N$ locale specific models given $N$ locales. It can serve as a simple baseline, with the drawback of having high maintenance costs and limited use of shared linguistic information across all of the training data. Locale-specific data pre-processing and model training costs scale linearly with the number of locales, which can be prohibitive for tens or hundreds of locales. Moreover, many common properties of acoustic data are likely helpful across all locales, and are not exploited by the repeated monolingual approach.

To overcome these limitations, we consider three new approaches for sharing information between locales, while minimizing development costs relative to the baseline. First, we consider a fully universal model, which is a single model trained with the union of data from all locales. Second, we propose two locale-conditioned universal models based on different conditioning methods: ${\tt Concat}$ and ${\tt FiLM}$. With the ${\tt Concat}$ method, a locale encoding is concatenated to the output of the intermediate layer connecting the encoder and decoder. With the ${\tt FiLM}$ (Feature-wise Linear Modulation) [15] method, a locale encoding is used to modulate the same intermediate layer output. A locale conditioned universal model is a single model trained on the data from all locales together, but requiring the locale identity as an auxiliary input. We compare these locale conditioned multilingual KWS methods with the monolingual (locale specific) KWS method and the fully shared universal model. The rest of the paper is organized as follows: we discuss related work in Section 2, and describe the proposed approach in Section 3. We discuss the experimental setup and the results in Sections 4 and 5, and conclude in Section 6.

2 Related Work and Background

2.1 Related Work

In [16], multilingual KWS was explored by merging acoustically similar phonemes from two languages, and building a shared phoneme encoder based on HMM-NN. In [17, 18], a bottleneck feature encoder was trained to detect the union of phonemes from multiple languages to address multilingual KWS. More recently, [19] showed that multilingual KWS models can benefit from learning shared embedding features. [20] also trained an embedding model using multilingual data, which was shown to generalize to unseen languages in a few-shot setup. A common theme in these related works is the idea of learning shared representations that can generalize across locales, without explicitly providing the locale information as an input. In [21, 22], the authors showed that conditioning an ASR model on a one-hot locale encoding is highly effective for multilingual generalization of the ASR model. With this work serving as inspiration, we propose new locale-conditioned KWS approaches, which use shared parameters across different locales and also allow conditioning by an auxiliary locale encoding input.

2.2 Baseline Model Architecture

For all models in this paper, we use an encoder-decoder architecture consisting of an encoder network (4 convolutional layers) followed by a decoder network (3 convolutional layers) [2, 3, 4]. Note that the convolution layers here are simplified versions; they are described more fully in [23, 2]. We refer to the connection between the encoder and decoder networks as the bottleneck layer, which projects the output of the final encoder convolution layer to a dimension matching the input to the decoder network. We use $P$ to denote the encoder logits, which are the outputs of this bottleneck layer.

The baseline encoder-decoder model is trained in a supervised manner, with training examples $(x,y)$ where $x$ is a sequence of input spectral feature vectors, and $y$ is a sequence of target labels for encoder and decoder logits. We use cross-entropy loss following [2] for the baseline and proposed models.

3 Proposed Method

Fig.1 summarizes three different scaling approaches to the multilingual keyword spotting problem. Fig.1 (a) shows a simple repetition approach where we repeat training of $N$ individual models using localized data for each locale. Fig.1 (b) shows a fully shared model approach where we train a single model with training data from all locales. Fig.1 (c) shows a locale conditioned model approach where we train a single model with training data from all locales and the corresponding locale information. Throughout this paper, we denote $(X,Y)$ as a composite sequence defined as $((x_{i},y_{i})|i=1..k)$, given the original feature sequence $X=(x_{i}|i=1..k)$ and label sequence $Y=(y_{i}|i=1..k)$ of length $k$. $X_{l}$ and $Y_{l}$ denote the feature and label sequences from locale $l$, respectively.

3.1 Locale Specific Models

A locale specific model for locale $l$ can be defined as $M_{l}=f(x;\;\theta_{l})$ where $x$ is the input features, and $\theta_{l}$ is the set of trainable parameters for each $M_{l}$ (see Fig.1-a). Such models can be trained by minimizing the expected loss for each model,

$$\theta_{l}=\arg\min_{\theta_{l}} E_{(x,y)}\left[\mathrm{Loss}(f(x;\theta_{l}),\,y)\right],\quad\text{where }(x,y)\in(X_{l},Y_{l}),\;l\in\{1..N\}.$$

Here, we use $E_{\ast}[\ldots]$ to denote expectation over $\ast$ .

3.2 Fully Shared Universal Model

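A fully shared universal model can be defined as $M_{\tt univ}=f(x;\;\theta_{\tt univ})$ where $\theta_{\tt univ}$ is a single set of parameters shared across all locales (see Fig.1-b). In this case, there is only a single model, trained on the pooled data,

$$\theta_{\tt univ}=\arg\min_{\theta_{\tt univ}} E_{(x,y)}\left[\mathrm{Loss}(f(x;\theta_{\tt univ}),\,y)\right],\quad\text{where }(x,y)\in\bigcup_{l=1..N}(X_{l},Y_{l}).$$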

3.3 Locale Conditioned Universal Model

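A locale conditioned model can be described similarly to the previously defined universal model, $M_{\tt cond}=f(x,l;\theta_{\tt cond})$, except for the additional locale encoding input, $l$. As with $\theta_{\tt univ}$, the parameters $\theta_{\tt cond}$ are shared across all locales (see Fig.1-c),

$$\theta_{\tt cond}=\arg\min_{\theta_{\tt cond}} E_{(x,y)}\left[\mathrm{Loss}(f(x,l;\theta_{\tt cond}),\,y)\right],\quad\text{where }(x,y)\in(X_{l},Y_{l}),\;l\in\{1..N\}.$$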
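The three training regimes of Sections 3.1 to 3.3 differ mainly in how examples are grouped and in whether the locale index reaches the model. A minimal Python sketch of that grouping follows; the toy data, the locale indices, and the helper names are illustrative assumptions and do not come from the paper.

```python
# Toy per-locale datasets: locale index -> list of (x, y) pairs.
# The values are placeholders; real x would be spectral feature frames.
data_by_locale = {
    1: [([0.1, 0.2], 0), ([0.3, 0.1], 1)],
    2: [([0.5, 0.4], 1)],
    3: [([0.9, 0.7], 0), ([0.2, 0.8], 1), ([0.6, 0.6], 0)],
}


def locale_specific_sets(data):
    """(a) One training set per locale; N separate models would be trained."""
    return {l: list(examples) for l, examples in data.items()}


def universal_set(data):
    """(b) A single pooled set (union over locales); the locale is discarded."""
    return [(x, y) for l in sorted(data) for (x, y) in data[l]]


def conditioned_set(data):
    """(c) The same pooled set, but each example keeps its locale index l."""
    return [(x, l, y) for l in sorted(data) for (x, y) in data[l]]


specific = locale_specific_sets(data_by_locale)
pooled = universal_set(data_by_locale)
conditioned = conditioned_set(data_by_locale)
```

In this sketch, regimes (b) and (c) see exactly the same examples; the only difference is the extra locale index carried alongside each feature sequence.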

We experiment with two methods for locale conditioning: concatenation ( ${\tt Concat}$ ), and modulation ( ${\tt FiLM}$ ).

3.3.1 Concatenation

As shown in Fig.2 (middle), each training example (i.e. utterance) comes with a locale index, $l\in\{1..N\}$, denoting the locale from which the utterance originates. We represent locales as one-hot vectors, $L$, of length $N$: the position of the locale index has value 1 and all other positions are 0. With the ${\tt Concat}$ approach, we simply concatenate $L$ with the encoder logits, $P$, and use the resulting combined tensor as the input to the decoder network. The extra size introduced to the model is $N\times D_{1}$ where $D_{1}$ is the first decoder layer input dimension.
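The concatenation step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: it assumes the one-hot vector is appended to every frame of the encoder logits, and the function names and the 4-dimensional toy logits are hypothetical (the paper uses $|P|=32$).

```python
def one_hot(locale_index, num_locales):
    """Return the length-N one-hot vector L for a 1-based locale index."""
    if not 1 <= locale_index <= num_locales:
        raise ValueError("locale index out of range")
    return [1.0 if i == locale_index else 0.0
            for i in range(1, num_locales + 1)]


def concat_condition(encoder_logits, locale_index, num_locales):
    """Append L to each frame of the encoder logits P (a list of frames).

    Assumption: the same locale vector is appended to every frame.
    """
    locale_vec = one_hot(locale_index, num_locales)
    return [list(frame) + locale_vec for frame in encoder_logits]


# Toy example: 2 frames of 4-dim logits, N = 10 locales, locale index 3.
P = [[0.1, -0.3, 0.7, 0.0], [0.2, 0.5, -0.1, 0.4]]
decoder_input = concat_condition(P, locale_index=3, num_locales=10)
```

Each frame passed to the decoder then has dimension $|P|+N$, which is why the first decoder layer needs additional weights for the $N$ locale inputs.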

3.3.2 Modulation

Another approach to conditioning is to use the locale information to modulate the existing encoder logits. Feature-wise Linear Modulation ( ${\tt FiLM}$ ) [15] learns to adaptively influence the neural network output by applying an affine transform to the network’s intermediate features based on an external input. In this case, the external input is the one-hot encoded locale $L$ as defined in Section 3.3.1, and the intermediate features are the above mentioned encoder logits $P$ .

Formally, ${\tt FiLM}$ learns element-wise modulation (scale) and bias (shift) functions which can be implemented as simple learnable projection layers, $f$ and $h$ , respectively. As shown in Fig. 2 (right), $f$ projects $L$ to create a modulation factor, $\gamma$ , and $h$ projects $L$ to create a bias factor, $\beta$ . Both $\gamma$ and $\beta$ have the same dimension as $P$ , the encoder logits. The conditioning is then applied on $P$ to produce the modulated input to the decoder network,

$$P^{mod}={\tt FiLM}(P\mid\gamma,\beta)=\gamma\cdot P+\beta,$$

where $\cdot$ denotes elementwise multiplication. The number of extra parameters introduced with $f$ and $h$ is $2\times M\times N$ where $M$ is the size of encoder logits and $N$ is the number of locales.
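A small Python sketch of this modulation follows. It assumes that $f$ and $h$ are bias-free linear projections (which matches the stated count of $2\times M\times N$ extra parameters), that the same $\gamma$ and $\beta$ are applied to every frame, and that the weights start near 1 and 0 respectively; the initialization, the toy dimension $M=4$, and all names are assumptions made for illustration.

```python
import random


def one_hot(locale_index, num_locales):
    """Length-N one-hot vector L for a 1-based locale index."""
    return [1.0 if i == locale_index else 0.0
            for i in range(1, num_locales + 1)]


def make_projection(num_locales, dim, init, rng):
    """An N x M weight matrix, randomly perturbed around `init`."""
    return [[init + rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_locales)]


def project(locale_vec, weights):
    """Bias-free linear projection of L; for a one-hot L this picks one row."""
    dim = len(weights[0])
    return [sum(locale_vec[n] * weights[n][m] for n in range(len(locale_vec)))
            for m in range(dim)]


def film(encoder_logits, gamma, beta):
    """P_mod = gamma * P + beta, applied element-wise to every frame."""
    return [[g * p + b for p, g, b in zip(frame, gamma, beta)]
            for frame in encoder_logits]


rng = random.Random(0)
N, M = 10, 4  # the paper uses M = |P| = 32; 4 keeps the example readable
W_f = make_projection(N, M, init=1.0, rng=rng)  # f: L -> gamma (scale)
W_h = make_projection(N, M, init=0.0, rng=rng)  # h: L -> beta (shift)

L = one_hot(3, N)
gamma, beta = project(L, W_f), project(L, W_h)
P = [[0.1, -0.3, 0.7, 0.0], [0.2, 0.5, -0.1, 0.4]]
P_mod = film(P, gamma, beta)
extra_params = 2 * M * N  # weights of f plus weights of h
```

Because $L$ is one-hot, each locale effectively owns one row of each projection matrix, so the per-locale scale and shift can be read directly from the weights.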

4 Experimental setup

4.1 Train and Eval Datasets

Our dataset consists of 1.2 billion anonymized utterances from real-world queries collected in accordance with Google Privacy and AI principles [24, 25]. Utterances containing localized versions of the keyword phrases “Ok Google” or “Hey Google” are referred to as positive data. Negative utterances are collected from queries triggered by tactile (push-button) input and hence do not contain the keyword phrases.

Data is divided into training and evaluation sets as in [4]. The training set consists of 838M positive and 435M negative utterances. The training set has been supplemented with different transformations and noises producing 25 new replicas for each utterance [2, 26]. The evaluation set contains 6M positive utterances divided into 5M regular positive utterances and 1M challenging positive utterances based on SNR. The evaluation set also contains 5k hours of negative audio used to determine the operating point threshold needed for a given FA/hour level.

4.2 Metrics

We evaluate False Reject Rate (FRR), which is the number of positive utterances rejected by the model divided by the number of positive utterances. Similarly we measure False Acceptance Per Hour (FAh) which is the number of false detections divided by the number of negative audio hours. We measure FRR in positive sets with various noise conditions and FAh in negative sets at various threshold levels as mentioned in Section 4.1. We compare model performances by FRR values at a consistent FAh level (0.17/hour) and their FRR-FAh curve plots with the area of interest 0-0.5 FAh.
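The two metrics and the choice of operating point can be sketched as follows. The sketch assumes that each positive utterance and each candidate detection in the negative audio has a single scalar score, and that a detection fires when the score is at or above the threshold; the scoring convention, the threshold search, and the toy numbers are assumptions, not details given in the paper.

```python
def false_reject_rate(positive_scores, threshold):
    """Fraction of positive utterances whose score is below the threshold."""
    rejected = sum(1 for s in positive_scores if s < threshold)
    return rejected / len(positive_scores)


def false_accepts_per_hour(negative_scores, negative_hours, threshold):
    """False detections (score >= threshold) per hour of negative audio."""
    accepted = sum(1 for s in negative_scores if s >= threshold)
    return accepted / negative_hours


def threshold_for_target_fah(negative_scores, negative_hours, target_fah):
    """Lowest observed score usable as a threshold without exceeding target_fah."""
    for t in sorted(set(negative_scores)):
        if false_accepts_per_hour(negative_scores, negative_hours, t) <= target_fah:
            return t
    # No observed score satisfies the target: use a threshold that never fires.
    return float("inf")


# Toy data: 6 candidate detections in 10 hours of negative audio.
negative_scores = [0.2, 0.4, 0.55, 0.6, 0.8, 0.9]
positive_scores = [0.95, 0.85, 0.7, 0.9, 0.3]

threshold = threshold_for_target_fah(negative_scores, 10.0, target_fah=0.2)
frr = false_reject_rate(positive_scores, threshold)
fah = false_accepts_per_hour(negative_scores, 10.0, threshold)
```

Fixing the threshold from the negative set first and then reading FRR off the positive set mirrors the comparison at a constant FAh level described above.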

4.3 Training Details

• Locales: The locales in our experiments include: DA-DK (Danish), DE-DE (German), ES-ES (Spanish), FR-FR (French), IT-IT (Italian), KO-KR (Korean), NL-NL (Dutch), PT-BR (Brazilian Portuguese), SV-SE (Swedish), TH-TH (Thai).

• Target keywords: Localized versions of ‘OK Google’ and ‘Hey Google’.

• Train environment: TensorFlow/Lingvo [27] is used.

• Input feature and labels: Stacked 40d spectral energy is used as the input feature $x$. Target labels $y$ are derived from force-alignment as described in [2]. We merge similarly sounding phonemes over different languages for encoder labels.

• Train loss/steps: Cross entropy is used as loss for encoder and decoder following [2]. We train for 3M to 8M steps until loss converges to a stable level.

• Model size: We experiment with various model sizes: Regular ($M_{R}$) with 330K parameters [2], Large ($M_{L}$) with 1.4M parameters, and XLarge ($M_{XL}$) with 2.4M parameters.

• Dimension of encoder logits: $|P|=32$

5 Results

As discussed above, the baseline models include locale specific models, each independently trained on its own locale data, and a fully shared universal model trained with mixed locale data. The experimental models integrate the locale encoding into the network through the ${\tt Concat}$ and ${\tt FiLM}$ approaches described in Section 3.3. FRRs are reported at the threshold determined by the constant FAh mentioned in Section 4.2.

Table 1 shows FRR results for the above models in both regular and challenging acoustic condition evaluation sets. We also plot FRR-FAh curves for selected locales in Fig 3. Notably, some locales have poor locale specific models due to the low volume or poor label quality of their data. This is particularly true of smaller locales such as DA_DK and SV_SE, which have an order of magnitude less training data than large locales such as ES_ES and TH_TH. We observe improved metrics on these underrepresented locales with the universal model because the paucity of training data is compensated for by data from other locales, leading to less over-fitting.

For many other locales, the universal model has poorer results than the locale specific model, most likely because less relevant training data from other locales can harm specialization when there is sufficient data to get reasonable locale-specific performance.

In both acoustic conditions, locale-conditioned models achieved significantly better results than locale-specific models and the universal model. Like the universal model, the locale encoding models benefit from the larger training data volume produced by cross-locale mixing. Unlike the universal model, however, the locale encoding network is trained to selectively focus on data relevant to the locale provided as input.

We find that ${\tt FiLM}$ outperforms ${\tt Concat}$ on average in every setting of Tables 1 and 2, though not in every individual locale (for example NL_NL), and conjecture that this is due to improved efficiency in learning locale similarities. In the concatenative approach, the locale encoding network is trained with a combined input consisting of encoder logits and locale encoding, which may interfere undesirably with forward propagation. In ${\tt FiLM}$, the locale encoding network takes only the locale encoding as input, and the learned weights can be used to compare locale similarities by calculating a correlation matrix.
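One way to read the correlation-matrix remark is sketched below: with a one-hot locale input, each locale corresponds to one row of a ${\tt FiLM}$ projection matrix, and the Pearson correlation between rows gives a locale-by-locale similarity score. Correlating rows in this way is an assumption about what the comparison looks like, and the weight values are made up for illustration; they are not learned weights from the paper.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def correlation_matrix(rows):
    """Pairwise correlation between per-locale weight rows."""
    return [[pearson(r, s) for s in rows] for r in rows]


# Hypothetical per-locale scale rows (one row per locale, made-up values).
gamma_rows = [
    [1.2, 0.8, 1.1, 0.9],  # locale A
    [1.3, 0.7, 1.2, 0.8],  # locale B: pattern similar to A
    [0.8, 1.2, 0.9, 1.1],  # locale C: pattern opposite to A
]
corr = correlation_matrix(gamma_rows)
```

In this toy case locales A and B come out strongly positively correlated while A and C are anti-correlated, which is the kind of structure such a matrix would expose.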

The large model $M_{L}$ has greater learning capacity, so the locale specific models and the universal model show improved results in Table 2. The locale encoding approaches further improve the results in both acoustic conditions. We also experimented with an even bigger model, $M_{XL}$, but its results were in general slightly worse than those of $M_{L}$, possibly due to over-fitting.

6 Conclusion

This paper introduced two new approaches for training multilingual KWS models: locale conditioned universal models with concatenation and with ${\tt FiLM}$ modulation. Experiments show that both approaches significantly outperform locale-specific models and the fully shared universal model across 10 different language datasets in various acoustic conditions. Experiments with larger model sizes also show consistent improvements from the proposed approaches. The ${\tt FiLM}$ approach achieves the best results, which we attribute to its efficient way of learning cross-locale similarities. The results in Table 1 show that the ${\tt FiLM}$ based approach reduced average FRR by as much as 61% (relative) compared to locale specific models. The idea can be extended to other cross-domain scenarios to utilize data efficiently when training.
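The 61% figure can be checked directly from the AVERAGE columns of Tables 1 and 2. The short Python sketch below computes the relative FRR reduction of ${\tt FiLM}$ over the locale specific models for each model size and condition; the dictionary layout and names are illustrative, while the numbers are copied from the tables.

```python
def relative_reduction(baseline, proposed):
    """Relative reduction in percent: 100 * (baseline - proposed) / baseline."""
    return 100.0 * (baseline - proposed) / baseline


# Average FRR (%) from the AVERAGE columns of Tables 1 and 2.
average_frr = {
    ("M_R", "Eval-reg"): {"specific": 9.21, "universal": 8.26, "concat": 5.10, "film": 3.60},
    ("M_R", "Eval-chall"): {"specific": 34.26, "universal": 40.49, "concat": 24.21, "film": 17.21},
    ("M_L", "Eval-reg"): {"specific": 6.65, "universal": 5.28, "concat": 3.49, "film": 2.83},
    ("M_L", "Eval-chall"): {"specific": 28.90, "universal": 27.41, "concat": 16.04, "film": 14.24},
}

film_vs_specific = {
    setting: relative_reduction(frr["specific"], frr["film"])
    for setting, frr in average_frr.items()
}

for setting, reduction in film_vs_specific.items():
    print(setting, f"{reduction:.1f}%")
```

The largest reduction, about 61%, comes from the regular condition with the 330K-parameter model (9.21% to 3.60%); the other three settings give reductions of roughly 50% to 57%.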

Bibliography

[1] T. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), 2015, pp. 1478–1482.
[2] Raziel Alvarez and Hyun Jin Park, “End-to-end Streaming Keyword Spotting,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6336–6340, 2019.
[3] Hyun Jin Park, Patrick Violette, and Niranjan Subrahmanya, “Learning to detect keyword parts and whole by smoothed max pooling,” ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7899–7903, 2020.
[4] Hyun Jin Park, Pai Zhu, Ignacio Lopez Moreno, and Niranjan Subrahmanya, “Noisy student-teacher training for robust keyword spotting,” Interspeech 2021, pp. 331–335, 2021.
[5] S. Panchapagesan, M. Sun, A. Khare, S. Matsoukas, A. Mandal, B. Hoffmeister, and S. Vitaladevuni, “Multi-task learning and weighted cross-entropy for DNN-based keyword spotting,” in Interspeech, 2016.
[6] Siddharth Sigtia, John Bridle, Hywel Richards, Pascal Clark, Erik Marchi, and Vineet Garg, “Progressive voice trigger detection: Accuracy vs latency,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6843–6847.
[7] Siri Team, “Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant,” https://machinelearning.apple.com/2017/10/01/hey-siri.html, 2017, Accessed: 2018-10-06.
[8] Geng-Shen Fu, Thibaud Senechal, Aaron Challenner, and Tao Zhang, “Unified speculation, detection, and verification keyword spotting,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 7557–7561.